HumanCellAtlas / secondary-analysis

Secondary Analysis Service of the Human Cell Atlas Data Coordination Platform
https://pipelines.data.humancellatlas.org/ui/
BSD 3-Clause "New" or "Revised" License

Optimus should output gene ids, not gene names #619

Closed by mckinsel 5 years ago

mckinsel commented 5 years ago

In the Optimus zarr output, the gene_id array contains gene names. The other pipelines use gene ids, which is what the matrix service is expecting. Also, gene ids are unique, whereas I think the gene names may not be.

The use of names rather than ids seems to come from the GE tag created by the TagReadWithGeneExon tool from dropseqtools.
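
To illustrate the uniqueness concern, here is a minimal sketch that collects, for each gene name in a GTF annotation, the set of gene ids that carry it. It assumes GENCODE-style attributes (`gene_id "..."; gene_name "...";`) on `gene` records. The helper names `names_to_ids` and `ambiguous_names` are hypothetical and not part of Optimus or dropseqtools, and the ids in the sample are made up.

```python
import re
from collections import defaultdict

# Matches quoted GTF attributes such as: gene_id "X"; gene_name "Y";
_ATTR = re.compile(r'(\w+) "([^"]*)"')


def names_to_ids(gtf_lines):
    """Map each gene_name to the set of gene_id values seen on 'gene' records."""
    mapping = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only look at gene-level records
        attrs = dict(_ATTR.findall(fields[8]))
        if "gene_id" in attrs and "gene_name" in attrs:
            mapping[attrs["gene_name"]].add(attrs["gene_id"])
    return mapping


def ambiguous_names(mapping):
    """Return only the gene names that map to more than one gene id."""
    return {name: sorted(ids) for name, ids in mapping.items() if len(ids) > 1}


if __name__ == "__main__":
    # Made-up records for illustration only; not real annotation data.
    sample = [
        'chr1\tX\tgene\t1\t10\t.\t+\t.\tgene_id "FAKE0001.1"; gene_name "GENEX";',
        'chrY\tX\tgene\t1\t10\t.\t+\t.\tgene_id "FAKE0002.1"; gene_name "GENEX";',
        'chr2\tX\tgene\t1\t10\t.\t+\t.\tgene_id "FAKE0003.2"; gene_name "GENEY";',
    ]
    print(ambiguous_names(names_to_ids(sample)))
```

Any name reported by `ambiguous_names` would collapse two distinct genes into one row if names were used as the matrix key, which is why keying on gene ids is safer.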

brianraymor commented 5 years ago

@kishorikonwar - calling your attention to this issue that's blocking a Q2 epic.

barkasn commented 5 years ago

@mckinsel Thanks for bringing this to our attention! I understand that we also need to make sure that the Ensembl ids we output are the versioned type (i.e. they end in .1, .2, ...). cc @kbergin

mckinsel commented 5 years ago

@barkasn Yeah, it's probably best if you have the versions in there, though tbh the matrix service currently ignores them, on the assumption that the GENCODE versions are kept in sync.
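
For readers unfamiliar with versioned ids, a rough sketch of what ignoring the version could look like: the suffix after the first `.` is dropped before comparing. The function names `strip_version` and `same_gene` are hypothetical, and this is not the matrix service's actual code. The version number in the usage example is illustrative only.

```python
def strip_version(gene_id):
    """Drop the version suffix from a versioned Ensembl id, if present."""
    base, _, _ = gene_id.partition(".")
    return base


def same_gene(id_a, id_b):
    """Compare two gene ids while ignoring their version suffixes."""
    return strip_version(id_a) == strip_version(id_b)


if __name__ == "__main__":
    print(strip_version("ENSG00000141510.16"))
    print(same_gene("ENSG00000141510.16", "ENSG00000141510"))
```

Comparing this way is only safe under the assumption stated above, namely that all pipelines use the same GENCODE release.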

kbergin commented 5 years ago

Thanks @mckinsel! Will see that this gets prioritized. Does this block the matrix service from handling Optimus outputs?

mckinsel commented 5 years ago

Yes, it's currently blocking our loading of Optimus bundles.

kbergin commented 5 years ago

Update on progress: @kishorikonwar is working on a PR to update Optimus to output gene ids in addition to gene names. Infrastructure is aware of the incoming update and will be prepared to create a new integration test bundle for both human and mouse when the update is released. Kishori will put the PR here soon and we will get it reviewed on Monday.

cc @jkaneria @brianraymor