AlexsLemonade / OpenScPCA-analysis

An open, collaborative project to analyze data from the Single-cell Pediatric Cancer Atlas (ScPCA) Portal

Docs request: FAQ about gene id conversion #837

Open jashapiro opened 1 month ago

jashapiro commented 1 month ago

What is the documentation improvement or update you wish to see?

Many previous single-cell analyses use gene symbols as their primary identifiers, while we use the more stable Ensembl IDs. We should provide clear instructions on how to convert between the two. We should be sure to note that identifiers should be converted before an object is converted to Seurat (if that conversion is required), and explain how to deal with duplicate gene symbols.

Is there any additional context you would like to provide?

We provide the gene symbols within our SCE objects, but we do not do anything to deal with duplicate gene symbols. We should carefully investigate what conversion we want to recommend: do we take the values from the "first" gene id, the gene id with the highest expression, or perhaps sum expression across the duplicated set?
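To make the three options concrete, here is a minimal sketch in plain Python. It is only an illustration, not a recommendation or an existing OpenScPCA function: `collapse_duplicate_symbols`, the placeholder identifiers, and the toy counts are all hypothetical, and "highest expression" is assumed to mean the row with the largest total count across cells. Genes with no symbol fall back to their Ensembl ID, which is also an assumption.

```python
from collections import defaultdict


def collapse_duplicate_symbols(ensembl_ids, symbols, counts, strategy="sum"):
    """Collapse rows of a genes x cells count matrix that share a gene symbol.

    ensembl_ids: list of row identifiers (one per gene)
    symbols:     list of gene symbols, "" or None where no symbol exists
    counts:      list of rows, each a list of per-cell counts
    strategy:    "first" (keep the first row seen),
                 "max"   (keep the row with the largest total count),
                 "sum"   (add the rows together cell by cell)
    Returns a dict mapping each output label to its row, in first-seen order.
    """
    if strategy not in {"first", "max", "sum"}:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Group rows by symbol; genes without a symbol keep their Ensembl ID.
    groups = defaultdict(list)
    for ensembl_id, symbol, row in zip(ensembl_ids, symbols, counts):
        label = symbol if symbol else ensembl_id
        groups[label].append(row)

    collapsed = {}
    for label, rows in groups.items():
        if strategy == "first":
            collapsed[label] = list(rows[0])
        elif strategy == "max":
            # Ties resolve to the earliest row, because max() keeps the first maximum.
            collapsed[label] = list(max(rows, key=sum))
        else:
            collapsed[label] = [sum(cell_values) for cell_values in zip(*rows)]
    return collapsed


# Toy data: placeholder IDs, not real Ensembl identifiers.
ensembl_ids = ["ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"]
symbols = ["GENE1", "GENE2", "GENE1", ""]  # GENE1 is duplicated, ENSG_D has no symbol
counts = [
    [1, 0, 2],
    [0, 5, 0],
    [4, 1, 0],
    [0, 0, 3],
]

by_first = collapse_duplicate_symbols(ensembl_ids, symbols, counts, "first")
by_max = collapse_duplicate_symbols(ensembl_ids, symbols, counts, "max")
by_sum = collapse_duplicate_symbols(ensembl_ids, symbols, counts, "sum")
```

With this toy input the three strategies give different rows for `GENE1`, which is the reason the choice needs to be documented explicitly. The same grouping logic would apply to a real SCE or AnnData object, but the object-specific code is left out here.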

Properly investigating this question might require looking at the CellRanger default references (https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2024-A.tar.gz) to determine which gene annotations were used there, which should allow us to make the best recommendations.
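One way to start that investigation is to list the gene names that map to more than one gene ID in an annotation file. The sketch below assumes the reference ships a standard GTF whose `gene` records carry `gene_id` and `gene_name` attributes. That layout has not been checked against the archive linked above, and `duplicated_gene_names` and the example records are hypothetical.

```python
import re
from collections import defaultdict

# GTF attributes look like: gene_id "X"; gene_name "Y";
GENE_ID = re.compile(r'gene_id "([^"]+)"')
GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def duplicated_gene_names(gtf_lines):
    """Return {gene_name: sorted gene_ids} for names shared by more than one gene ID."""
    ids_by_name = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header and comment lines
        fields = line.rstrip("\n").split("\t")
        # A GTF record has 9 tab-separated fields; field 3 is the feature type.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        gene_id = GENE_ID.search(fields[8])
        gene_name = GENE_NAME.search(fields[8])
        if gene_id and gene_name:
            ids_by_name[gene_name.group(1)].add(gene_id.group(1))
    return {
        name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1
    }


# Made-up records for illustration; these are not taken from any real annotation.
example_lines = [
    "#example header",
    'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "ENSG_A"; gene_name "GENE1";',
    'chr1\tsrc\ttranscript\t1\t100\t.\t+\t.\tgene_id "ENSG_A"; gene_name "GENE1";',
    'chrX\tsrc\tgene\t5\t50\t.\t-\t.\tgene_id "ENSG_C"; gene_name "GENE1";',
    'chr2\tsrc\tgene\t1\t10\t.\t+\t.\tgene_id "ENSG_B"; gene_name "GENE2";',
]

duplicates = duplicated_gene_names(example_lines)
```

For a real file, the same function could be given an open file handle instead of a list, since it only iterates over lines.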

If this request is for an existing documentation page, please provide the link here.

https://openscpca.readthedocs.io/en/latest/troubleshooting-faq/faq/

jashapiro commented 1 month ago

This is kind of a duplicate of https://github.com/AlexsLemonade/scpca-docs/issues/346, but maybe we start here?