chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License
34 stars 22 forks

Addition of Ensembl transcript ID as valid feature variable #702

Open jychien opened 7 months ago

jychien commented 7 months ago

I recently attended HCA Asia, and an attendee requested that Ensembl transcript IDs be accepted as feature variables for the count matrix, especially for Smart-Seq data. He noted that the reason for running these labor-intensive full-length assays is to have the splice variant counts captured by Ensembl transcript IDs, and that this is especially significant in disease data. The information loss is unfortunate, and he hopes that we can keep the transcript expression.

During the conference, Shyam also noted that with 10x 5' assays, you are able to obtain additional splice information. He didn't go into detail, but I can try to follow up if there is interest. Here is additional information from the 10x website (although I don't think this is what Shyam was referring to):

BAevermann commented 7 months ago

I was also having conversations about this with grantees at the annual meeting. My questions are:

- Are we getting enough Smart-Seq data to worry about this?
- Does the 5' data give us enough transcript signal to warrant support?

In addition, there is renewed interest in long-read single-cell applications. A number of groups are piloting new PacBio or Nanopore kits in this area. If the data looks good and it is not prohibitively expensive, this could be a big growth area in late 2024 or early 2025.

jychien commented 5 months ago

Great questions.

In general, full-length assays are information-rich and can help distinguish things such as membrane-bound vs. soluble isoforms of CTLA4. It would be great to get more of this data into the data corpus at the transcript/splice variant granularity.