BrooksLabUCSC / flair

Full-Length Alternative Isoform analysis of RNA

Extremely high cell barcode associations #355

Open bayleaflet opened 4 months ago

bayleaflet commented 4 months ago

I have read the paper (https://doi.org/10.1038/s41467-020-15171-6)
and the manual (https://flair.readthedocs.io/en/latest/) and I still have a question about

Thank you for creating this easy-to-use module! I am currently analyzing scRNA-seq long reads whose read names encode a cell barcode, a UMI, and a read ID. For example:

AACTGGTCAATGGTCT_ATGCCGAGGG#dbee6b69-f357-431e-a5ae-8a6133becf51_+1of1-0_1:14000

The first portion identifies the cell; the second portion is the UMI, which is associated with the transcript; and the final portion contains the read_id and the chromosome start.
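To make the layout concrete, here is a minimal sketch of how such a read name could be split. It assumes the delimiters visible in the example above (an underscore between barcode and UMI, a `#` before the read ID, and an underscore after the read ID); the `ReadName` type and `parse_read_name` function are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class ReadName(NamedTuple):
    cell_barcode: str
    umi: str
    read_id: str
    suffix: str  # everything after the read ID, e.g. "+1of1-0_1:14000"


def parse_read_name(name: str) -> ReadName:
    """Split '<barcode>_<umi>#<read_id>_<suffix>' into its parts.

    The delimiters are inferred from the single example read name,
    so treat this layout as an assumption.
    """
    prefix, sep, rest = name.partition("#")
    if not sep:
        raise ValueError(f"no '#' separator in read name: {name!r}")
    cell_barcode, sep, umi = prefix.partition("_")
    if not sep:
        raise ValueError(f"no '_' between barcode and UMI: {name!r}")
    # The read ID in the example is a UUID, which contains no underscores.
    read_id, _, suffix = rest.partition("_")
    return ReadName(cell_barcode, umi, read_id, suffix)


example = (
    "AACTGGTCAATGGTCT_ATGCCGAGGG"
    "#dbee6b69-f357-431e-a5ae-8a6133becf51_+1of1-0_1:14000"
)
parsed = parse_read_name(example)
print(parsed.cell_barcode, parsed.umi, parsed.read_id)
```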

My goal is to load the data into a matrix, create a Seurat object from it, and generate plots of isoform expression changes per cell type. I have successfully completed this with two other pipelines (FLAMES and a custom one).

Initially, I concluded that the file I should use to create the matrix is flair.collapse.combined.isoform.read.map.txt, as it contains the ENST_ENSG identifiers and all associated cells per transcript. However, when I create the matrix, I end up with an extremely high number of cells per transcript ID/gene ID combination. I verified the counts in this matrix to make sure my script is accurate, and it is, so I believe I am misunderstanding something fundamental.
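For context, this kind of tabulation can be sketched as follows (an illustration, not my exact script). It assumes each line of the read map is an isoform ID, a tab, and a comma-separated list of read names, and that read names follow the barcode_UMI#read_id layout from the example above. Counting distinct UMIs per isoform and cell is my own choice here, not something the file itself provides; the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple

Key = Tuple[str, str]  # (isoform ID, cell barcode)


def count_umis_per_cell(lines: Iterable[str]) -> Dict[Key, int]:
    """Count distinct UMIs for each (isoform, cell barcode) pair.

    Assumed line layout: '<isoform>\\t<read1>,<read2>,...'
    Assumed read name layout: '<barcode>_<umi>#<rest>'
    """
    umis: Dict[Key, Set[str]] = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        isoform, _, reads = line.partition("\t")
        for read in reads.split(","):
            prefix = read.partition("#")[0]
            barcode, sep, umi = prefix.partition("_")
            if sep and barcode and umi:  # skip names that do not match the layout
                umis[(isoform, barcode)].add(umi)
    return {key: len(values) for key, values in umis.items()}


def cells_per_isoform(counts: Dict[Key, int]) -> Counter:
    """Number of distinct cell barcodes observed for each isoform."""
    return Counter(isoform for isoform, _barcode in counts)


# Tiny made-up input to show the shape of the result.
demo = [
    "ENST1_ENSG1\tAAAA_CC#r1_x,AAAA_CC#r2_x,AAAA_GG#r3_x,TTTT_CC#r4_x\n",
    "iso2_ENSG1\tTTTT_AA#r5_x\n",
]
demo_counts = count_umis_per_cell(demo)
print(demo_counts)
print(cells_per_isoform(demo_counts))
```

Running `cells_per_isoform` on each of the read map files would give the cells-per-isoform distribution that looks unexpectedly high for the combined file.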

Screenshot 2024-07-31 at 10 59 37 AM

When I filter out records that contain a gene ID in flair.collapse.isoform.read.map.txt, I get a much sparser matrix.

Screenshot 2024-07-31 at 11 07 34 AM

Essentially, I am wondering what the difference is between flair.collapse.combined.isoform.read.map.txt, flair.collapse.isoform.read.map.txt, and flair.collapse.annotated_transcripts.isoform.read.map.txt in a biological context. I do understand the annotation differences between these files, but I am uncertain how the counts could be so high. Any insight you could provide would be extremely helpful. Thank you!