Open joycekang opened 3 years ago
Hi Joyce,
Yes, it is necessary to add a transcript_id to each of the "exon" lines in the GTF. These transcript IDs need to be unique for every sequence, unless the sequences belong to the same transcript (i.e. they are spliced together). For Solo counting, each read is checked for concordance with the transcripts, so counting is not intended to work if transcript_id is not specified.

In your scenario, some reads will multimap to different alleles of the same gene. If you label the alleles with the same gene_id (but different transcript_ids), all reads will be counted towards this gene, as unique-gene reads. If you label the alleles with different gene_ids, the reads will be considered "multi-gene", and only the EM options will count them, but they will end up distributed among the different gene_ids.

There is an issue that may be related, #1398: in both cases the .gtf files are modified to contain "overlapping" exons. I am checking to see if there is a bug affecting the results.
Cheers,
Alex
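For anyone patching a custom GTF by hand, here is a minimal Python sketch of the transcript_id step described above. It is not part of STAR; the function name `add_transcript_ids` and the `<contig>_tx<N>` naming scheme are made up for illustration. It assumes every custom "exon" line without a transcript_id is its own unspliced transcript, so each one gets a unique ID. Exons that are spliced together would need to share an ID instead.

```python
def add_transcript_ids(gtf_lines):
    """Return GTF lines where every 'exon' line lacking a transcript_id gets a unique one.

    Assumption: each such exon is an independent, unspliced transcript.
    """
    out = []
    counter = 0
    for line in gtf_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        # Pass comments and lines without the 9 GTF columns through unchanged.
        if line.startswith("#") or len(fields) != 9:
            out.append(line)
            continue
        feature, attrs = fields[2], fields[8].strip()
        if feature == "exon" and "transcript_id" not in attrs:
            counter += 1
            # Hypothetical naming scheme: contig name plus a running counter.
            tid = f"{fields[0]}_tx{counter}"
            if attrs and not attrs.endswith(";"):
                attrs += ";"
            fields[8] = f'{attrs} transcript_id "{tid}";'.strip()
        out.append("\t".join(fields))
    return out


if __name__ == "__main__":
    example = [
        'alleleA\tcustom\texon\t1\t500\t.\t+\t.\tgene_id "G";',
        'alleleB\tcustom\texon\t1\t480\t.\t+\t.\tgene_id "G";',
    ]
    for patched in add_transcript_ids(example):
        print(patched)
```

Lines that already carry a transcript_id are left untouched, so the function can be run over a full annotation that mixes standard and custom entries.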
Hi Alex,
Thanks for your help a few weeks ago. I'm actually working with Amber on this and have a follow-up question. As a refresher, our setup is that we add custom sequences (different alleles of the same gene) as extra chromosomes to our reference .fa file and add custom entries to our .gtf file corresponding to those sequences, labeling them as 'exon' and supplying a 'gene_id'. We run with -EM to allow multimapping. If we label the two entries as gene_id "G1" and gene_id "G2", we get different results (different final gene quantifications) than if we label them as gene_id "G" and gene_id "G". Technically, these are the same gene (and we want to be able to count them as 1 gene in the final genesXcells matrix), but they are two different alleles of the gene and are therefore added as two separate chromosomes for alignment. Do you recommend naming them with unique or the same gene_ids? Thanks so much!

Originally posted by @joycekang in https://github.com/alexdobin/STAR/issues/1362#issuecomment-961513257