Using the same or different 'gene_id' when supplying different alleles of the same gene as extra chromosomes

Hi Alex,

Thanks for your help a few weeks ago. I'm working with Amber on this and have a follow-up question. As a refresher, our setup is that we add custom sequences (different alleles of the same gene) as extra chromosomes to our reference .fa file and add custom entries to our .gtf file corresponding to those sequences, labeling them as 'exon' and supplying a 'gene_id'. We run with the EM multimapper option (--soloMultiMappers EM) so that multimapping reads are counted.

We currently do not supply a 'transcript_id' and see that we get UMI counts going to these genes, so we're not sure what the 'transcript_id' is used for (should we include it for some reason as well?).
We notice that we get different results if we label the extra chromosomes with unique gene_id names or the same name for different alleles of the same gene. For example, say we add 2 sequences corresponding to two alleles of gene G. Then, if we label those as gene_id "G1" and gene_id "G2", we get different results (different final gene quantifications) than if we label them as gene_id "G" and gene_id "G". Technically, these are the same gene (and we want to be able to count them as 1 gene in the final genesXcells matrix), but they are two different alleles of the gene and therefore added as two separate chromosomes for alignment. Do you recommend naming them with unique or the same gene_ids? Thanks so much!

Originally posted by @joycekang in https://github.com/alexdobin/STAR/issues/1362#issuecomment-961513257

Hi Joyce,

Yes, it is necessary to add a transcript_id to each of the "exon" lines in the GTF. These transcript IDs need to be unique for all sequences unless they belong to the same transcript (i.e. spliced together). For Solo counting, each read is checked for concordance with the transcripts, so it's not intended to work if transcript_id is not specified.

In your scenario, some reads will multimap to different alleles of the same gene. If you label the alleles with the same gene_id (but different transcript_id), all of these reads will be counted towards this gene, as unique-gene reads. If you label the alleles with different gene_ids, these reads will be considered "multi-gene", and only the EM options will count them, but they will end up distributed among the different gene_ids.

There is an issue that may be related, #1398, as in both cases the .gtf files are modified with "overlapping" exons. I am checking to see if there is a bug affecting the results.
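To make the labeling concrete, here is a minimal sketch of how the GTF "exon" lines for two allele chromosomes could be generated with a shared gene_id and a unique transcript_id each. The chromosome names, lengths, and the "custom" source label are hypothetical placeholders, and the sketch assumes each allele is a single exon spanning its whole extra chromosome on the + strand.

```python
def exon_line(chrom, length, gene_id, transcript_id, source="custom", strand="+"):
    """Build one 9-column GTF 'exon' line covering an entire extra chromosome.

    GTF coordinates are 1-based and inclusive, so the exon runs from 1 to length.
    """
    attributes = f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'
    fields = [chrom, source, "exon", "1", str(length), ".", strand, ".", attributes]
    return "\t".join(fields)


# Hypothetical alleles: (chromosome name used in the .fa file, sequence length)
alleles = [("G_allele1", 1200), ("G_allele2", 1185)]

# Same gene_id for both alleles, but a unique transcript_id per allele
# (here the chromosome name is simply reused as the transcript_id).
lines = [
    exon_line(chrom, length, gene_id="G", transcript_id=chrom)
    for chrom, length in alleles
]

print("\n".join(lines))
```

The resulting lines would be appended to the existing .gtf before regenerating the genome index; to test the alternative labeling, only the gene_id argument changes.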

Cheers,
Alex
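The difference between the two labeling schemes can be illustrated with a toy model. This is only a sketch of the reasoning in the reply above, not STAR's actual implementation: the function name and the chromosome-to-gene mappings are hypothetical, and a read is simply classified by how many distinct gene_ids its alignments touch.

```python
def classify_read(aligned_chroms, chrom_to_gene):
    """Classify a read by the set of gene_ids its alignments fall in (toy model)."""
    genes = {chrom_to_gene[chrom] for chrom in aligned_chroms}
    label = "unique-gene" if len(genes) == 1 else "multi-gene"
    return label, sorted(genes)


# A read that multimaps to both allele chromosomes (hypothetical names).
read_hits = ["G_allele1", "G_allele2"]

# Scheme 1: both alleles share one gene_id.
same_gene_id = {"G_allele1": "G", "G_allele2": "G"}
# Scheme 2: each allele gets its own gene_id.
different_gene_ids = {"G_allele1": "G1", "G_allele2": "G2"}

print(classify_read(read_hits, same_gene_id))        # ('unique-gene', ['G'])
print(classify_read(read_hits, different_gene_ids))  # ('multi-gene', ['G1', 'G2'])
```

Under the first scheme the read counts directly towards G; under the second it is only counted when a multi-gene option such as EM is enabled, and its count is then split between G1 and G2.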

alexdobin/STAR issue #1399