Open sydjo07 opened 7 months ago
Hi, thanks for raising this. This is a known issue with how StringTie assigns gene_name as a unique identifier. We plan to look into it and hopefully fix it by the next release. See:
Disreprency in counts between MSTRG genes and nonMSTRG genes · Issue #206 · gpertea/stringtie
Hi Sarah, thanks for your help! I didn't realize this was a known issue but thanks for pointing me in the right direction.
Sorry for the delay. This is still on our radar, and we will hopefully have an improvement soon.
Thanks, I appreciate it! I've been able to work around this a bit because I noticed that the unfiltered_tpm_transcript_counts.tsv and the unfiltered_transcript_counts_with_genes.tsv files contain both the proper annotation and their associated MSTRG annotations. I've been able to merge the annotations with the results_dge.tsv to get the proper gene name associations in most cases, although it's not perfect and I know I miss some.
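For anyone attempting the same workaround, here is a minimal sketch of that merge using only the Python standard library. It assumes both files are tab-separated and share a `gene_id` column holding the MSTRG identifier, and that the counts file carries the annotated name in a `gene_name` column. The counts-file column names and the helper functions are hypothetical, so check the headers of your own files and adjust the arguments.

```python
import csv
from collections import defaultdict


def read_tsv(path):
    """Read a tab-separated file into a list of dicts keyed by header."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def build_name_lookup(count_rows, id_col="gene_id", name_col="gene_name"):
    """Map each MSTRG gene_id to the set of annotated names seen for it.

    Column names are assumptions; rows whose name is empty, "NA", or
    itself an MSTRG identifier are skipped.
    """
    lookup = defaultdict(set)
    for row in count_rows:
        gene_id = row.get(id_col, "")
        name = row.get(name_col, "")
        if gene_id and name and name != "NA" and not name.startswith("MSTRG"):
            lookup[gene_id].add(name)
    return lookup


def fill_gene_names(dge_rows, lookup, id_col="gene_id", name_col="gene_name"):
    """Return copies of the DGE rows with missing gene names filled in.

    When one MSTRG id maps to several annotated names, they are joined
    with ";" so the ambiguity stays visible instead of being hidden.
    """
    filled = []
    for row in dge_rows:
        row = dict(row)  # copy so the input rows are left untouched
        names = lookup.get(row.get(id_col, ""), set())
        if row.get(name_col, "NA") in ("", "NA") and names:
            row[name_col] = ";".join(sorted(names))
        filled.append(row)
    return filled
```

Usage would look like `fill_gene_names(read_tsv("results_dge.tsv"), build_name_lookup(read_tsv("unfiltered_transcript_counts_with_genes.tsv")))`. Rows that still show NA afterwards are the ones with no annotated name in the counts file, which matches the "not perfect" caveat above.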
Also, to clarify, does the unfiltered_transcript_counts_with_genes.tsv file contain the raw counts before filtering and normalization? If so, I should be able to use this file as input to edgeR and generate my own DEG list, since it contains the MSTRG to feature_id associations?
Hi, correct, it is before filtering and normalisation, so you could use it with edgeR. We are still working on this; we haven't got round to it yet but will do soon!
I have a question further to this issue; perhaps either of you can help? I notice that there seem to be three types of genes in the de_analysis output:
Are my assumptions correct or am I missing something? I am not sure the third classification makes sense, but I am also not sure how else it could come about. Thanks
Yes, I think your assumptions are correct. We will aim to make this clearer in the documentation in future.
Ask away!
This is my first time running this workflow, and my DGE analysis TSV files (for example results_dge.tsv) aren't incorporating the gene names in the gene_name column. Instead, the files display NA in the gene_name column and MSTRG in the gene_id column. In the output HTML file, the gene names display correctly in the differential gene analysis table. Is there a reason the gene names are omitted from the TSV output files but not from the HTML output?
For reference, I am working with a non-conventional yeast strain, so I used a non-publicly available reference genome and annotation files. However, I also tested with genome/annotation files from NCBI for a different strain of the same organism and found the same issue. When running the test dataset, the gene names displayed correctly in the TSV files.