Output DGE analysis files omit gene names

sydjo07 commented 7 months ago

Ask away!

This is my first time running this workflow and my DGE analysis tsv files (for example results_dge.tsv) aren't incorporating the gene names in the gene_name column. Instead, the files display NA in the gene_name column and MSTRG in the gene_id column. In the output html file, the gene names are displaying correctly under the differential gene analysis table. Is there a reason that it's omitting this from the tsv output files but not the html output?

For reference, I am working with a non-conventional yeast strain, so I used reference genome and annotation files that are not publicly available. However, I also tested with genome/annotation files from NCBI for a different strain of the same organism and found the same issue. When running the test dataset, the gene names displayed correctly in the tsv files.

sarahjeeeze commented 7 months ago

Hi, thanks for raising this. This is a known issue with how StringTie assigns gene_name as a unique identifier. We plan to look into and fix it, hopefully by the next release. See:

prepDE.py pulls around 50% MSTRG as gene_id from Stringtie_merge RNA-Seq · Issue #179 · gpertea/stringtie

Disreprency in counts between MSTRG genes and nonMSTRG genes · Issue #206 · gpertea/stringtie

sydjo07 commented 7 months ago

Hi Sarah, thanks for your help! I didn't realize this was a known issue but thanks for pointing me in the right direction.

sarahjeeeze commented 6 months ago

Sorry for the delay. This is still on our radar, and we will hopefully have an improvement soon.

sydjo07 commented 6 months ago

Thanks, I appreciate it! I've been able to work around this a bit because I noticed that the unfiltered_tpm_transcript_counts.tsv and the unfiltered_transcript_counts_with_genes.tsv files contain both the proper annotation and their associated MSTRG annotations. I've been able to merge the annotations with the results_dge.tsv to get the proper gene name associations in most cases, although it's not perfect and I know I miss some.
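The merge described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the workflow's own code. The column names `gene_id` and `gene_name`, the helper names, and the toy rows are all assumptions made for the example; the real column headers in unfiltered_transcript_counts_with_genes.tsv and results_dge.tsv may differ and should be checked first.

```python
import csv
import io
from collections import defaultdict

MISSING = ("", "NA")


def read_tsv(text):
    """Parse tab-separated text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def build_gene_lookup(counts_rows, id_col="gene_id", name_col="gene_name"):
    """Map each MSTRG gene_id to the annotated name(s) seen for it.

    Column names are assumptions; adjust them to the real file headers.
    """
    names = defaultdict(set)
    for row in counts_rows:
        gene_id, name = row[id_col], row[name_col]
        if name not in MISSING and not name.startswith("MSTRG"):
            names[gene_id].add(name)
    # Keep every distinct name so ambiguous loci stay visible.
    return {gene_id: ";".join(sorted(found)) for gene_id, found in names.items()}


def fill_gene_names(dge_rows, lookup, id_col="gene_id", name_col="gene_name"):
    """Return copies of the DGE rows with missing gene names filled in."""
    filled = []
    for row in dge_rows:
        row = dict(row)  # do not modify the caller's rows
        if row[name_col] in MISSING:
            row[name_col] = lookup.get(row[id_col], "NA")
        filled.append(row)
    return filled


# Toy data standing in for the two workflow outputs (hypothetical values).
counts = read_tsv(
    "gene_id\tgene_name\tsample1\n"
    "MSTRG.1\tADH1\t10\n"
    "MSTRG.1\tADH1\t4\n"
    "MSTRG.2\tNA\t7\n"
    "MSTRG.3\tPDC1\t3\n"
    "MSTRG.3\tPDC5\t2\n"
)
dge = read_tsv(
    "gene_id\tgene_name\tlogFC\n"
    "MSTRG.1\tNA\t1.5\n"
    "MSTRG.2\tNA\t-0.8\n"
    "MSTRG.3\tNA\t2.1\n"
)

for row in fill_gene_names(dge, build_gene_lookup(counts)):
    print(row["gene_id"], row["gene_name"])
```

In this sketch an MSTRG ID that is associated with more than one annotated name gets all of them joined with `;` rather than one picked silently, which may help explain why such a merge recovers most but not all names.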

Also to clarify, does the unfiltered_transcript_counts_with_genes.tsv file contain the raw counts before filtering and normalization? If so, then I should be able to use this file as input to EdgeR and generate my own DEG list since it contains the MSTRG to feature_id associations?

sarahjeeeze commented 5 months ago

Hi, correct, it is before filtering and normalisation, so you could use it with EdgeR. We are still working on this; we haven't got round to it yet but will do soon!

kfletcherelo commented 3 months ago

I have a question further to this issue, perhaps either of you can help? I notice that there seem to be three types of genes in the de_analysis output:

1. gene_id = gene_name & gene_id ~ /MSTRG/
2. gene_id ~ /MSTRG/ & gene_name = NULL
3. gene_id & gene_name match entries in provided annotation

My assumption is that:

1. was assembled by stringtie and not present in the annotation - sequence can be extracted from final_non_redundant_transcriptome.fasta using the stringtie ID
2. was assembled by stringtie and was present in the annotation - sequence cannot be extracted from final_non_redundant_transcriptome.fasta using the stringtie ID, instead ID should be found in unfiltered_transcript_counts_with_genes.tsv
3. was not assembled by stringtie but had reads mapping to it for DGE so GFF id was used?

Are my assumptions correct or am I missing something? I am not sure the third classification makes sense, but I am also not sure how else it could come about. Thanks
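For anyone wanting to count how many rows fall into each of the three types listed above, a small sketch follows. It is only an illustration of the patterns as described in this comment: the function name `classify_gene` is hypothetical, and it assumes that missing names appear as an empty string, `NA`, or `NULL`, and that type 3 rows have a gene_id that does not start with `MSTRG`.

```python
def classify_gene(gene_id, gene_name):
    """Return 1, 2 or 3 for the three types described above, or 0 if none fit.

    Assumptions: missing names are "", "NA", "NULL" or None, and
    annotation-derived IDs (type 3) never start with "MSTRG".
    """
    is_mstrg = gene_id.startswith("MSTRG")
    has_name = gene_name not in ("", "NA", "NULL", None)
    if is_mstrg and has_name and gene_name == gene_id:
        return 1  # gene_id = gene_name and both are MSTRG IDs
    if is_mstrg and not has_name:
        return 2  # MSTRG gene_id with no gene_name
    if not is_mstrg and has_name:
        return 3  # gene_id and gene_name taken from the annotation
    return 0  # anything else, e.g. an MSTRG ID paired with an annotated name


# Hypothetical example rows as (gene_id, gene_name) pairs.
rows = [
    ("MSTRG.5", "MSTRG.5"),
    ("MSTRG.7", "NA"),
    ("gene-ABC1", "ABC1"),
    ("MSTRG.9", "ABC2"),
]
counts = {}
for gene_id, gene_name in rows:
    kind = classify_gene(gene_id, gene_name)
    counts[kind] = counts.get(kind, 0) + 1
print(counts)
```

Rows that land in category 0 are worth inspecting by hand, since they do not match any of the three patterns.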

sarahjeeeze commented 3 weeks ago

Yes, I think your assumptions are correct. We will aim to make this clearer in the documentation in future.

epi2me-labs / wf-transcriptomes

Output DGE analysis files omit gene names #86
