Closed Nicolai-vKuegelgen closed 2 years ago
Hi Nicolai,
The issue is not about how to aggregate transcripts count to the gene-level - maybe the two approaches are using different algorithms for aggregation, which I have to read about further. The issue here is rather the discrepancy between the annotation of transcript ids between the cDNA files and the GTF files of the ensembl database (even when using the same version of ensembl). Whether using tximport or salmon's -g option, the user has to provide the correct mapping of transcript ids and gene ids to make sure the mapping of gene ids (thus the count aggregation) works. I will follow this up to see if there is a universal solution to this id mapping issue and will update the pipeline accordingly.
The short term solution I find here is to provide a cDNA file without the version numbers appended to the transcript ids. Then it will all work out.
Something like this would work for cDNA files from ENSEMBL:
sed 's/\.[0-9]*.*//g' <input_cdna_file> > cdna_file.corrected.fasta
Thanks for the feedback!
I've used the following command now (to make sure only transcript & gene Ids are changed):
sed 's/\(ENSMUS[TG][0-9]*\)\.[0-9]*/\1/g' Mus_musculus.GRCm38.cdna.all.fa > Mus_musculus.GRCm38.cdna.all_cleaned.fa
Annotation files are now checked before running the alignments/indexing. For the mismatches between the cDNA fasta and GTF transcript ids, something similar to the above sed
command is suggested to the user in case the user is using Ensembl annotations.
https://github.com/BIMSBbioinfo/pigx_rnaseq/issues/56#issuecomment-1083152394
The DEseq results based on salmon (either gene or transcript) counts don't contain any gene names (the respective column has only NA values). This seems to be caused by a difference in the gtf and cdna.fa files from Ensembel: the gtf files only include the {transcript_id}, while the fasta files include {transcript_id.version}.
The problem already results in a warning from Salmon, when it collapses the transcript based counts back to genes:
[2018-04-18 18:32:39.399] [jointLog] [warning] couldn't find transcript named [ENSMUST00000103685.2] in transcript <-> gene map; returning transcript as it's own gene
It also seems that the developers of salmon recommend to use tximport (which they also develop) for gene level counts over the '-g' option. I have tried to figure out what the difference between those two ways of collapsing counts to genes actually is (if there is any), but I couldn't find anything ... Also, it seems that using tximport seems to take care of the issue with gene_name mapping (link1, link2).