BIMSBbioinfo / pigx_rnaseq

Bulk RNA-seq Data Processing, Quality Control, and Downstream Analysis Pipeline
GNU General Public License v3.0
20 stars 11 forks source link

Annotation of salmon based counts #35

Closed Nicolai-vKuegelgen closed 2 years ago

Nicolai-vKuegelgen commented 6 years ago

The DEseq results based on salmon (either gene or transcript) counts don't contain any gene names (the respective column has only NA values). This seems to be caused by a difference in the gtf and cdna.fa files from Ensembel: the gtf files only include the {transcript_id}, while the fasta files include {transcript_id.version}.

The problem already results in a warning from Salmon, when it collapses the transcript based counts back to genes:

[2018-04-18 18:32:39.399] [jointLog] [warning] couldn't find transcript named [ENSMUST00000103685.2] in transcript <-> gene map; returning transcript as it's own gene

It also seems that the developers of salmon recommend to use tximport (which they also develop) for gene level counts over the '-g' option. I have tried to figure out what the difference between those two ways of collapsing counts to genes actually is (if there is any), but I couldn't find anything ... Also, it seems that using tximport seems to take care of the issue with gene_name mapping (link1, link2).

borauyar commented 6 years ago

Hi Nicolai,

The issue is not about how to aggregate transcripts count to the gene-level - maybe the two approaches are using different algorithms for aggregation, which I have to read about further. The issue here is rather the discrepancy between the annotation of transcript ids between the cDNA files and the GTF files of the ensembl database (even when using the same version of ensembl). Whether using tximport or salmon's -g option, the user has to provide the correct mapping of transcript ids and gene ids to make sure the mapping of gene ids (thus the count aggregation) works. I will follow this up to see if there is a universal solution to this id mapping issue and will update the pipeline accordingly.

The short term solution I find here is to provide a cDNA file without the version numbers appended to the transcript ids. Then it will all work out.

Something like this would work for cDNA files from ENSEMBL:

sed 's/\.[0-9]*.*//g' <input_cdna_file>  > cdna_file.corrected.fasta

Thanks for the feedback!

borauyar commented 6 years ago

Annotation of salmon based counts

Nicolai-vKuegelgen commented 6 years ago

I've used the following command now (to make sure only transcript & gene Ids are changed):

sed 's/\(ENSMUS[TG][0-9]*\)\.[0-9]*/\1/g' Mus_musculus.GRCm38.cdna.all.fa > Mus_musculus.GRCm38.cdna.all_cleaned.fa

borauyar commented 2 years ago

56

borauyar commented 2 years ago

Annotation files are now checked before running the alignments/indexing. For the mismatches between the cDNA fasta and GTF transcript ids, something similar to the above sed command is suggested to the user in case the user is using Ensembl annotations. https://github.com/BIMSBbioinfo/pigx_rnaseq/issues/56#issuecomment-1083152394