COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
771 stars 161 forks source link

salmon gene-map does not work properly with ensembl GTF-s #598

Open antonkulaga opened 3 years ago

antonkulaga commented 3 years ago

I noticed that for some reason salmon does not convert transcript ids to gene ids. Here is the command that I run:

salmon --no-version-check quant -i /cromwell-executions/quant_sample/04b46951-bd0b-4dd2-adf8-c16c57a86695/call-quant_run/shard-0/quant_run/e5d5f98a-bcee-4249-ba3b-bf6b63571f5d/call-salmon/inputs/446511232/GRCh38.p13_ensembl_101 --geneMap /cromwell-executions/quant_sample/04b46951-bd0b-4dd2-adf8-c16c57a86695/call-quant_run/shard-0/quant_run/e5d5f98a-bcee-4249-ba3b-bf6b63571f5d/call-salmon/inputs/-2062771642/Homo_sapiens.GRCh38.101.gtf --numBootstraps 128 --threads 3 -l A --seqBias --gcBias --validateMappings --writeUnmappedNames -o quant_GSE113798_GSE113801_GSM3119577_SRR7075087 \
-1 /cromwell-executions/quant_sample/04b46951-bd0b-4dd2-adf8-c16c57a86695/call-quant_run/shard-0/quant_run/e5d5f98a-bcee-4249-ba3b-bf6b63571f5d/call-salmon/inputs/-11815/SRR7075087_1.fastq_cleaned.fastq.gz -2 /cromwell-executions/quant_sample/04b46951-bd0b-4dd2-adf8-c16c57a86695/call-quant_run/shard-0/quant_run/e5d5f98a-bcee-4249-ba3b-bf6b63571f5d/call-salmon/inputs/-11815/SRR7075087_2.fastq_cleaned.fastq.gz

I also enclose the results. Even though I use Ensembl Human GTF ( i.e. ftp://ftp.ensembl.org/pub/release-101/gtf/homo_sapiens/Homo_sapiens.GRCh38.101.gtf.gz ) it gives me transcript ids in quant.genes.sf output quant.zip Downloads.tar.gz

ACastanza commented 3 years ago

I looked into this, since I'm working on a pipeline that could encounter a similar issue, the problem is that in the ensembl .fq files the transcript IDs are represented as the standard ID.version format, e.g. ENST00000415118.1 but in the GTF, they're represented as separate fields:

gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";

This is different than the Gencode GTFs where there is not a separate version field and the transcript ids match exactly between the GTF and Fq without parsing any additional fields:

gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_name "4933401J01Rik-201"; level 2; transcript_support_level "NA"; mgi_id "MGI:1918292"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";

This is also going to affect things like the feature request I made here: https://github.com/COMBINE-lab/salmon/issues/595

So it seems that any attempt at GTF parsing is going to need to check for those atypical gene_version and transcript_version fields and attempt to tack on that information to the base ID.

ACastanza commented 3 years ago

The following bash code will detect and parse either format of gtf into an appropriately versioned two column tx2gene file

test=$(zless -S $gtf | grep -v "#" | awk '$3=="transcript"' | head -n 1| cut -f9 | tr -s ";" " " | awk '{print$3}' | sort | uniq | sed 's/\"//g') if [[ $test == "transcript_id" ]]; then zless -S $gtf | grep -v "#" | awk '$3=="transcript"' | cut -f9 | tr -s ";" " " | awk '{print$4"\t"$2}' | sort | uniq | sed 's/\"//g' > txp2gene.tsv elif [[ $test == "gene_version" ]]; then echo "Separate version field (ensembl, non-gencode transcriptome, eg. rat, etc)" zless -S $gtf | grep -v "#" | awk '$3=="transcript"' | cut -f9 | tr -s ";" " " | awk '{print$6 "." $8"\t"$2 "." $4}' | sort | uniq | sed 's/\"//g' > txp2gene.tsv fi

vanealizo commented 2 years ago

hello! I am having a similar issues and was wondering if you had an answer for this. My salmon quant.gene.sf file have the ID.version format, and when trying to do DEG using BiomaRt the gene ID are not in the ID.version format, thus I am unable to match them. I did notice that the index used for salmon was GRCm38 but the one currently available in BiomaRt is GRCm39. Do you know if i could cut the .version from my quan.gene.sf file, or I have to run salmon again with the current genome? thank you.