Open antonkulaga opened 3 years ago
I looked into this, since I'm working on a pipeline that could encounter a similar issue, the problem is that in the ensembl .fq files the transcript IDs are represented as the standard ID.version format, e.g. ENST00000415118.1 but in the GTF, they're represented as separate fields:
gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
This is different than the Gencode GTFs where there is not a separate version field and the transcript ids match exactly between the GTF and Fq without parsing any additional fields:
gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1"; gene_type "TEC"; gene_name "4933401J01Rik"; transcript_type "TEC"; transcript_name "4933401J01Rik-201"; level 2; transcript_support_level "NA"; mgi_id "MGI:1918292"; tag "basic"; havana_gene "OTTMUSG00000049935.1"; havana_transcript "OTTMUST00000127109.1";
This is also going to affect things like the feature request I made here: https://github.com/COMBINE-lab/salmon/issues/595
So it seems that any attempt at GTF parsing is going to need to check for those atypical gene_version and transcript_version fields and attempt to tack on that information to the base ID.
The following bash code will detect and parse either format of gtf into an appropriately versioned two column tx2gene file
test=$(zless -S $gtf | grep -v "#" | awk '$3=="transcript"' | head -n 1| cut -f9 | tr -s ";" " " | awk '{print$3}' | sort | uniq | sed 's/\"//g') if [[ $test == "transcript_id" ]]; then zless -S $gtf | grep -v "#" | awk '$3=="transcript"' | cut -f9 | tr -s ";" " " | awk '{print$4"\t"$2}' | sort | uniq | sed 's/\"//g' > txp2gene.tsv elif [[ $test == "gene_version" ]]; then echo "Separate version field (ensembl, non-gencode transcriptome, eg. rat, etc)" zless -S $gtf | grep -v "#" | awk '$3=="transcript"' | cut -f9 | tr -s ";" " " | awk '{print$6 "." $8"\t"$2 "." $4}' | sort | uniq | sed 's/\"//g' > txp2gene.tsv fi
hello! I am having a similar issues and was wondering if you had an answer for this. My salmon quant.gene.sf file have the ID.version format, and when trying to do DEG using BiomaRt the gene ID are not in the ID.version format, thus I am unable to match them. I did notice that the index used for salmon was GRCm38 but the one currently available in BiomaRt is GRCm39. Do you know if i could cut the .version from my quan.gene.sf file, or I have to run salmon again with the current genome? thank you.
I noticed that for some reason salmon does not convert transcript ids to gene ids. Here is the command that I run:
I also enclose the results. Even though I use Ensembl Human GTF ( i.e. ftp://ftp.ensembl.org/pub/release-101/gtf/homo_sapiens/Homo_sapiens.GRCh38.101.gtf.gz ) it gives me transcript ids in quant.genes.sf output quant.zip Downloads.tar.gz