COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

GTF based salmon index #569

Closed parvathisudha closed 3 years ago

parvathisudha commented 4 years ago

Hi Salmon team,

Thanks for the tool! I am trying to create a salmon index for GRCh38 using GENCODE. When I ran the quantification, even though I supplied a GTF file, quant.genes.sf shows only transcript IDs, not gene IDs. Can you please tell me how to solve this issue? Is there a way to create a GTF-based salmon index for GRCh38?

Thanks, Parvathi.

yeodynasty commented 3 years ago

I have a similar problem. Attached are:

  1. the GTF file, in which gene_id and transcript_id are clearly provided
  2. the quant files for genes and transcripts
  3. my command, as follows:

/gpfsdata/apps/salmon-latest_linux_x86_64/bin/salmon quant \
  -i /gpfshome/hockchuan/SALMON/GCF_900626175.2_cs10_index \
  -l ISR \
  -1 /gpfsdata/JangiLab/hockchuan/170302/2.Trimmomatic_output/clean_HEADBANDSTEM_1.fastq.gz \
  -2 /gpfsdata/JangiLab/hockchuan/170302/2.Trimmomatic_output/clean_HEADBANDSTEM_2.fastq.gz \
  --seqBias \
  --gcBias \
  --posBias \
  --incompatPrior 0.0 \
  --geneMap /gpfsdata/JangiLab/hockchuan/cs10_reference_genome/GCF_900626175.2_cs10_genomic.gtf \
  --recoverOrphans \
  --allowDovetail \
  --threads $NSLOTS \
  --dumpEq \
  --minScoreFraction 0.65 \
  --writeMappings /gpfshome/hockchuan/SALMON/MAP/HEADBANDSTEM \
  --fldMean 250 \
  --fldSD 25 \
  --writeOrphanLinks \
  --writeUnmappedNames \
  --quiet \
  -o /gpfshome/hockchuan/SALMON/HEADBANDSTEM_quant

fewLines.gtf.txt quant.genes.txt quant.txt

yeodynasty commented 3 years ago

Any idea what went wrong?

rob-p commented 3 years ago

Is there any output to the terminal when salmon is running that would suggest it couldn't interpret the GTF properly? Can you share the salmon log file?
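One check that may help alongside the log, sketched here as an assumption and not something proposed in this thread: compare the `Name` column of quant.sf with the `transcript_id` values in the GTF. If the names do not match, no transcript can be assigned to a gene. The function names below are hypothetical, and the sketch assumes a tab-separated GTF with quoted attribute values.

```python
import csv
import re

# Matches the transcript_id attribute in the 9th GTF column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_transcript_ids(gtf_lines):
    """Collect every transcript_id value found in the GTF attribute column."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def unmatched_names(quant_lines, gtf_lines):
    """Return quant.sf transcript names that never appear as a transcript_id."""
    known = gtf_transcript_ids(gtf_lines)
    reader = csv.DictReader(quant_lines, delimiter="\t")
    return [row["Name"] for row in reader if row["Name"] not in known]
```

Both arguments can be open file handles, for example `unmatched_names(open("quant.sf"), open("annotation.gtf"))`, where both file names are placeholders. An empty result means every quantified transcript has a matching GTF entry.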

parvathisudha commented 3 years ago

> Any idea what went wrong?

I had that issue while using Salmon version 0.9.1. Once I upgraded the salmon version, I could index the genome properly and I obtained gene-level TPM for the samples.

yeodynasty commented 3 years ago

This is the initial output log, where it reports an incorrect gene annotation:


Version Info: This is the most recent version of salmon.

| Loading contig table | Time = 13.512 s

size = 16145665

| Loading contig offsets | Time = 382.03 ms


| Loading reference lengths | Time = 9.4861 ms


| Loading mphf table | Time = 2.4236 s

size = 1057188904 Number of ones: 16145664 Number of ones per inventory item: 512 Inventory entries filled: 31535

| Loading contig boundaries | Time = 4.031 s

size = 1057188904

| Loading sequence | Time = 1.983 s

size = 572818984

| Loading positions | Time = 14.658 s

size = 942318702

| Loading reference sequence | Time = 1.4932 s


| Loading reference accumulative lengths | Time = 10.959 ms

Error: invalid feature coordinates (end<start!) at line: NC_029855.1 RefSeq gene 406748 107842 . + . gene_id "A5N79_gp28"; db_xref "GeneID:27215502"; exception "trans-splicing"; gbkey "Gene"; gene "nad2"; gene_biotype "protein_coding"; locus_tag "A5N79_gp28";


After I removed the erroneous entry, salmon no longer complained:


Version Info: This is the most recent version of salmon.

| Loading contig table | Time = 14.648 s

size = 16145665

| Loading contig offsets | Time = 336.77 ms


| Loading reference lengths | Time = 10.195 ms


| Loading mphf table | Time = 2.3113 s

size = 1057188904 Number of ones: 16145664 Number of ones per inventory item: 512 Inventory entries filled: 31535

| Loading contig boundaries | Time = 4.881 s

size = 1057188904

| Loading sequence | Time = 1.7554 s

size = 572818984

| Loading positions | Time = 13.626 s

size = 942318702

| Loading reference sequence | Time = 1.5082 s


| Loading reference accumulative lengths | Time = 12.272 ms


However, the *.sf files are the same as before, i.e. there are still no gene-level results.
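The cleanup step described above (dropping the record whose end coordinate is smaller than its start) could be sketched as follows. This is a minimal illustration, not the procedure actually used in this thread. The function name is hypothetical, and the sketch assumes a tab-separated GTF with the start and end coordinates in columns 4 and 5.

```python
def split_invalid_coordinates(gtf_lines):
    """Separate GTF records with end < start from all other lines.

    Comment lines, short lines, and lines with non-numeric coordinates
    are left untouched in the kept list.
    """
    kept, rejected = [], []
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9:
            kept.append(line)
            continue
        try:
            start, end = int(fields[3]), int(fields[4])
        except ValueError:
            kept.append(line)
            continue
        if end < start:
            rejected.append(line)  # e.g. the trans-spliced gene in the log
        else:
            kept.append(line)
    return kept, rejected
```

Writing `kept` to a new file and reviewing `rejected` by hand keeps the change auditable. Note that this only removes the offending lines themselves; child features of a rejected gene stay in the file if their own coordinates are valid.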

yeodynasty commented 3 years ago

Interestingly, the same GTF file can be used to obtain gene-level counts with HTSeq-count in OmicsBox, suggesting that the file itself is fine.

yeodynasty commented 3 years ago

I have used tximport to complete the job instead.

rob-p commented 3 years ago

Thanks for letting us know. Tximport is the recommended way to accomplish this.
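For readers who want to see what gene-level summarization amounts to, here is a rough Python sketch. It is not tximport (an R/Bioconductor package) and does not reproduce its length handling or count scaling; it only sums the TPM and NumReads columns of quant.sf per gene. The function names are hypothetical, and treating an unmapped transcript as its own gene is an assumption of this sketch.

```python
import csv
import re
from collections import defaultdict

# Matches key "value" pairs in the 9th GTF column.
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def transcript_to_gene(gtf_lines):
    """Build a transcript_id -> gene_id map from GTF records."""
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        if "transcript_id" in attrs and "gene_id" in attrs:
            mapping[attrs["transcript_id"]] = attrs["gene_id"]
    return mapping


def gene_level_totals(quant_lines, tx2gene):
    """Sum TPM and NumReads over all transcripts of each gene."""
    totals = defaultdict(lambda: {"TPM": 0.0, "NumReads": 0.0})
    for row in csv.DictReader(quant_lines, delimiter="\t"):
        # Assumption: a transcript missing from the map is kept under its own name.
        gene = tx2gene.get(row["Name"], row["Name"])
        totals[gene]["TPM"] += float(row["TPM"])
        totals[gene]["NumReads"] += float(row["NumReads"])
    return dict(totals)
```

If the result is keyed almost entirely by transcript names, the map did not match the quantified transcripts, which is the same symptom reported in this issue.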