Indexing failed, decoy sequences not matching

COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment

https://combine-lab.github.io/salmon

GNU General Public License v3.0


Indexing failed, decoy sequences not matching #788

Open GiwaAO opened 2 years ago

GiwaAO commented 2 years ago

Hello,

I tried to create a salmon index for Bos taurus but was not successful. I created the decoy file using:

grep "^>" <(gunzip -c Bos_taurus.ARS-UCD1.2.dna.toplevel.fa.gz) | cut -d " " -f 1 > decoys.txt
sed -i.bak -e 's/>//g' decoys.txt

When I try to index using either

salmon index -t bos_taurus_gentrome.fa.gz -d decoys.txt -p 12 -i salmon_index --gencode

or

salmon index -t Bos_taurus.ARS-UCD1.2.cdna.all.fa.gz -i bos_taurus_107_index --decoys decoys.txt -k 31

I get an error. The last two lines of the log file are:

[puff::index::jointLog] [critical] The decoy file contained the names of 2211 decoy sequences, but 0 were matched by sequences in the reference file provided. To prevent unintentional errors downstream, please ensure that the decoy file exactly matches with the fasta file that is being indexed.
[puff::index::jointLog] [error] The fixFasta phase failed with exit code 1

What is happening and how can I solve this issue?

tamuanand commented 2 years ago

@GiwaAO

Did you concatenate the transcriptome file and the genome file (it has to be in this order) to create the gentrome file before running salmon index?

https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/

Along with the list of decoys salmon also needs the concatenated transcriptome and genome reference file for index.

NOTE: the genome targets (decoys) should come after the transcriptome targets in the reference

cat gencode.vM23.transcripts.fa.gz GRCm38.primary_assembly.genome.fa.gz > gentrome.fa.gz
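
Before running salmon index, the decoy list can be compared against the headers of the combined file to see roughly how many names would match. This is a minimal sketch, not salmon's own check: it assumes a sequence name is the header text after ">" up to the first whitespace, and count_matched_decoys is a hypothetical helper name.

```shell
# Hypothetical helper (not part of salmon): count how many names in a decoy
# list also occur as sequence names in a gzipped FASTA file.
# Assumption: a sequence name is the header text after ">" up to the first whitespace.
# Usage: count_matched_decoys decoys.txt gentrome.fa.gz
count_matched_decoys() {
  gunzip -c "$2" | awk -v decoys="$1" '
    # Load every line of the decoy list into a lookup table.
    BEGIN { while ((getline line < decoys) > 0) want[line] = 1 }
    # For each FASTA header, drop the leading ">" and test the name.
    /^>/ {
      name = substr($1, 2)
      if ((name in want) && !(name in seen)) { seen[name] = 1; n++ }
    }
    END { print n + 0 }'
}
```

A result of 0 for a non-empty decoy list suggests that either the genome was never appended to the file being indexed or the names in the list are not spelled the way they appear in the headers (for example, a leftover ">").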

jwg054000 commented 1 year ago

I'm having a similar issue even though I have concatenated the transcriptome file and the genome file. I also tried following the tutorial here (https://combine-lab.github.io/alevin-tutorial/2020/alevin-velocity/), but I get the same error as GiwaAO.

rob-p commented 1 year ago

Hi @jwg054000,

For single-cell processing, you should ideally move to alevin-fry. You can find a velocity tutorial for alevin-fry here.

Best, Rob

taylorreiter commented 1 year ago

I'm getting the same error as reported above. Copying the code I ran below:

# download reference genome
curl -JLO https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
# extract chromosome names
grep "^>" <(gunzip -c GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz) | cut -d " " -f 1 > GCF_009914755.1_T2T-CHM13v2.0_genomic.txt
# download transcriptome
curl -JLO https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_rna.fna.gz
# combine transcriptome and genome, in that order
cat GCF_000001405.40_GRCh38.p14_rna.fna.gz GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz > human_seq.fa.gz
# give to salmon to index
salmon index -t human_seq.fa.gz -i salmon_index -d GCF_009914755.1_T2T-CHM13v2.0_genomic.txt
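
One thing worth ruling out in the commands above: the name list written by the grep and cut step keeps the leading ">" on every line, while the decoy command in the first post of this thread removes it with sed. That difference may account for zero names matching here. A sketch of the extraction with the ">" stripped, where decoy_names is a hypothetical helper name:

```shell
# Hypothetical helper: print one decoy name per line for a gzipped genome FASTA,
# i.e. each header up to the first space, without the leading ">".
# Usage: decoy_names genome.fna.gz > decoys.txt
decoy_names() {
  gunzip -c "$1" \
    | grep '^>' \
    | cut -d ' ' -f 1 \
    | sed 's/^>//'
}
```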

mjoh223 commented 1 year ago

There must be a newline ("\n") at the end of the first file; otherwise, when the two files are concatenated, the last sequence of the first file and the first header of the second run together and you get a malformed FASTA record at the seam.
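
A quick way to test for this and to join the files safely is sketched below. Both function names are hypothetical, and the sketch assumes both inputs are gzipped FASTA files.

```shell
# Hypothetical helper: succeed if the decompressed file ends with a newline.
# Command substitution strips trailing newlines, so a final "\n" leaves an
# empty string. (An empty file also passes this test.)
ends_with_newline() {
  [ -z "$(gunzip -c "$1" | tail -c 1)" ]
}

# Hypothetical helper: write the first file, then the second, to stdout as
# plain FASTA, adding a newline at the seam only if the first file lacks one.
# Usage: join_fasta_gz transcripts.fa.gz genome.fa.gz | gzip -c > gentrome.fa.gz
join_fasta_gz() {
  gunzip -c "$1"
  ends_with_newline "$1" || printf '\n'
  gunzip -c "$2"
}
```

Counting headers with grep -c '^>' on the joined output and comparing that to the sum of the two input counts is a simple sanity check that no header was swallowed at the seam.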