COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

Indexing failed, decoy sequences not matching #788

Open GiwaAO opened 2 years ago

GiwaAO commented 2 years ago

Hello,

I tried to create a salmon index for Bos taurus but was not successful. I created the decoy file using:

grep "^>" <(gunzip -c Bos_taurus.ARS-UCD1.2.dna.toplevel.fa.gz) | cut -d " " -f 1 > decoys.txt
sed -i.bak -e 's/>//g' decoys.txt

When I try to index using

salmon index -t bos_taurus_gentrome.fa.gz -d decoys.txt -p 12 -i salmon_index --gencode

or

salmon index -t Bos_taurus.ARS-UCD1.2.cdna.all.fa.gz -i bos_taurus_107_index --decoys decoys.txt -k 31

I get an error. The last two lines of the log file are

[puff::index::jointLog] [critical] The decoy file contained the names of 2211 decoy sequences, but 0 were matched by sequences in the reference file provided. To prevent unintentional errors downstream, please ensure that the decoy file exactly matches with the fasta file that is being indexed.
[puff::index::jointLog] [error] The fixFasta phase failed with exit code 1

What is happening, and how can I solve this issue?

tamuanand commented 2 years ago

@GiwaAO

Did you concatenate the transcriptome file and the genome file (in that order) to create the gentrome file before running salmon index?

Along with the list of decoys salmon also needs the concatenated transcriptome and genome reference file for index.

NOTE: the genome targets (decoys) should come after the transcriptome targets in the reference

cat gencode.vM23.transcripts.fa.gz GRCm38.primary_assembly.genome.fa.gz > gentrome.fa.gz
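
Before re-running the index, it can help to check how many names in the decoy list actually occur as header names in the concatenated file, since the error above reports 0 matches. The following is a minimal sketch, not part of salmon: `count_matched_decoys` is a hypothetical helper name, and it assumes a sequence name is the header text after ">" up to the first whitespace.

```shell
# Hypothetical helper (not a salmon command): count how many names in a decoy
# list also appear as FASTA header names in a gzipped reference.
# Assumption: a sequence name is the header text after ">" up to the first
# whitespace character.
count_matched_decoys() {
    ref_gz="$1"
    decoys="$2"
    ref_names=$(mktemp)
    decoy_names=$(mktemp)
    # Extract header names from the reference, without the leading ">".
    gunzip -c "$ref_gz" | awk '/^>/ { sub(/^>/, "", $1); print $1 }' | sort -u > "$ref_names"
    sort -u "$decoys" > "$decoy_names"
    # comm -12 keeps only the lines present in both sorted files.
    comm -12 "$ref_names" "$decoy_names" | wc -l | tr -d ' '
    rm -f "$ref_names" "$decoy_names"
}
```

For example, `count_matched_decoys gentrome.fa.gz decoys.txt` should print the same number as `wc -l < decoys.txt` when every decoy name is present in the gentrome; a result of 0 points to a naming mismatch or to a reference file that does not contain the genome sequences.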

jwg054000 commented 1 year ago

I'm having a similar issue and have concatenated the transcriptome file and genome file. I also tried following the tutorial here (https://combine-lab.github.io/alevin-tutorial/2020/alevin-velocity/), but same issue as GiwaAO.

rob-p commented 1 year ago

Hi @jwg054000,

For single-cell processing, you should ideally move to alevin-fry. You can find a velocity tutorial for alevin-fry here.

Best, Rob

taylorreiter commented 1 year ago

I'm getting the same error as reported above. Copying the code I ran below:

# download reference genome
curl -JLO https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
# extract chromosome names
grep "^>" <(gunzip -c GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz) | cut -d " " -f 1 > GCF_009914755.1_T2T-CHM13v2.0_genomic.txt
# download transcriptome
curl -JLO https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_rna.fna.gz
# combine transcriptome and genome, in that order
cat GCF_000001405.40_GRCh38.p14_rna.fna.gz GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz > human_seq.fa.gz
# give to salmon to index
salmon index -t human_seq.fa.gz -i salmon_index -d GCF_009914755.1_T2T-CHM13v2.0_genomic.txt
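
One difference from the first post may be worth checking: the `grep | cut` step above keeps the leading ">" on every name, whereas the original post strips it with `sed`. Assuming salmon expects bare sequence names in the decoy list (as in the first post's decoys.txt), a minimal sketch of producing such a list follows. `make_decoy_list` is a hypothetical helper name, not a salmon command.

```shell
# Hypothetical helper (not a salmon command): print the sequence names from a
# gzipped FASTA file, one per line, without the leading ">".
# Assumption: the name ends at the first space in the header line.
make_decoy_list() {
    gunzip -c "$1" | grep '^>' | cut -d ' ' -f 1 | sed -e 's/^>//'
}
```

Usage would look like `make_decoy_list GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz > decoys.txt`, with the resulting file passed to `salmon index -d`.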

mjoh223 commented 1 year ago

There must be a newline ("\n") at the end of the first file. Otherwise, concatenating the two files produces a malformed FASTA record at the seam, because the first header line of the second file is joined onto the last line of the first file.
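
One way to rule this out is to concatenate the decompressed streams and let `awk` terminate every line, instead of running `cat` on the .gz files directly. This is a minimal sketch: `concat_fasta_gz` is a hypothetical helper name, and it assumes `gunzip`, `awk` and `gzip` are available.

```shell
# Hypothetical helper: concatenate gzipped FASTA files in the given order,
# making sure each file's last line ends with a newline so that the next
# file's first header starts on its own line.
concat_fasta_gz() {
    out="$1"
    shift
    for f in "$@"; do
        # "awk 1" prints every input line followed by a newline, including a
        # final line that was missing one.
        gunzip -c "$f" | awk 1
    done | gzip > "$out"
}
```

For the files in this thread the call would be `concat_fasta_gz gentrome.fa.gz transcripts.fa.gz genome.fa.gz`, with the transcriptome first and the genome second; the file names here are placeholders.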