Different output formats report different numbers of HSPs

wsnoble commented 2 years ago

This is probably a dumb question, but I don't understand why when I switch output format from 6 (blast-style) to 102 (taxonomy), the number of pairwise alignments changes. I made a small sample FASTA and queried it against uniref100, and the number of alignments drops from 1525 to 52. Here are the two command lines:

diamond blastp --db ../2022-07-26metaproteomics/uniref100.db.dmnd --query sample.fasta --out outfmt6.diamond.txt --outfmt 6 qseqid stitle pident evalue mismatch --min-orf 1 --log
diamond blastp --db ../2022-07-26metaproteomics/uniref100.db.dmnd --query sample.fasta --out outfmt102.diamond.txt --outfmt 102 --min-orf 1 --log

Log files and sample fasta are attached. outfmt6.diamond.log.txt outfmt102.diamond.log.txt sample.fasta.txt

wsnoble commented 2 years ago

outfmt6.diamond.txt outfmt102.diamond.txt

In case it's helpful, here are the output files.

bbuchfink commented 2 years ago

The taxonomic format does not output alignments. It merges all alignments for one query into one classification using the LCA algorithm.

wsnoble commented 2 years ago

OK, that makes sense. So in my example, there are four sequences that have at least one associated alignment:

% awk '{print $1}' outfmt6.diamond.txt | sort -u 
scan111
scan139
scan178
scan205

But there are no taxonomic assignments at all. Why is that? I would have expected to see non-zero entries for those four sequences.

bbuchfink commented 2 years ago

It's possible that the targets they align against have no taxonomy assignment, you can check for that.

wsnoble commented 2 years ago

As far as I can tell from the annotations in uniref100, they do have taxonomy assignments. For example, here are the entries for the first sequence:

scan111 UniRef100_A0A2M7VQ01 RelA/SpoT family protein n=2 Tax=Bacteria TaxID=2 RepID=A0A2M7VQ01_9FLAO   92.6    2.78e-05    2
scan111 UniRef100_A0A1J4TFK1 RelA/SpoT family protein n=1 Tax=Flavobacteriaceae bacterium CG1_02_35_72 TaxID=1805158 RepID=A0A1J4TFK1_9FLAO 92.6    2.78e-05    2
scan111 UniRef100_A0A3R7T962 Bifunctional (P)ppGpp synthetase/guanosine-3',5'-bis(Diphosphate) 3'-pyrophosphohydrolase n=1 Tax=Flavobacteriaceae bacterium TMED81 TaxID=1986719 RepID=A0A3R7T962_9FLAO  88.9    5.20e-05    3
scan111 UniRef100_A0A2G6F384 RelA/SpoT family protein n=1 Tax=Flavobacteriia bacterium TaxID=2044941 RepID=A0A2G6F384_9BACT 88.9    7.12e-05    3
scan111 UniRef100_UPI001CD3543A RelA/SpoT family protein n=1 Tax=Lutimonas saemankumensis TaxID=483016 RepID=UPI001CD3543A  85.2    9.74e-05    4
scan111 UniRef100_A0A2E8SD77 RelA/SpoT family protein n=1 Tax=Flavobacteriaceae bacterium TaxID=1871037 RepID=A0A2E8SD77_9FLAO  85.2    9.74e-05    4
scan111 UniRef100_A0A2D6Y7W9 RelA/SpoT family protein n=1 Tax=Flavobacteriaceae bacterium TaxID=1871037 RepID=A0A2D6Y7W9_9FLAO  85.2    9.74e-05    4
scan111 UniRef100_A0A7R8X324 Hypothetical protein (Fragment) n=1 Tax=Cyprideis torosa TaxID=163714 RepID=A0A7R8X324_9CRUS   85.2    1.12e-04    4
scan111 UniRef100_A0A432IBK3 RelA/SpoT family protein n=1 Tax=Flavobacteriia bacterium TaxID=2044941 RepID=A0A432IBK3_9BACT 85.2    1.33e-04    4
scan111 UniRef100_A0A7C6IMR6 Bifunctional (P)ppGpp synthetase/guanosine-3',5'-bis(Diphosphate) 3'-pyrophosphohydrolase n=1 Tax=Bacteroidales bacterium TaxID=2030927 RepID=A0A7C6IMR6_9BACT 85.2    1.82e-04    4
scan111 UniRef100_A0A838ZTQ7 Bifunctional (P)ppGpp synthetase/guanosine-3',5'-bis(Diphosphate) 3'-pyrophosphohydrolase n=1 Tax=Moheibacter lacus TaxID=2745851 RepID=A0A838ZTQ7_9FLAO   85.2    1.82e-04    4
scan111 UniRef100_A0A3D6CIB6 RelA/SpoT family protein n=1 Tax=Polaribacter sp. TaxID=1920175 RepID=A0A3D6CIB6_9FLAO 88.9    1.82e-04    3
scan111 UniRef100_A0A2I2MBN6 MFS transporter n=2 Tax=Tenacibaculum finnmarkense TaxID=2781243 RepID=A0A2I2MBN6_9FLAO    85.2    2.49e-04    4
scan111 UniRef100_A0A2H1XY37 MFS transporter n=2 Tax=Tenacibaculum finnmarkense TaxID=2781243 RepID=A0A2H1XY37_9FLAO    85.2    2.49e-04    4
scan111 UniRef100_A0A2H1YCR3 MFS transporter n=2 Tax=Tenacibaculum finnmarkense TaxID=2781243 RepID=A0A2H1YCR3_9FLAO    85.2    2.49e-04    4
scan111 UniRef100_A0A2H1YH65 MFS transporter n=1 Tax=Tenacibaculum piscium TaxID=1458515 RepID=A0A2H1YH65_9FLAO 85.2    2.49e-04    4
scan111 UniRef100_UPI001F227220 RelA/SpoT family protein n=1 Tax=Tenacibaculum piscium TaxID=1458515 RepID=UPI001F227220    85.2    2.49e-04    4
scan111 UniRef100_UPI001EFAD267 RelA/SpoT family protein n=1 Tax=Tenacibaculum piscium TaxID=1458515 RepID=UPI001EFAD267    85.2    2.49e-04    4
scan111 UniRef100_A0A2H1XT05 MFS transporter n=1 Tax=Tenacibaculum dicentrarchi TaxID=669041 RepID=A0A2H1XT05_9FLAO 85.2    2.49e-04    4
scan111 UniRef100_A0A839AMJ4 Bifunctional (P)ppGpp synthetase/guanosine-3',5'-bis(Diphosphate) 3'-pyrophosphohydrolase n=1 Tax=Tenacibaculum pelagium TaxID=2759527 RepID=A0A839AMJ4_9FLAO  85.2    2.49e-04    4
scan111 UniRef100_A0A0U3K3R5 MFS transporter n=1 Tax=Tenacibaculum dicentrarchi TaxID=669041 RepID=A0A0U3K3R5_9FLAO 85.2    2.49e-04    4
scan111 UniRef100_A0A3Q8RRV4 Bifunctional (P)ppGpp synthetase/guanosine-3',5'-bis(Diphosphate) 3'-pyrophosphohydrolase n=2 Tax=Tenacibaculum TaxID=104267 RepID=A0A3Q8RRV4_9FLAO    85.2    2.49e-04    4
scan111 UniRef100_A0A2G1BXG4 RelA/SpoT family protein n=2 Tax=Tenacibaculum TaxID=104267 RepID=A0A2G1BXG4_9FLAO 85.2    2.49e-04    4
scan111 UniRef100_A0A1M5GLM6 GTP pyrophosphokinase n=1 Tax=Tenacibaculum mesophilum TaxID=104268 RepID=A0A1M5GLM6_9FLAO 85.2    2.49e-04    4
scan111 UniRef100_A0A4R6THM8 GTP pyrophosphokinase n=1 Tax=Tenacibaculum caenipelagi TaxID=1325435 RepID=A0A4R6THM8_9FLAO   85.2    2.49e-04    4

Shouldn't these be picked up by the NCBI taxonomy files?

bbuchfink commented 2 years ago

No, the NCBI files do not map Uniprot accessions. I'll have to look into providing taxonomy mapping for uniprot.

wsnoble commented 2 years ago

Is there a different database you recommend? I'd like to get something non-redundant that is as extensive as possible but only with sequences that have taxonomic assignments.

bbuchfink commented 2 years ago

The nr database would be the most obvious choice for an extensive database, and I think most sequences should have a taxonomic assignment. AnnoTree could also be an option, see here: https://journals.asm.org/doi/full/10.1128/msystems.01408-21

wsnoble commented 2 years ago

Thanks, I will try those. I agree with you that adding support for uniprot accessions would probably be good, but you don't need to prioritize it on my account.

wsnoble commented 2 years ago

OK, I downloaded the NR database from NCBI, but when I map a bunch of human peptides I still get only a tiny fraction that are assigned a taxonomic unit. Here is the command line for building the DB:

diamond makedb --in /net/noble/vol1/data/ncbi/pub/blast/db/FASTA/nr.gz --db nr.db --taxonmap /net/noble/vol1/data/ncbi/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz --taxonnodes /net/noble/vol1/data/ncbi/pub/taxonomy/nodes.dmp --taxonnames /net/noble/vol1/data/ncbi/pub/taxonomy/names.dmp --log

And here is the command line for the query:

diamond blastp --db nr.db --query human.fasta --out human.diamond.txt --outfmt 102 --log

The fasta file contains human peptide sequences, but only 8 out of 11,688 get matched. The input file, log files, and output file are attached. I must be doing something wrong, but I'm not sure what it is.

human.diamond.log.txt human.diamond.txt human.fasta.txt nr.db.log.txt

bbuchfink commented 2 years ago

Diamond was not designed for such short sequences. With some tuning of the parameters that may work but also be pretty slow if you use such a big database as the NR. Is BLAST too slow for you or what's your reason for using Diamond?

wsnoble commented 2 years ago

I was following the lead of this article, which used DIAMOND for taxonomic assignments for metaproteomics data: https://pubs.acs.org/doi/10.1021/acs.jproteome.2c00334

But yes, I suppose I could just use BLAST.

It occurs to me that maybe the problem is that I didn't unzip the nr.gz file before doing the makedb command.

bbuchfink commented 2 years ago

I was following the lead of this article, which used DIAMOND for taxonomic assignments for metaproteomics data: https://pubs.acs.org/doi/10.1021/acs.jproteome.2c00334

They used Diamond for this sort of data, but surely must have modified the parameters. Let me know in case it doesn't work with BLAST, and I'll take a look.

It occurs to me that maybe the problem is that I didn't unzip the nr.gz file before doing the makedb command.

This is no problem, Diamond can read gzipped files.

wsnoble commented 2 years ago

I had already contacted the authors of that JPR paper, and here is the command line they used:

diamond blastp -d uniref100 --min-score 1 -q kaiko.fasta -o kaiko.dmd -f 6 qseqid stitle pident evalue mismatch

They then post-processed the results to consider only exact matches. They told me that they fail to map many of the sequences at all. If you have any suggestions for how to nudge DIAMOND toward handling short sequences, I'd love to hear them.

FWIW, I tried this exact mapping tool, but to index NR (which is 254 G), you need a machine with >254 G of RAM.

wsnoble commented 2 years ago

Just circling back on this. Do you have any suggestions for parameter settings that would improve DIAMOND's handling of short sequences?

bbuchfink commented 2 years ago

These parameters seem to work ok for me for your data:

-c1 --ultra-sensitive -s1 --id2 1 --short-query-ungapped-bitscore 1 -e10000 --algo 0 --masking 0 --gapped-filter-evalue 0

To use all of them, you need to compile from source using the cmake option -DEXTRA=ON. The number of shapes -s and the evalue could also be increased for higher sensitivity.

wsnoble commented 2 years ago

Thanks! This is super helpful. Now it succeeds in matching 29% of the peptides.

bbuchfink / diamond

Different output formats report different numbers of HSPs #607