Open naailkhan28 opened 10 months ago
Thank you so much for this! It’s super interesting to see some alternatives to BLAST in action. We haven't experienced the major issues that you've had with BLAST, but the BLAST step is something we’re keeping our eye on. Additionally, because we’re trying to keep our analyses more focused at the protein family level, we generally like the depth we get with BLAST searches. So, I think for now we’ll leave BLAST as the default, but mmseqs2 does seem like a great alternative if BLAST isn’t working well for certain users or if users do want to get many more sequences covering a pretty broad sequence space, so we’ll leave this as an alternative option! And if we do decide to swap out the BLAST step in the future, we'll definitely be in touch. Thanks again, we really appreciate your feedback and input!
Description of feature
As discussed in #67 , the BLAST search step takes a very long time and is also prone to errors. I've experimented with replacing this with mmseqs2 which is much faster than BLAST and also more sensitive.
Code
See this commit for my proposed changes:
https://github.com/naailkhan28/ProteinCartography/commit/46a1a4493f163bb9875b6bdd846dd490ec0dd77d
I've added files
mmseqs2_utils.py
andrun_mmseqs2.py
which are analogous to theblast_utils.py
andrun_blast.py
scripts from before.run_mmseqs2.py
I adapted this script from the ColabFold repo which queries the mmseqs2 server.I've also provided an updated Snakefile which implements these new rules. There's a step to query the API, a step to process the resulting
.a3m
file, and write an output.csv
file with hits and a.txt
file with UniRef100 hits. Most of these can be supplied directly to the nextaggregate_lists
rule but some of them are UniParc accessions starting withUPIXXX
and so I've written a scriptmap_uniparc_ids.py
which will usebioservices
to map these to UniProt KB accessions where possible.Results
The biggest performance benefit of mmseqs2 is speed - requests to the API typically take under 20 seconds whereas BLAST searches often take 30-40 minutes or more.
But with this increased speed we also get many, many more sequence hits back. Here's a histogram of sequence identity of all the hits you get back against the Actin example, with mmseqs2 (with a max E-value of 1e-03), compared to the default setting of BLAST with max 3000 sequences:
Loads more sequences, and covering much more of sequence space too! Of course you can increase the number of BLAST hits, but this greatly increases the search time (and sometimes gives errors too). I tried running BLAST with 100,000 max sequences and it's still not given me a result after several hours!
mmseqs2 also recovers most of the BLAST hits anyways. Here's a venn diagram of the Actin sequence hits, and associated structure hits:
Only a few hundred hits are unique to BLAST, and as the previous graph shows mmseqs2 has far greater coverage of sequences with lower identity anyways.
Happy to chat more about this idea - my code is pretty basic for now and not polished at all, but I'd be open to working on this idea some more if you guys think it's worthwhile.