Dfam-consortium / RepeatModeler

De-Novo Repeat Discovery Tool

GPU support for RMBlast in order to speed up RepeatModeler #151

Open fgajardoe opened 2 years ago

fgajardoe commented 2 years ago

Hi, and thank you very much for developing this software.

I was wondering if it is possible to add GPU support to RMBlast in order to speed up RepeatModeler runs.

I think it would be a useful alternative for the community. RepeatModeler relies heavily on disk read/write speed, and most high-performance computing infrastructures use network-based storage. GPU support would let users run RepeatModeler on their own desktop computers, which commonly have much faster local SSD storage. For instance, on a cluster I have access to, RepeatModeler reports a storage throughput of 387.75 MB/s, whereas on my desktop PC with an NVMe SSD the reported storage throughput is 1627.98 MB/s.
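As a quick sanity check of that difference, something like the following Python sketch can be used to time a sequential write and read in a working directory. This is purely illustrative: it is not the benchmark RepeatModeler itself runs, and the file and block sizes are arbitrary choices.

```python
# Rough sequential-throughput check for a working directory.
# Illustrative only; NOT the benchmark RepeatModeler itself reports.
import os
import time

def estimate_throughput(workdir, size_mb=512, block_mb=4):
    """Time a sequential write and read of a scratch file in `workdir`."""
    path = os.path.join(workdir, "throughput_test.tmp")
    block = os.urandom(block_mb * 1024 * 1024)
    n_blocks = size_mb // block_mb

    # Sequential write, fsync'd so the page cache does not hide the cost.
    start = time.perf_counter()
    with open(path, "wb") as fh:
        for _ in range(n_blocks):
            fh.write(block)
        fh.flush()
        os.fsync(fh.fileno())
    write_mbps = size_mb / (time.perf_counter() - start)

    # Sequential read back (small files may still be served from cache).
    start = time.perf_counter()
    with open(path, "rb") as fh:
        while fh.read(block_mb * 1024 * 1024):
            pass
    read_mbps = size_mb / (time.perf_counter() - start)

    os.remove(path)
    return write_mbps, read_mbps

if __name__ == "__main__":
    w, r = estimate_throughput(".")
    print(f"write: {w:.1f} MB/s, read: {r:.1f} MB/s")
```

On a networked filesystem this will typically report far lower numbers than a local NVMe drive, which matches the gap described above.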

I did some research on this and found that there are a couple of BLAST implementations that take advantage of GPUs and report shorter run-times (up to 4 times faster, according to the authors). See here for blastn, and here for blastp.

Do you think it would be possible to make a patch based on such BLAST implementations?

Felipe

edit: kindness

jebrosen commented 2 years ago

Do you think it would be possible to make a patch based on such BLAST implementations?

In principle yes; unfortunately I expect this to be a very significant undertaking. Currently RepeatModeler uses RMBlast, a modification to the original BLAST+ suite. Importantly, RMBlast added support for custom score matrices, which RepeatModeler uses because repetitive DNA and TEs are often under different evolutionary constraints and substitution rates than the sequences that BLAST is routinely used with.
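As a toy illustration of why matrix support matters, consider scoring a diverged repeat copy against its consensus with a flat match/mismatch scheme versus a matrix that penalizes transitions less harshly than transversions. The score values below are invented for this example; they are not the matrices shipped with RepeatMasker/RMBlast.

```python
# Toy illustration of custom nucleotide scoring for diverged repeat copies.
# The score values here are invented for illustration only; they are NOT the
# matrices that RepeatMasker/RMBlast actually use.

# Flat match/mismatch scoring (roughly the default blastn-style scheme).
def simple_score(a, b, match=1, mismatch=-2):
    return sum(match if x == y else mismatch for x, y in zip(a, b))

# A matrix that penalizes transitions (A<->G, C<->T) less than transversions,
# reflecting the substitution pattern typical of aging TE copies.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def matrix_score(a, b, match=1, transition=-1, transversion=-3):
    score = 0
    for x, y in zip(a, b):
        if x == y:
            score += match
        elif (x, y) in TRANSITIONS:
            score += transition
        else:
            score += transversion
    return score

if __name__ == "__main__":
    consensus = "ACGTACGTAC"
    old_copy  = "ACATACGCAC"   # two transitions relative to the consensus
    print("flat   :", simple_score(consensus, old_copy))   # 4
    print("matrix :", matrix_score(consensus, old_copy))   # 6
```

With these toy values the diverged copy scores 6 under the transition-aware matrix versus 4 under flat scoring; preserving that kind of distinction for old, diverged copies is exactly what RMBlast's custom matrix support is for.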

G-BLASTN is also a modification of BLAST+, but it is based on an older version of blastn. After skimming the G-BLASTN paper and some of the code, I expect it would already take a large amount of work just to evaluate whether G-BLASTN is, or could be made, compatible with more recent versions of BLAST+ and/or with the changes made for RMBlast, and whether we have the resources to make any necessary modifications.

Perhaps more importantly, I am not too confident that we would be able to develop, test, and maintain such a variant in the long term. G-BLASTN is based on the NVIDIA-specific CUDA platform (which brings along compatibility, testing, and hardware issues), and NCBI has made and continues to make other improvements to the original BLAST+ which might or might not be translatable to G-BLASTN.

I will raise this question with our team and our collaborators to find out if GPU compute is something we can/should prioritize, whether through G-BLASTN or another sequence search implementation. Thank you for bringing this up!