flatironinstitute / deepblast

Neural Networks for Protein Sequence Alignment
BSD 3-Clause "New" or "Revised" License
114 stars · 21 forks

Feasible approach to build a large database #159

Open yzlwk opened 6 months ago

yzlwk commented 6 months ago

Hello, I am trying to build a database from the NCBI nr FASTA (707,338,897 entries) for more extensive protein searches. I have tried splitting the FASTA into smaller chunks (about 250 entries per run) and combining the resulting .npy files; larger chunks cause frequent GPU out-of-memory errors, and I only have access to a 24 GB GPU. At this rate, however, it seems the process will take roughly 3 years to finish. Is there any way to speed it up?

mortonjt commented 6 months ago

Hi, no, that is not feasible. You'd need a much larger GPU cluster to encode that many proteins.
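The cluster-size point can be made concrete with a back-of-envelope calculation using only the numbers already in this thread (~707M sequences, ~3 years on one 24 GB GPU), and assuming near-linear scaling across GPUs, which is an optimistic idealization:

```python
# Back-of-envelope scaling from the numbers in this thread.
total_seqs = 707_338_897          # NCBI nr entries quoted above
single_gpu_years = 3.0            # reported single-GPU estimate

seconds = single_gpu_years * 365 * 24 * 3600
rate = total_seqs / seconds       # implied throughput on one GPU
# rate is roughly 7.5 sequences/second

# Assuming near-linear scaling (ignoring I/O and scheduling overhead),
# finishing in about one month would need on the order of 36 GPUs:
gpus_for_one_month = single_gpu_years * 12
```

Even this idealized estimate shows why a single 24 GB GPU cannot realistically cover nr, and why the maintainers point to a cluster-scale effort instead.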

We have considered using ESM2 instead of ProtTrans, which could also let us take advantage of the 700M proteins in MGnify. That would, however, require retraining both DeepBLAST and TM-Vec with the ESM2 model.
