flatironinstitute / deepblast

Neural Networks for Protein Sequence Alignment
BSD 3-Clause "New" or "Revised" License
114 stars · 21 forks

Feasible approach to build a large database #159

Open yzlwk opened 6 months ago

yzlwk commented 6 months ago

Hello, I am trying to build a database from the NCBI nr FASTA (707,338,897 entries) for more extensive protein searches. I have tried splitting the FASTA into smaller chunks (about 250 entries per run) and combining the resulting .npy files; larger chunks cause frequent GPU out-of-memory errors, and I only have access to a 24 GB GPU. At this rate, however, it seems the process will take roughly 3 years to finish. Is there any way to speed it up?

mortonjt commented 6 months ago

Hi, no, that is not feasible. You'd need a much larger GPU cluster to encode that many proteins.
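The cluster-size point can be made concrete with a back-of-envelope calculation using only the numbers already in this thread (~707M sequences, ~3 years on one 24 GB GPU), and assuming near-linear scaling across GPUs, which is an optimistic idealization:

```python
# Back-of-envelope scaling from the numbers in this thread.
total_seqs = 707_338_897          # NCBI nr entries quoted above
single_gpu_years = 3.0            # reported single-GPU estimate

seconds = single_gpu_years * 365 * 24 * 3600
rate = total_seqs / seconds       # implied throughput on one GPU
# rate is roughly 7.5 sequences/second

# Assuming near-linear scaling (ignoring I/O and scheduling overhead),
# finishing in about one month would need on the order of 36 GPUs:
gpus_for_one_month = single_gpu_years * 12
```

Even this idealized estimate shows why a single 24 GB GPU cannot realistically cover nr, and why the maintainers point to a cluster-scale effort instead.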

We have considered using ESM2 instead of ProtTrans, which could also let us take advantage of the 700M proteins in MGnify. That would, however, require retraining both DeepBLAST and TM-Vec with the ESM2 model.
