YoshitakaMo / localcolabfold

ColabFold on your local PC
MIT License

Why does my colabfold_search run take so long? #245

Open spark159 opened 1 month ago

spark159 commented 1 month ago

Caution: Please only report your issue related to the installation on your local PC or macOS. If you can get the help message by colabfold_batch --help or run a test prediction successfully, your installation is successful. Requests or questions regarding ColabFold features should be directed to ColabFold repo's issues.


What is your installation issue?

I tried to run colabfold_search on a SLURM cluster, but it takes more than 2 days even though the input FASTA contains just a single entry.

Computational environment

I used this job allocations:

#SBATCH -c 10                # Requested cores
#SBATCH --time=2-00:00       # Runtime in D-HH:MM format
#SBATCH --partition=medium   # Partition to run in
#SBATCH --mem=100GB          # Requested memory
#SBATCH -o %j.out            # File to which STDOUT will be written, including job ID (%j)
#SBATCH -e %j.err            # File to which STDERR will be written, including job ID (%j)

To Reproduce

And this is my colabfold_search command:

colabfold_search \
    --use-env 1 \
    --use-templates 0 \
    --db-load-mode 2 \
    --mmseqs mmseqs \
    --threads 4 \
    ${input_path} \
    ${database_path} \
    ${output_path}

And my FASTA input is simply this:

>S1_S8
NHIIIPSYASWFDYNCIHVIERRALPEFFNGKNKSKTPEIYLAYRNFMIDTYRLNPQEYLTSTACRRNLTGDVCAVMRVHAFLEQWGLVNYQVDPESRPMAMGPPPTPHFNVLADTPSGLVPLHLRSPQVPAAQQMLNFPEKNKEKPVDLQNFGLRTDIYSKKTLAKSKGASAGREWTEQETLLLLEALEMYKDDWNKVSEHVGSRTQDECILHFLRLPIEDPYLENSDASLGPLAYQPVPFSQSGNPVMSTVAFLASVVDPRVASAAAKAALEEFSRVREEVPLELVEAHVKKVQEAAR:PLCTLLDWQDSLAKRCVCVSNTIRSLSFVPGNDFEMSKHPGLLLILGKLILLHHKHPERKQAPLTYEKEEEQDQGVSCNKVEWWWDCLEMLRENTLVTLANISGQLDLSPYPESICLPVLDGLLHWAVCPSAEAQDPFSTLGPNAVLSPQRLVLETLSKLSIQDNNVDLILATPPFSRLEKLYSTMVRFLSDRKNPVCREMAVVLLANLAQGDSLAARAIAVQKGSIGNLLGFLEDSLAATQFQQSQASLLHMQNPPFEPTSVDMMRRAARALLALAKVDENHSEFTLYESRLLDISVSPLMNSLVSQVICDVLFLIGQS

Expected behavior

I expected a short run time, on the order of a few hours, but the job ran for more than 2 days and was cancelled. I am attaching the log file, too: 42792350.txt

Thank you for your help in advance!

YoshitakaMo commented 1 month ago

This issue is not about the installation itself. The speed of colabfold_search depends largely on the machine environment: the file system, the storage (an SSD larger than 2 TB is highly recommended for best performance), the RAM (more than 768 GB for best performance), and whether or not vmtouch is used. If your job ran on a shared supercomputer where the file system is RAID or network-mounted, the search will be very slow.
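To see which of these cases applies, it can help to check what kind of file system the databases sit on. The following is only a minimal sketch, assuming Linux with a `df` that supports `-T`. `DB_DIR` is a hypothetical placeholder for the database path, and the list of network file system types is an illustrative assumption, not an exhaustive one.

```shell
#!/usr/bin/env bash
# Sketch: report the file system type behind a database directory.
# Assumptions: Linux, df with -T support; DB_DIR is a hypothetical variable.

# Classify a file system type string as "network" or "local".
# The list of network types below is illustrative, not complete.
classify_fs() {
  case "$1" in
    nfs|nfs4|cifs|lustre|gpfs|beegfs|ceph) echo "network" ;;
    *) echo "local" ;;
  esac
}

# Print the file system type of the mount that holds a path.
fs_type_of() {
  df -PT "$1" | awk 'NR == 2 { print $2 }'
}

db_dir="${DB_DIR:-.}"
fstype="$(fs_type_of "$db_dir")"
echo "$db_dir: fstype=$fstype ($(classify_fs "$fstype"))"

# Where lsblk is available, ROTA=1 marks a rotational disk (HDD), ROTA=0 an SSD.
if command -v lsblk >/dev/null 2>&1; then
  lsblk -d -o NAME,ROTA || true
fi
```

Run it with something like `DB_DIR=/path/to/databases bash check_storage.sh` (the script name is made up for this example). A "network" result, or `ROTA=1` for the underlying device, would match the slow case described above.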

spark159 commented 1 month ago

Thank you so much for your kind response!

Just a few more questions.

Do you have any suggestions for speeding up the run in my current environment? What is the most important factor for performance? Is it probably RAM (> 768 GB)?

Thank you!

YoshitakaMo commented 1 month ago

In my experience, the most important factors are the file system and the use of an SSD. If the sequence databases are placed on an SSD connected by a SATA cable, colabfold_search returns the result in 30 to 60 minutes even if the machine has only 64 GB of RAM (as on my desktop running Ubuntu 22.04). However, using an HDD or a network-mounted drive will slow the calculation by more than 10 times. If the sequence databases can be fully cached in RAM (> 768 GB) on the first run of colabfold_search, subsequent runs will be extremely fast, on par with the MMseqs2 web server.
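As a rough illustration of the caching point, the sketch below compares the total size of the index files with the memory currently available and, if they fit, asks vmtouch to read them into the page cache. It is a sketch under assumptions, not a recommended setup: `DB_DIR` is a hypothetical placeholder, the `*.idx` pattern assumes the databases ship precomputed index files with that suffix, and `/proc/meminfo` is Linux-specific. `vmtouch -t` is the tool's "touch pages into memory" option.

```shell
#!/usr/bin/env bash
# Sketch: check whether the index files would fit in RAM, then preload them.
# Assumptions: Linux (/proc/meminfo); DB_DIR and the *.idx pattern are hypothetical.

# Total size in bytes of the *.idx files directly under a directory.
idx_bytes() {
  local f
  for f in "$1"/*.idx; do
    [ -f "$f" ] && wc -c < "$f"
  done | awk '{ s += $1 } END { printf "%.0f\n", s }'
}

# MemAvailable from /proc/meminfo, converted from kB to bytes.
mem_available_bytes() {
  awk '/^MemAvailable:/ { printf "%.0f\n", $2 * 1024 }' /proc/meminfo
}

# Succeeds if the needed byte count is at most the available byte count.
fits_in_ram() {
  [ "$1" -le "$2" ]
}

db_dir="${DB_DIR:-.}"
need="$(idx_bytes "$db_dir")"
avail="$(mem_available_bytes)"

if fits_in_ram "$need" "${avail:-0}"; then
  echo "index files ($need bytes) fit in available RAM (${avail:-0} bytes)"
  # Preload only if vmtouch is installed and there is something to load.
  if command -v vmtouch >/dev/null 2>&1 && [ "$need" -gt 0 ]; then
    vmtouch -t "$db_dir"/*.idx
  fi
else
  echo "index files ($need bytes) exceed available RAM (${avail:-0} bytes)"
fi
```

On a shared cluster node the page cache may be evicted by other jobs, so a check like this only indicates whether full caching is possible at all, not that it will persist between runs.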