Closed: tamuanand closed this issue 2 months ago
Does this mean that localcolabfold is still sending data to the colabfold MSA server?
Yes. If one specifies a FASTA file as input, localcolabfold will send the sequence to the MSA server, just as ColabFold on Google Colaboratory does. localcolabfold then receives the corresponding MSA file (in .a3m format) and starts structure inference on your local GPU.
If you want to run ColabFold entirely locally, you need extensive preparation. Please use the setup_databases.sh script to download and build the databases (see also ColabFold Downloads). Instructions for running colabfold_search to obtain the MSA and templates locally are written at https://github.com/sokrypton/ColabFold/issues/563.
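To summarize the order of the steps described above, here is a dry-run sketch of the fully local pipeline. All paths, the input filename, and the output directory names are placeholders (assumptions), and the `run` wrapper only prints each command, so the sequence can be illustrated without the real databases installed:

```shell
#!/usr/bin/env bash
# Dry-run sketch of running ColabFold entirely locally.
# Paths and file names below are placeholders, not verified values.
set -euo pipefail
run() { echo "+ $*"; }   # print each command instead of executing it

# 1. Download and build the databases (one-time, very large download).
run ./setup_databases.sh /path/to/databases

# 2. Build the MSA and template hits locally with colabfold_search
#    (flags as discussed later in this thread).
run colabfold_search --use-env 1 --use-templates 1 --db-load-mode 2 \
    --mmseqs /path/to/mmseqs --db2 pdb100_230517 --threads 12 \
    input.fasta /path/to/databases msa_out

# 3. Predict structures from the local MSA; no MSA server is contacted.
run colabfold_batch msa_out predictions
```

Removing the `run` wrapper turns the sketch into actual invocations once the placeholder paths point at a real database build.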
Hi @YoshitakaMo
I have used setup_databases.sh to download the different files. Do I need to do anything different to build them? I assume the databases are built already. I have the databases directory at the same level as the localcolabfold directory. Do I need to pass any special flag to colabfold_batch to tell it to use the databases from my local databases folder?
If you want to run ColabFold entirely locally, you need extensive preparation. Please use setup_databases.sh script to download and build the databases (See also ColabFold Downloads). An instruction to run colabfold_search to obtain the MSA and templates locally is written at https://github.com/sokrypton/ColabFold/issues/563
I am also trying to replicate this, but I end up getting an error:
colabfold_search \
--use-env 1 --use-templates 1 \
--db-load-mode 2 \
<path_to>/localcolabfold/colabfold-conda/bin/mmseqs \
--db2 pdb100_230517 --threads 12 \
ras_raf.fasta <path_to>/databases manual_ras_raf
colabfold_search: error: unrecognized arguments: manual_ras_raf
Any ideas what I could be doing wrong?
Thanks
I guess you forgot to add --mmseqs before <path_to>/localcolabfold/colabfold-conda/bin/mmseqs.
colabfold_search \
--use-env 1 \
--use-templates 1 \
--db-load-mode 2 \
--mmseqs <path_to>/localcolabfold/colabfold-conda/bin/mmseqs \
--db2 pdb100_230517 \
--threads 12 \
ras_raf.fasta \
<path_to>/databases \
manual_ras_raf
Thanks @YoshitakaMo - yes, you are correct. I missed the --mmseqs.
Running it now and will update.
Hi @YoshitakaMo - I was able to use colabfold_search correctly.
I had a question on the path_to_pdb_mmcif files for the next step: running colabfold_batch with the colabfold_search output. I am using instructions from here.
Should I use --local-pdb-path <path_to>/databases/pdb or --local-pdb-path <path_to>/databases/pdb/divided?
colabfold_batch --help has this:
--local-pdb-path LOCAL_PDB_PATH
Directory of a local mirror of the PDB mmCIF database (e.g.
/path/to/pdb/divided). If provided, PDB files from the directory are used
for templates specified by '--pdb-hit-file'. (default: None)
Thanks in advance.
Should I use --local-pdb-path <path_to>/databases/pdb or --local-pdb-path <path_to>/databases/pdb/divided?
In my case, I prepared pdb_mmcif/mmcif_files containing xxxx.cif files using download_pdb_mmcif.sh, which is distributed in DeepMind's AlphaFold2 repository. The colabfold_batch prediction was performed with --local-pdb-path <path_to>/pdb_mmcif/mmcif_files and --pdb-hit-file foo_pdb100_230517.m8.
--local-pdb-path <path_to>/pdb_mmcif/mmcif_files can also automatically detect gzipped mmCIF files such as <path_to>/pdb_mmcif/mmcif_files/divided/xx/yxxz.cif.gz.
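As a concrete illustration of that divided layout: the wwPDB mirror groups entries by the middle two characters of the four-character PDB ID, so the gzipped path for a given ID can be derived like this (a minimal sketch; the ID is made up):

```shell
# The "divided" layout keys each entry on the middle two characters of
# its PDB ID: ID "1abc" lives under divided/ab/1abc.cif.gz.
id="1abc"                           # hypothetical PDB ID
sub="${id:1:2}"                     # characters 2-3 of the ID -> "ab"
path="divided/${sub}/${id}.cif.gz"
echo "$path"                        # -> divided/ab/1abc.cif.gz
```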
Thanks @YoshitakaMo for answering all questions.
In my case, I prepared pdb_mmcif/mmcif_files containing xxxx.cif files using download_pdb_mmcif.sh, which is distributed in DeepMind's AlphaFold2 repository. The colabfold_batch prediction was performed with --local-pdb-path <path_to>/pdb_mmcif/mmcif_files and --pdb-hit-file foo_pdb100_230517.m8.
Based on your note, I prepared pdb_mmcif/mmcif_files as above, and I can run colabfold_batch on the mmCIF files from DeepMind's AlphaFold2 repository.
What difference should be expected between using the mmCIF files from DeepMind's AF2 repository versus the mmCIF files in the divided directory produced by setup_databases.sh? I realize I get a different number of .cif files (225158 + 4538 obsolete = 229696) from today's download from the AF2 repository compared to the number of .cif.gz files (224572 in divided + 4535 in obsolete = 229107) from setup_databases.sh.
Thanks once again.
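Counts like those above can be reproduced with `find | wc -l`; the snippet below illustrates this on a tiny mock mirror so it is self-contained (the real paths and numbers differ):

```shell
# Build a tiny mock of the divided/obsolete layout purely for illustration.
mkdir -p mock_pdb/divided/ab mock_pdb/obsolete
touch mock_pdb/divided/ab/1abc.cif.gz \
      mock_pdb/divided/ab/2abx.cif.gz \
      mock_pdb/obsolete/1old.cif.gz

# Count current vs. obsolete entries, as in the totals quoted above.
# $(( )) normalizes any whitespace padding that wc -l may emit.
n_current=$(( $(find mock_pdb/divided -name '*.cif.gz' | wc -l) ))
n_obsolete=$(( $(find mock_pdb/obsolete -name '*.cif.gz' | wc -l) ))
echo "current: $n_current obsolete: $n_obsolete total: $((n_current + n_obsolete))"
# -> current: 2 obsolete: 1 total: 3
```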
I realize I might have different number of cif files (225158 + 4538 obsolete = 229696) from today's download from AF2 repository when compared to the number of cif.gz files (224572 in divided + 4535 in obsolete = 229107) using setup_databases.sh
The structural data of the Protein Data Bank (PDB) is updated once a week. I suspect that the PDB data was updated between the time you previously used setup_databases.sh to build the database with .cif.gz files and today. The current number of entries is shown at https://www.rcsb.org/.
In any case, the template information that AlphaFold2/ColabFold retrieves from the PDB is minimal in most cases, so it will likely not significantly impact the prediction results. You can obtain nearly the same results regardless of the PDB version.
Thanks @YoshitakaMo
Hi,
I am following these steps to run localcolabfold: https://github.com/YoshitakaMo/localcolabfold?tab=readme-ov-file#for-linux
After successful installation and setup, when I run colabfold_batch, I get this WARNING.
Question: Does this mean that localcolabfold is still sending data to the colabfold MSA server? This link suggests that localcolabfold will run locally.
Please advise. Thanks in advance.