aqlaboratory / openfold

Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2
Apache License 2.0

Trouble following documentation #226

Open ccsiszer opened 1 year ago

ccsiszer commented 1 year ago

Hi. I am trying to follow the documentation to install and train the model. I have successfully installed everything and have run the following commands so far, also successfully:

```bash
bash scripts/download_alphafold_dbs.sh data/
bash scripts/download_mmseqs_dbs.sh data/
bash scripts/prep_mmseqs_dbs.sh data/
```

In my data directory, I have the following:

```
total 407176420
drwxrwxr-x 2 ubuntu ubuntu         6144 Oct  3 19:16 bfd
drwxrwxr-x 2 ubuntu ubuntu         6144 Oct  3 14:05 colabfold
-rw-rw-r-- 1 ubuntu ubuntu 117965643010 Sep 30 21:20 colabfold_envdb_202108.tar.gz
drwxrwxr-x 2 ubuntu ubuntu        38912 Oct  1 19:03 mmseqs_dbs
drwxrwxr-x 5 ubuntu ubuntu         6144 Oct  1 18:45 tmp
drwxrwxr-x 2 ubuntu ubuntu         6144 Oct  3 15:03 uniref30
-rw-rw-r-- 1 ubuntu ubuntu 149491476480 Sep 30 16:23 uniref30_2103.tar
-rw-rw-r-- 1 ubuntu ubuntu 149491476480 Oct  1 09:47 uniref30_2103.tar.gz
```

I am now trying to run the training part, but I feel I am missing the data I need. For instance, I thought I was going to be able to do this:

```bash
python3 scripts/precompute_alignments_mmseqs.py input.fasta \
    data/mmseqs_dbs \
    uniref30_2103_db \
    alignment_dir \
    ~/MMseqs2/build/bin/mmseqs \
    /usr/bin/hhsearch \
    --env_db colabfold_envdb_202108_db \
    --pdb70 data/pdb70/pdb70
```

But I don't seem to have what I need to create the `input.fasta` file, and I also don't have `colabfold_envdb_202108_db` or `data/pdb70/pdb70`.

Can someone kindly point me in the right direction? I am not a data scientist; I am a data engineer/IT/wear-many-hats person, so if I say something that doesn't make much sense in terms of models, etc., I apologize.

Thank you.

gahdritz commented 1 year ago

Hm. PDB70 should be downloaded by `download_alphafold_dbs.sh` and the ColabFold database by `download_mmseqs_dbs.sh`. If those two didn't end up in your `data/` directory, the downloads must have failed for some reason. Do you still have the console output from when you ran the download scripts? If so, post it here and I can try to determine what went wrong. If not, try re-running the commands in each of those scripts corresponding to the two databases (both of those scripts are just lists of calls to database-specific scripts).
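For example, something along these lines (a sketch; check `scripts/` in your checkout for the exact helper names, which may vary by version):

```bash
# Re-fetch only the missing databases rather than re-running everything.
bash scripts/download_pdb70.sh data/        # PDB70 (assumed helper name)
bash scripts/download_mmseqs_dbs.sh data/   # ColabFold env DB + UniRef30
```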

The input FASTA file should contain the sequences for which you compute alignments, and so isn't included by default in the downloaded data. If you just want a large database of any alignments, I recommend checking out OpenProteinSet, our database of 4.5 million precomputed MSAs, of which 400k also come with template hits and AF structure predictions: https://registry.opendata.aws/openfold/
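For reference, the input file is ordinary FASTA: a `>` header line naming each sequence, followed by the amino-acid sequence itself. A minimal example, with a placeholder ID and sequence:

```
>sequence_1
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQTLG
```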

ccsiszer commented 1 year ago

Hi Gahdritz,

Unfortunately I don't have the console output of the download scripts. To give you context, all I am trying to do is see if I can train the model on a small data set as a proof-of-concept project.

Following your advice, I downloaded data from https://registry.opendata.aws/openfold/. Specifically, I created 3 directories in my "data" directory called pdb, uniclust30 and uniclust30_overflow. Inside each, I put 1000 entries from the respective locations in https://registry.opendata.aws/openfold/.

However, I am still struggling to get this right. For instance, I am trying to run this:

```bash
# In multi-GPU settings, the seed must be specified:
python3 train_openfold.py mmcif_dir/ alignment_dir/ template_mmcif_dir/ output_dir/ \
    2021-10-10 \
    --template_release_dates_cache_path mmcif_cache.json \
    --precision bf16 \
    --gpus 8 --replace_sampler_ddp=True \
    --seed 4242022 \
    --deepspeed_config_path deepspeed_config.json \
    --checkpoint_every_epoch \
    --resume_from_ckpt ckpt_dir/ \
    --train_chain_data_cache_path chain_data_cache.json \
    --obsolete_pdbs_file_path obsolete.dat
```

Before I can do that, I need the mmcif_cache.json file, as an example, but I can't generate it because I don't seem to have any .cif files. Can you guide me on how to do this with the data I have from https://registry.opendata.aws/openfold/?

gahdritz commented 1 year ago

If you ran the download scripts, you probably already have the Protein Data Bank mmCIF files. If not, you can run `scripts/download_pdb_mmcif.sh` to fetch them.
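e.g. (assuming it takes the target directory as its only argument, like the other download helpers; the full mmCIF archive is large):

```bash
bash scripts/download_pdb_mmcif.sh data/
```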

The uniclust30 MSAs are bundled with .pdb files---you should use these in place of mmCIF files for those chains, which are from UniProt, not PDB. Just dump the mmCIF files and the .pdb files in to the so-called mmcif_dir/ when you run the training command, making sure that, for every subdirectory of the alignment_dir, each of which should correspond to a single chain, there exists a corresponding structural data file in mmcif_dir.
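Roughly, the expected layout is as follows (names are illustrative; PDB chains are keyed like `11gs_A`, UniProt entries by their accession):

```
alignment_dir/
├── 11gs_A/
│   ├── bfd_uniclust_hits.a3m
│   ├── mgnify_hits.a3m
│   └── uniref90_hits.a3m
└── Q9XYZ1/
    └── ...
mmcif_dir/
├── 11gs.cif
└── Q9XYZ1.pdb
```

A quick way to confirm the pairing (a minimal sketch, assuming the layout above):

```python
import os

alignment_dir = "alignment_dir"
mmcif_dir = "mmcif_dir"

# Collect the basenames of the structure files (e.g. "11gs" from "11gs.cif").
structures = {os.path.splitext(f)[0] for f in os.listdir(mmcif_dir)}

# Alignment subdirectories are named "<file_id>_<chain>" for PDB chains
# (e.g. "11gs_A") or just "<id>" for UniProt entries.
for chain in os.listdir(alignment_dir):
    file_id = chain.split("_")[0]
    if file_id not in structures and chain not in structures:
        print(f"no structure file for alignment entry {chain}")
```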

ccsiszer commented 1 year ago

This is what I see now:

```
ubuntu@run-63400feaae186841a97025d6-k2xjc:/opt/openfold$ /opt/conda/bin/python3 train_openfold.py /domino/datasets/local/openfold/mmcif_dir/ /domino/datasets/local/openfold/alignment_dir/ /domino/datasets/local/openfold/mmcif_dir/ /domino/datasets/local/openfold/output_dir/ 2021-10-10 --template_release_dates_cache_path /domino/datasets/local/openfold/mmcif_cache.json --train_chain_data_cache_path /domino/datasets/local/openfold/chain_data_cache.json
WARNING:root:Removing 3 alignment entries (mgnify_hits.a3m, bfd_uniclust_hits.a3m, uniref90_hits.a3m) with no corresponding entries in chain_data_cache (/domino/datasets/local/openfold/chain_data_cache.json).
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/opt/conda/lib/python3.7/site-packages/pytorch_lightning/core/datamodule.py:470: LightningDeprecationWarning: DataModule.setup has already been called, so it will not be called again. In v1.6 this behavior will change to always call DataModule.setup.
  f"DataModule.{name} has already been called, so it will not be called again. "

  | Name  | Type          | Params
----------------------------------
0 | model | AlphaFold     | 93.2 M
1 | loss  | AlphaFoldLoss | 0
----------------------------------
93.2 M    Trainable params
0         Non-trainable params
93.2 M    Total params
372.916   Total estimated model params size (MB)

Traceback (most recent call last):
  File "train_openfold.py", line 573, in <module>
    main(args)
  File "train_openfold.py", line 364, in main
    ckpt_path=ckpt_path,
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 741, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 140, in run
    self.on_run_start(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 197, in on_run_start
    self.trainer.reset_train_val_dataloaders(self.trainer.lightning_module)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 595, in reset_train_val_dataloaders
    self.reset_train_dataloader(model=model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 365, in reset_train_dataloader
    self.train_dataloader = self.request_dataloader(RunningStage.TRAINING, model=model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 611, in request_dataloader
    dataloader = source.dataloader()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/data_connector.py", line 300, in dataloader
    return method()
  File "/opt/openfold/openfold/data/data_modules.py", line 726, in train_dataloader
    return self._gen_dataloader("train")
  File "/opt/openfold/openfold/data/data_modules.py", line 703, in _gen_dataloader
    dataset.reroll()
  File "/opt/openfold/openfold/data/data_modules.py", line 416, in reroll
    datapoint_idx = next(samples)
  File "/opt/openfold/openfold/data/data_modules.py", line 369, in looped_samples
    candidate_idx = next(idx_iter)
  File "/opt/openfold/openfold/data/data_modules.py", line 355, in looped_shuffled_dataset_idx
    generator=self.generator,
RuntimeError: cannot sample n_sample <= 0 samples
ubuntu@run-63400feaae186841a97025d6-k2xjc:/opt/openfold$
```

Here's what my directories look like:

```
ubuntu@run-63400feaae186841a97025d6-k2xjc:/domino/datasets/local/openfold$ ls -l mmcif_dir/
total 372
-rw-rw-r-- 1 ubuntu ubuntu 377497 Jan  8  2022 11gs.cif
```

```
ubuntu@run-63400feaae186841a97025d6-k2xjc:/domino/datasets/local/openfold$ ls -l alignment_dir/
total 3192
-rw-rw-r-- 1 ubuntu ubuntu  425707 Oct  7 14:07 bfd_uniclust_hits.a3m
-rw-rw-r-- 1 ubuntu ubuntu  194812 Oct  7 14:07 mgnify_hits.a3m
-rw-rw-r-- 1 ubuntu ubuntu 2643379 Oct  7 14:07 uniref90_hits.a3m
```

What am I missing? Thank you so much for your help!

ccsiszer commented 1 year ago

@gahdritz, can you please help?

Starting from the beginning, I downloaded all the data available here: https://registry.opendata.aws/openfold/

Specifically:

```
$ aws s3 ls --no-sign-request s3://openfold/
                           PRE openfold_params/
                           PRE pdb/
                           PRE uniclust30/
                           PRE uniclust30_overflow/
2022-06-17 03:35:44      18657 LICENSE
2022-08-28 21:57:09    4524064 duplicate_pdb_chains.txt
```

I downloaded the pdb, uniclust30 and uniclust30_overflow directories. I am just trying to test things out, so instead of attempting to train the model on all the data, I moved 1000 directories from each of the directories above (pdb, uniclust30 and uniclust30_overflow) to another location so I have smaller versions of each.
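Concretely, the copies were done with something like this (destination paths are illustrative; the prefixes come from the listing above):

```bash
# Pull the precomputed OpenProteinSet data; no AWS credentials are needed.
aws s3 cp --no-sign-request --recursive s3://openfold/pdb/ data/pdb/
aws s3 cp --no-sign-request --recursive s3://openfold/uniclust30/ data/uniclust30/
aws s3 cp --no-sign-request --recursive s3://openfold/uniclust30_overflow/ data/uniclust30_overflow/
```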

Since then, I have been trying to run the training script.

Following the documentation, I was able to run this:

```bash
python3 scripts/generate_mmcif_cache.py \
    mmcif_dir/ \
    mmcif_cache.json \
    --no_workers 16
```
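If it helps, each entry of the resulting mmcif_cache.json maps a PDB ID to basic metadata, roughly like this (values here are illustrative, not the real metadata for this entry):

```json
{
    "11gs": {
        "release_date": "1998-04-29",
        "chain_ids": ["A", "B"],
        "no_chains": 2,
        "resolution": 2.1
    }
}
```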

But only after downloading the .cif files corresponding to some entries in the pdb directory. I downloaded those .cif files from here: s3://pdbsnapshots/20220103/pub/pdb/data/structures/all/mmCIF/ (see the sketch below).
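The fetch itself was along these lines (a sketch; entries in the snapshot bucket are gzipped, so they need unpacking into mmcif_dir/):

```bash
# Pull a single mmCIF entry from the public PDB snapshot bucket and unpack it.
aws s3 cp --no-sign-request \
    s3://pdbsnapshots/20220103/pub/pdb/data/structures/all/mmCIF/11gs.cif.gz mmcif_dir/
gunzip mmcif_dir/11gs.cif.gz
```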

After generating the mmcif_cache.json file, I was able to run this:

```bash
python3 scripts/generate_chain_data_cache.py \
    mmcif_dir/ \
    chain_data_cache.json \
    --cluster_file clusters-by-entity-40.txt \
    --no_workers 16
```

Now I am trying to run this:

```bash
# In multi-GPU settings, the seed must be specified:
python3 train_openfold.py mmcif_dir/ alignment_dir/ template_mmcif_dir/ output_dir/ \
    2021-10-10 \
    --template_release_dates_cache_path mmcif_cache.json \
    --precision bf16 \
    --gpus 8 --replace_sampler_ddp=True \
    --seed 4242022 \
    --deepspeed_config_path deepspeed_config.json \
    --checkpoint_every_epoch \
    --resume_from_ckpt ckpt_dir/ \
    --train_chain_data_cache_path chain_data_cache.json \
    --obsolete_pdbs_file_path obsolete.dat
```

But I keep getting the sampling error (`RuntimeError: cannot sample n_sample <= 0 samples`); a cross-check I ran is below.
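If it helps with debugging, here is the quick cross-check (a minimal sketch, assuming the file names above):

```python
import json
import os

# An entry is usable for training only if it appears both as a subdirectory
# of alignment_dir AND as a key in chain_data_cache.json; everything else is
# filtered out, and an empty intersection triggers "cannot sample n_sample <= 0".
with open("chain_data_cache.json") as f:
    cache_chains = set(json.load(f))

alignment_chains = set(os.listdir("alignment_dir"))

print(len(cache_chains), "chains in chain_data_cache")
print(len(alignment_chains), "entries in alignment_dir")
print(len(cache_chains & alignment_chains), "usable training examples")
```

In my case the intersection is empty, which matches the earlier ls output: alignment_dir contains loose .a3m files rather than one subdirectory per chain.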

Can you please point me in the right direction, keeping in mind that I am not familiar at all with openfold? Thank you so much.

vetmax7 commented 9 months ago

Hello!

Were you able to find a solution?