amorehead / alphafold3-pytorch-lightning-hydra

Implementation of AlphaFold 3 in PyTorch Lightning + Hydra
MIT License

TODOs #1

Open amorehead opened 5 months ago

amorehead commented 5 months ago
milot-mirdita commented 4 months ago

Homology search

Below are my current plans for generating all MSAs required for training:

Nucleotide

Protein

Templates

Protein multimers

amorehead commented 4 months ago

@milot-mirdita, thanks for putting this list of remaining MSA/template tasks together! Per your request, we are now differentiating RNA and DNA chains within the clustering outputs (e.g., chain sequence FASTA files). You can find these updated FASTA files in the shared OneDrive folder (within assembly1_clustering_data_caches.tar.gz). Let me know if you have any questions about these new files.

Also, just a heads up that I'm currently assembling the AF3 validation dataset (similarly to how the training dataset was put together). Once I have the FASTA files for this validation set assembled, I'll let you know.

amorehead commented 4 months ago

@milot-mirdita, you should now be able to find the AF3 validation dataset's filtered mmCIF files and clustering output files in the shared OneDrive directory. We will need to generate MSAs and templates for these as well. Note that, unlike the AF3 training dataset, the validation dataset covers the date range [2021-10-01, 2023-01-13]. Also note that applying the stringent 40% sequence identity threshold to the validation peptide and nucleic acid chain and interface clusters (measured against the training dataset) left no peptide or nucleic acid validation examples in the dataset; only protein and ligand chains/interfaces remain. In the next few days, I may revisit how stringently we filter these validation chains/interfaces so that we can recover the DNA, RNA, and peptide chains for cross-validation.
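
To make the filtering described above concrete, here is a minimal Python sketch of how a date window plus a sequence identity cutoff can empty out whole molecule types. It is not the repository's actual filtering code: the `Chain` record, its field names, the precomputed `max_train_identity` value, and the strict `<` comparison are all assumptions made for illustration, and it covers polymer chains only (ligands and interfaces are left out).

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date

# Validation date window quoted in the comment above (treated as inclusive).
VAL_START = date(2021, 10, 1)
VAL_END = date(2023, 1, 13)

# 40% sequence identity threshold with respect to the training dataset.
MAX_IDENTITY = 0.40


@dataclass
class Chain:
    """Hypothetical record for one polymer chain (not the repo's data model)."""

    chain_id: str
    mol_type: str  # e.g. "protein", "peptide", "dna", "rna"
    release_date: date
    # Assumed to be precomputed: highest sequence identity (0.0 to 1.0)
    # between this chain and any chain in the training dataset.
    max_train_identity: float


def keep_for_validation(chain: Chain, max_identity: float = MAX_IDENTITY) -> bool:
    """Keep a chain if it is inside the date window and below the identity cutoff."""
    in_window = VAL_START <= chain.release_date <= VAL_END
    # Assumption: the cutoff is strict, so a chain at exactly 40% is dropped.
    return in_window and chain.max_train_identity < max_identity


def surviving_counts(chains: list[Chain], max_identity: float = MAX_IDENTITY) -> Counter:
    """Count surviving chains per molecule type for a given identity cutoff."""
    return Counter(
        c.mol_type for c in chains if keep_for_validation(c, max_identity)
    )


# Made-up example chains, for illustration only.
example_chains = [
    Chain("1abc_A", "protein", date(2022, 3, 1), 0.25),
    Chain("2def_B", "rna", date(2022, 5, 1), 0.85),
    Chain("3ghi_C", "dna", date(2020, 1, 1), 0.10),  # outside the date window
    Chain("4jkl_D", "peptide", date(2022, 7, 1), 0.40),
]

strict = surviving_counts(example_chains)
relaxed = surviving_counts(example_chains, max_identity=0.90)
```

Comparing `strict` and `relaxed` per molecule type is one way to see how far the cutoff would need to be loosened before DNA, RNA, or peptide chains reappear in the validation set.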