amorehead commented 2 months ago

[ ] Datasets
- [ ] AlphaFold 3 PDB dataset
  - [x] PDB mmCIF filtering script (scripts/filter_pdb_mmcifs.py)
    - [x] Fix periodic residue count-chemical component count mismatch error that shows up for mmCIF files such as 00/200l.cif (@amorehead).
    - [x] Fix missing chemical component details error that shows up for certain chains within mmCIF files such as 07/207d.cif (@amorehead).
    - [x] Fix author chain ID-author residue ID issue that causes parse()d Biomolecules to have missing residues (e.g., residues 1-10 within chain B of 100d.cif) filled in by dummy residues when an mmCIF file is exported from a Biomolecule object. This happens because authors of mmCIF files sometimes number residues so that indices increase monotonically from the first chain to the last (e.g., residue indices 1-10 in chain A and 11-20 in chain B). In that case, the AlphaFold 2-borrowed logic currently in use treats residues 1-10 in chain B as "missing" and adds padding residues for them. This breaks future re-parsing of these (filtered) mmCIF files, since the residue sequences (e.g., in chain B) will be incorrect from then on (@amorehead).
    - [x] Finish the remove_leaving_atoms filtering step (@amorehead).
    - [x] Add a filtering step that removes any chains already contained in the chain removal set from the _pdbx_struct_assembly and _pdbx_struct_assembly_gen mmCIF categories as well. Note that _pdbx_struct_oper_list will be left unchanged, as there does not appear to be any harm in leaving it as is (although filtered file sizes might be slightly larger than minimally possible). This will ensure that future expansion of a (filtered) mmCIF's asymmetric units using these prescribed geometric transformations (per chain) is still possible after running the filtering script. In other words, this will make it possible to properly build bioassemblies (where certain chains may need to be duplicated via rotations and translations) after filtering out chains from each file (@amorehead).
    - [x] Export polymer and non-polymer residue sequences in each mmCIF created by write_mmcif() to enable correct re-parsing of these (filtered) mmCIF files (@amorehead).
    - [x] Fix parsing of ligand residues by tokenizing ligand atoms as "pseudoresidues" instead of as atoms of a parent ligand residue (@amorehead).
    - [x] Bug test the filtering script's execution by running it on the full PDB mmCIF dataset (e.g., look at the parsing errors arising from 7a4d and 8a3j) (@amorehead).
    - [x] During construction of each Biomolecule object (prior to calling to_mmcif()), reindex the residues of each chain to start at 1 and to be monotonically increasing to avoid downstream re-parsing errors (originally caused by exporting missing residues) for exported mmCIF files.
    - [x] Export _pdbx_struct_assembly and _pdbx_struct_assembly_gen categories in the (filtered) mmCIF files created by write_mmcif(), to ensure that the chain IDs included in these (exported) categories reflect the author chain IDs, not the original mmCIF (internal) chain IDs. Also handle the multi-mmCIF chain ID to single-author chain ID scenario in this implementation (@amorehead).
    - [x] Filter the first assembly of each complex by referencing both the asymmetric units' mmCIF files and the first assemblies' mmCIF files concurrently. The output of the filtering script should then be the filtered first assembly mmCIF files (@amorehead).
  - [x] mmCIF I/O functions within the mmCIF filtering script (scripts/filter_pdb_mmcifs.py)
    - [x] Write unit test (i.e., test_unfiltered_mmcif_object_parsing()) for the parse_mmcif_object() function (@amorehead).
    - [x] Write unit test (i.e., test_filtered_mmcif_object_parsing()) for the write_mmcif() function (@amorehead).
  - [x] PDB mmCIF clustering script (scripts/cluster_pdb_mmcifs.py)
    - [x] Update the script to now use the new mmcif_parsing.parse function as well as the Biomolecule data structure (and its associated metadata) to make the sequence clustering logic (across the different residue types - e.g., RNA, proteins) much simpler and cleaner. Note that the current implementation is partially incomplete since I hit a point in the original implementation where the code was becoming much too messy and slow to be useful for clustering the full PDB dataset (@amorehead).
    - [x] Optimize the clustering script's runtime by switching from using clustalo to mmseqs to speed up protein, nucleic acid, and peptide sequence clustering (@amorehead).
  - [x] AlphaFold 3 dataloading
    - [x] Implement a function to build a bioassembly from an mmcif_object and its all_atom_positions (on Line 47 of alphafold3_pytorch/data/data_pipeline.py) (@amorehead).
    - [x] Create all features listed in Table 5 of the supplement from the finalized (i.e., filtered) PDB mmCIF dataset using, e.g., the Biomolecule data structure (or, more likely, the PDBInput data structure). For example, it should be simple to build the four is_protein/rna/dna/ligand masks from the Biomolecule data structure's chemtype array (@amorehead).
    - [x] Ensure mmCIF-derived bond metadata is only featurized and used during training and under the criteria described in Section 5.1 (@amorehead).
    - [x] Add interface ID filtering support to Biomolecule creation by allowing multiple chain IDs to be passed to the _from_mmcif_object function (@amorehead).
    - [x] Create a WeightedPDBSampler that performs the sampling algorithm presented in Section 2.5.1 of the AlphaFold 3 supplement (@vandrw).
    - [x] Add the cropping techniques described in Section 2.7 of the AlphaFold 3 supplement, which can be applied to a parsed Biomolecule object after chain/interface filtering it based on the WeightedPDBSampler's sampled chain/interface IDs.
      - [x] Add a contiguous sequence cropping function (@amorehead).
      - [x] Add a spatial cropping function (@amorehead).
      - [x] Add a spatial interface cropping function (@amorehead).
    - [x] Wire up the WeightedPDBSampler into the PDBDataset class to load (cropped) mmCIF training examples (@sj900).
    - [x] Create MSA feature loading functions within the pdb_input_to_molecule_input function (@sj900).
    - [x] Expand MSA feature loading to support all of life's molecules (@amorehead).
    - [x] Create template feature loading functions within the pdb_input_to_molecule_input function (@sj900 and @amorehead).
- [ ] AlphaFold 3 genetic databases (@milot-mirdita)
  - [ ] Curate Python helper scripts to run each of the alignment tools and configurations listed in Tables 1 and 2 of the AlphaFold 3 supplement (n.b., the AlphaFold 2 and OpenFold repos most likely support much of this functionality already; port it over as necessary).
  - [ ] Curate download scripts for each genetic database (n.b., once again, AlphaFold 2 or OpenFold's code can be referenced here).
- [ ] AlphaFold 3 evaluation datasets
  - [x] AlphaFold 3 validation set (see Section 5.8) (@amorehead)
  - [x] Recent PDB evaluation set (see Sections 6.1 and 6.2) (will also perform CASP16 benchmarking) (@amorehead)
  - [ ] Add an inference script that one can use to generate predicted .cif files for arbitrary sequence inputs using a trained model checkpoint (@dhuvik)
  - [ ] Assemble evaluation metrics (see Sections 6.3 and 6.4) (will use CASP's native metrics for CASP16 benchmarking)
- [ ] AlphaFold 3 distillation datasets (currently de-prioritized)
  - [ ] Curate the MGnify protein monomer predictions dataset following Table 3 of the supplement.
  - [ ] Curate the MGnify short protein monomer predictions dataset following Table 3 of the supplement.
  - [ ] Curate the Rfam RNA monomer predictions dataset following Table 3 of the supplement (after training v1 of the AlphaFold 3 weights).
  - [ ] Curate the MGnify protein + random DNA dataset following Table 3 of the supplement.
  - [ ] Curate the DNA + protein predictions from JASPAR dataset following Table 3 of the supplement.
[x] Modeling
- [x] Metrics
  - [x] Model selection
    - [x] Implement a version of the model selection algorithm described in Section 5.7 of the AlphaFold 3 supplement (@xluo233).
  - [x] Confidence measures and sample ranking
    - [x] Implement alignment-based confidence measures (see Section 5.9.1) (@xluo233).
    - [x] Implement clash penalty in ranking (see Section 5.9.2) (@xluo233).
    - [x] Implement sample ranking (see Section 5.9.3) (@xluo233).
- [x] Training
  - [x] Wire up the PDBDataset into the Alphafold3LitModule for full training, validation, and testing support (@amorehead).
  - [x] Wire in the ComputeModelSelectionScore module into the validation and test loops (@amorehead).
  - [x] Wire in PAE, PDE, pLDDT, and resolved loss labels to PDBInputs (@amorehead)
  - [x] Annotate output mmCIFs with pLDDTs if available (@amorehead)
  - [x] As a sanity check, overfit the model to a couple of complexes, and visualize the outputs over the course of training (using batch_size={1,2}) (@amorehead).
  - [x] Implement chain permutation and symmetry resolution (Section 4.2) (@amorehead)
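
For reference, the per-chain residue reindexing mentioned in the filtering-script items above (renumbering each chain to start at 1 and increase monotonically) could look roughly like the sketch below. This is not the code in scripts/filter_pdb_mmcifs.py; the function name and the flat per-atom input lists are hypothetical, and it assumes the atoms of each residue are stored contiguously within their chain.

```python
from collections import defaultdict


def reindex_residues(chain_ids, residue_ids):
    """Renumber residues so that every chain starts at 1 and increases by 1.

    Hypothetical helper: `chain_ids` and `residue_ids` are per-atom lists of
    equal length. Assumes the atoms of a residue are contiguous within a chain.
    """
    next_index = defaultdict(int)  # last new index handed out per chain
    last_seen = {}                 # last original residue ID seen per chain
    new_ids = []
    for chain, res in zip(chain_ids, residue_ids):
        if last_seen.get(chain) != res:
            # A new residue starts in this chain: advance its counter.
            next_index[chain] += 1
            last_seen[chain] = res
        new_ids.append(next_index[chain])
    return new_ids


# Chain B continues the author numbering of chain A (11, 12, ...), which is
# the case that previously produced padding residues on export.
chains = ["A", "A", "A", "B", "B", "B"]
author_ids = [1, 1, 2, 11, 11, 12]
print(reindex_residues(chains, author_ids))
```

Note that this also closes any numbering gaps inside a chain, so it is only appropriate when missing residues are not meant to be exported.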

milot-mirdita commented 1 month ago

Homology search

The following are my current plans for generating all MSAs required for training:

Nucleotide

[x] Collect Rfam 14.9, RNAcentral 21 and NT (older version of 202307 available locally)
[x] Cluster with --min-seq-id 0.9 -c 0.8 (clustered everything at once, instead of separately; 62949752 cluster representatives)
[ ] Not sure how to search this yet, as nhmmer is slow. Might introduce an MMseqs2 prefiltering step followed by a second-stage nhmmer search.
[x] Separated DNA and RNA chains would be helpful @amorehead
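
As a rough illustration of the clustering step above, a small Python wrapper could assemble the MMseqs2 call with the two stated parameters. This is only a sketch: the thread does not say which MMseqs2 workflow was used, so easy-cluster is an assumption here, and the helper name and file names are hypothetical.

```python
import shlex


def build_mmseqs_cluster_cmd(fasta_path, out_prefix, tmp_dir,
                             min_seq_id=0.9, coverage=0.8):
    """Build an `mmseqs easy-cluster` command line as a list of arguments.

    Hypothetical helper. The workflow choice (easy-cluster) is an assumption;
    only --min-seq-id 0.9 and -c 0.8 come from the plan above.
    """
    return [
        "mmseqs", "easy-cluster",
        fasta_path, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity
        "-c", str(coverage),              # minimum alignment coverage
    ]


# Placeholder file names, not the actual inputs.
cmd = build_mmseqs_cluster_cmd("nucleotide_all.fasta", "nucleotide_clu", "tmp")
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run(cmd, check=True) on a machine where mmseqs is installed.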

Protein

[ ] Search against uniref30_2202 and colabfold_envdb_202108 to somewhat match DM's protein db cutoff of 2205.

Templates

[ ] Search with the colabfold pipeline against pdb_seqres with the same cut-off as the other training data.

Protein multimers

[ ] Need a pregenerated list of sampled multimers as full-length sequences; these can be cropped on the fly (?)
[ ] Search against uniref30_2202
[ ] Taxonomic pairing

amorehead commented 1 month ago

@milot-mirdita, thanks for putting this list of remaining MSA/template tasks together! Per your request, we are now differentiating RNA and DNA chains within the clustering outputs (e.g., chain sequence FASTA files). You can find these updated FASTA files in the shared OneDrive folder (within assembly1_clustering_data_caches.tar.gz). Let me know if you have any questions about these new files.

Also, just a heads up that I'm currently assembling the AF3 validation dataset (similarly to how the training dataset was put together). Once I have the FASTA files for this validation set assembled, I'll let you know.

amorehead commented 1 month ago

@milot-mirdita, you should now be able to find the AF3 validation dataset's filtered mmCIF files and clustering output files in the shared OneDrive directory. We will need to generate MSAs and templates for these as well. Note that the date range for the validation dataset (in contrast to the AF3 training dataset) is [2021-10-01, 2023-01-13]. Also note that the stringent 40% sequence identity threshold (when applied to the validation peptide and nucleic acid chain and interface clusters with respect to the training dataset) left no peptide or nucleic acid validation examples in the dataset. Only protein and ligand chains/interfaces remain. In the next few days, I may revisit how stringently we filter out these validation chains/interfaces to recover the DNA, RNA, and peptide chains for cross-validation.
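
To make the two validation criteria above concrete, here is a minimal sketch of the selection rule. It assumes the date range applies to a per-entry date (called release_date here) and that the maximum sequence identity of each validation chain to the training set has already been computed; the function name, both argument names, and the choice of a strict "<" at the 40% boundary are assumptions rather than the actual implementation.

```python
from datetime import date

# Date range and threshold as stated above.
VAL_START = date(2021, 10, 1)
VAL_END = date(2023, 1, 13)
MAX_TRAIN_IDENTITY = 0.40


def keep_for_validation(release_date, max_identity_to_train):
    """Return True if an entry passes both validation criteria.

    Hypothetical helper: `max_identity_to_train` is a precomputed fraction
    in [0, 1]; whether the boundary is strict is an assumption.
    """
    in_range = VAL_START <= release_date <= VAL_END
    return in_range and max_identity_to_train < MAX_TRAIN_IDENTITY


print(keep_for_validation(date(2022, 6, 1), 0.25))
print(keep_for_validation(date(2022, 6, 1), 0.85))
```

Relaxing MAX_TRAIN_IDENTITY for peptide and nucleic acid chains only would be one way to recover those examples.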

amorehead / alphafold3-pytorch-lightning-hydra

TODOs #1
