Open amorehead opened 5 months ago
The following are my current plans to generate all MSAs required for training:
--min-seq-id 0.9 -c 0.8
(clustered everything at once, instead of separately; 62949752 cluster representatives)
@milot-mirdita, thanks for putting this list of remaining MSA/template tasks together! Per your request, we are now differentiating RNA and DNA chains within the clustering outputs (e.g., chain sequence FASTA files). You can find these updated FASTA files in the shared OneDrive folder (within assembly1_clustering_data_caches.tar.gz). Let me know if you have any questions about these new files.
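The --min-seq-id 0.9 -c 0.8 options quoted earlier in this thread are MMseqs2 clustering flags (minimum sequence identity and alignment coverage). As a rough illustration only, the sketch below assembles such a run from Python. The thread does not say which MMseqs2 workflow was used, so the easy-cluster workflow, the function names, and all file names here are assumptions.

```python
import subprocess


def build_cluster_command(fasta, out_prefix, tmp_dir, min_seq_id=0.9, coverage=0.8):
    """Assemble an MMseqs2 easy-cluster invocation as an argument list.

    Assumption: the easy-cluster workflow is used; the thread only shows the
    two flags, not the full command.
    """
    return [
        "mmseqs", "easy-cluster",
        str(fasta), str(out_prefix), str(tmp_dir),
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity
        "-c", str(coverage),              # minimum alignment coverage
    ]


def run_clustering(fasta, out_prefix, tmp_dir):
    """Run the clustering; requires the mmseqs binary on PATH."""
    subprocess.run(build_cluster_command(fasta, out_prefix, tmp_dir), check=True)
```

Keeping command construction separate from execution makes the flags easy to inspect or log before launching a long clustering job.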
Also, just a heads up that I'm currently assembling the AF3 validation dataset (similarly to how the training dataset was put together). Once I have the FASTA files for this validation set assembled, I'll let you know.
@milot-mirdita, you should now be able to find the AF3 validation dataset's filtered mmCIF files and clustering output files in the shared OneDrive directory. We will need to generate MSAs and templates for these as well. Note that the date range for the validation dataset (in contrast to the AF3 training dataset) is [2021-10-01, 2023-01-13]. Also note that the stringent 40% sequence identity threshold (when applied to the validation peptide and nucleic acid chain and interface clusters with respect to the training dataset) left no peptide or nucleic acid validation examples in the dataset. Only protein and ligand chains/interfaces remain. In the next few days, I may revisit how stringently we filter out these validation chains/interfaces to recover the DNA, RNA, and peptide chains for cross-validation.
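For illustration, the date-range restriction mentioned above could be expressed roughly as follows. This is a minimal sketch, not the project's filtering code: it assumes the range applies to an entry's release date, that both endpoints are inclusive (as the square brackets suggest), and that dates arrive as ISO strings. The entry names are placeholders.

```python
from __future__ import annotations

from datetime import date

# Validation date range quoted in the thread; inclusivity is an assumption.
VALIDATION_START = date(2021, 10, 1)
VALIDATION_END = date(2023, 1, 13)


def in_validation_range(release_date: str) -> bool:
    """Return True if an ISO-formatted date falls inside the validation range."""
    d = date.fromisoformat(release_date)
    return VALIDATION_START <= d <= VALIDATION_END


def select_validation_entries(release_dates: dict[str, str]) -> list[str]:
    """Return the sorted IDs whose dates fall inside the validation range."""
    return sorted(
        entry_id for entry_id, d in release_dates.items() if in_validation_range(d)
    )


# Placeholder entries, not real PDB records.
example = {"entry_a": "2021-09-30", "entry_b": "2021-10-01", "entry_c": "2023-01-14"}
selected = select_validation_entries(example)
```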
[ ] Datasets
[ ] AlphaFold 3 PDB dataset
[x] PDB mmCIF filtering script (scripts/filter_pdb_mmcifs.py)
  00/200l.cif (@amorehead).
  07/207d.cif (@amorehead).
  parse()d Biomolecules to have missing residues (e.g., residues 1-10) filled in (e.g., within chain B of 100d.cif) by dummy residues upon exporting an mmCIF file from a Biomolecule object. This specifically happens because sometimes authors of mmCIF files specify that residue indices should be monotonically increasing from the first chain to the last chain (e.g., residue indices 1-10 in chain A and residue indices 11-20 in chain B), and when this happens the standard AlphaFold 2-borrowed logic currently being used will treat residues 1-10 in chain B as "missing" and will add padding residues consequently. This will break future re-parsing of these (filtered) mmCIF files since the residue sequences e.g., in chain B will be incorrect from then on (@amorehead).
  remove_leaving_atoms filtering step (@amorehead).
  _pdbx_struct_assembly and _pdbx_struct_assembly_gen mmCIF categories as well. Note that _pdbx_struct_oper_list will be left unchanged, as there does not appear to be any harm in leaving it as is (although filtered file sizes might be slightly larger than minimally possible). This will ensure that future expansion of a (filtered) mmCIF's asymmetric units using these prescribed geometric transformations (per chain) is still possible after running the filtering script. In other words, this will make it possible to properly build bioassemblies (where certain chains may need to be duplicated via rotations and translations) after filtering out chains from each file (@amorehead).
  write_mmcif() to enable correct re-parsing of these (filtered) mmCIF files (@amorehead).
  7a4d and 8a3j (@amorehead).
  Biomolecule object (prior to calling to_mmcif()), reindex the residues of each chain to start at 1 and to be monotonically increasing to avoid downstream re-parsing errors (originally caused by exporting missing residues) for exported mmCIF files.
  _pdbx_struct_assembly and _pdbx_struct_assembly_gen categories in the (filtered) mmCIF files created by write_mmcif(), to ensure that the chain IDs included in these (exported) categories reflect the author chain IDs, not the original mmCIF (internal) chain IDs. Also handle the multi-mmCIF chain ID to single-author chain ID scenario in this implementation (@amorehead).
[x] mmCIF I/O functions within the mmCIF filtering script (scripts/filter_pdb_mmcifs.py)
  test_unfiltered_mmcif_object_parsing()) for the parse_mmcif_object() function (@amorehead).
  test_filtered_mmcif_object_parsing()) for the write_mmcif() function (@amorehead).
[x] PDB mmCIF clustering script (scripts/cluster_pdb_mmcifs.py)
  mmcif_parsing.parse function as well as the Biomolecule data structure (and its associated metadata) to make the sequence clustering logic (across the different residue types - e.g., RNA, proteins) much simpler and cleaner. Note that the current implementation is partially incomplete since I hit a point in the original implementation where the code was becoming much too messy and slow to be useful for clustering the full PDB dataset (@amorehead).
  clustalo to mmseqs to speed up protein, nucleic acid, and peptide sequence clustering (@amorehead).
[x] AlphaFold 3 dataloading
  mmcif_object and its all_atom_positions (on Line 47 of alphafold3_pytorch/data/data_pipeline.py) (@amorehead).
  Biomolecule data structure (perhaps, or more likely the PDBInput data structure). For example, it should be simple to build the four is_protein/rna/dna/ligand masks from the Biomolecule data structure's chemtype array (@amorehead).
  Biomolecule creation by allowing multiple chain IDs to be passed to the _from_mmcif_object function (@amorehead).
  WeightedPDBSampler that performs the sampling algorithm presented in Section 2.5.1 of the AlphaFold 3 supplement (@vandrw).
  Biomolecule object after chain/interface filtering it based on the WeightedPDBSampler's sampled chain/interface IDs.
  WeightedPDBSampler into the PDBDataset class to load (cropped) mmCIF training examples (@sj900).
  pdb_input_to_molecule_input function (@sj900).
  pdb_input_to_molecule_input function (@sj900 and @amorehead).
[ ] AlphaFold 3 genetic databases (@milot-mirdita)
[ ] AlphaFold 3 evaluation datasets
  .cif files for arbitrary sequence inputs using a trained model checkpoint (@dhuvik)
[ ] AlphaFold 3 distillation datasets (currently de-prioritized)
  PDBDataset into the Alphafold3LitModule for full training, validation, and testing support (@amorehead).
  ComputeModelSelectionScore module into the validation and test loops (@amorehead).
  PDBInputs (@amorehead)
  batch_size={1,2}) (@amorehead).
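
One of the filtering-script items above describes reindexing each chain's residues to start at 1 and increase monotonically before export, so that author numbering that continues across chains (1-10 in chain A, 11-20 in chain B) is not mistaken for missing residues. A minimal sketch of that idea follows. It uses plain NumPy arrays rather than the project's Biomolecule class, and the names chain_index and residue_index are assumptions made for illustration.

```python
import numpy as np


def reindex_residues_per_chain(chain_index, residue_index):
    """Renumber residues so every chain starts at 1 with no gaps.

    Both inputs are 1-D arrays of equal length (one entry per residue or per
    atom). Entries sharing an original residue index within a chain keep a
    shared new index. Insertion codes are not handled in this sketch.
    """
    chain_index = np.asarray(chain_index)
    residue_index = np.asarray(residue_index)
    new_index = np.empty_like(residue_index)
    for chain in np.unique(chain_index):
        mask = chain_index == chain
        # Rank the distinct original indices within this chain (smallest -> 0).
        _, ranks = np.unique(residue_index[mask], return_inverse=True)
        new_index[mask] = ranks + 1
    return new_index


# Author numbering that continues from chain 0 into chain 1.
renumbered = reindex_residues_per_chain([0, 0, 0, 1, 1, 1], [1, 2, 3, 11, 12, 13])
```

Note that ranking the indices also closes genuine gaps inside a chain, so any record of truly unresolved residues would need to be kept elsewhere if it matters downstream.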