dauparas / ProteinMPNN

Code for the ProteinMPNN paper
MIT License

scripts to generate PDBID_CHAINID.pt and PDBID.pt #42

Closed: johnnyp117 closed this issue 1 year ago

johnnyp117 commented 1 year ago

I'd like to train with different PDB files and was wondering whether there is a script that was used to auto-generate the two .pt files mentioned in the issue title. Thanks :).

smilenaderi commented 1 year ago

Yes, I really need it too. Did you manage to solve it? If so, could you please share? Thanks very much.

smilenaderi commented 1 year ago

@decarboxy are there any updates on this? Could you please share your code for preparing the data? Thanks.

dauparas commented 1 year ago

I added the script for making PDBID_CHAINID.pt and PDBID.pt: https://github.com/dauparas/ProteinMPNN/blob/main/training/parse_cif_noX.py

smilenaderi commented 1 year ago

Thank you so much for sharing!

adrienchaton commented 2 weeks ago

@dauparas thanks a lot for sharing this script!

Some related discussion is happening here: https://github.com/dauparas/ProteinMPNN/issues/54

Currently, if I download a .cif from the PDB website or from the AlphaFold DB, it gets parsed properly by https://github.com/dauparas/ProteinMPNN/blob/main/training/parse_cif_noX.py#L264

However, my data is in .pdb format. If I convert the .pdb files to .cif (e.g. using BioPython or PyMOL), parse_mmcif returns empty dicts for both chains and metadata.

Could anyone help with preparing training data from .pdb files? Alternatively, what conversion to .cif is compatible with parse_mmcif, so that I can process my dataset?
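One way to narrow this down might be to compare which mmCIF categories appear in a file that parses correctly against those in a converted file. The sketch below is only a diagnostic. It rests on the assumption (not verified against parse_cif_noX.py) that the converted file lacks categories the parser looks for. `cif_categories` and `missing_categories` are hypothetical helper names and are not part of ProteinMPNN.

```python
import re

# Matches a data name such as "_atom_site.group_PDB" at the start of a line
# and captures the category part ("_atom_site").
_CATEGORY_RE = re.compile(r"^(_[^.\s]+)\.")


def cif_categories(cif_text):
    """Return the set of mmCIF category names found in the given text."""
    categories = set()
    for line in cif_text.splitlines():
        match = _CATEGORY_RE.match(line)
        if match:
            categories.add(match.group(1))
    return categories


def missing_categories(reference_text, converted_text):
    """List categories present in the reference file but absent after conversion."""
    return sorted(cif_categories(reference_text) - cif_categories(converted_text))
```

To use it, read a .cif downloaded from the PDB and a converted .cif into strings and pass them as `reference_text` and `converted_text`. Any categories it reports are candidates to check against what parse_mmcif reads.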

hjistb commented 1 week ago

I think we can use this function to parse the .pdb files, and its output could then be passed to the StructureDataset function here.
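As a rough illustration of the kind of parsing involved, here is a minimal standard-library sketch that pulls per-chain sequences and backbone coordinates out of ATOM records. It is not the function linked above, and the output layout is an assumption rather than what StructureDataset expects. `parse_pdb_backbone` and the dictionary keys are hypothetical.

```python
# Standard three-letter to one-letter codes for the 20 canonical amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}
BACKBONE = ("N", "CA", "C", "O")


def parse_pdb_backbone(pdb_text):
    """Collect sequence and backbone coordinates per chain from PDB ATOM records.

    Returns {chain_id: {"seq": str, "coords": {atom_name: [(x, y, z) or None]}}}.
    This layout is illustrative only.
    """
    chains = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue  # skip HETATM, TER, headers, etc.
        atom_name = line[12:16].strip()
        if atom_name not in BACKBONE:
            continue
        if line[16] not in (" ", "A"):
            continue  # keep only the first alternate location
        chain_id = line[21]
        residue_key = (line[22:26], line[26])  # residue number + insertion code
        residues = chains.setdefault(chain_id, {})
        residue = residues.setdefault(residue_key, {"name": line[17:20], "atoms": {}})
        residue["atoms"][atom_name] = (
            float(line[30:38]), float(line[38:46]), float(line[46:54]),
        )

    parsed = {}
    for chain_id, residues in chains.items():
        ordered = list(residues.values())  # dicts keep file order
        seq = "".join(THREE_TO_ONE.get(r["name"], "X") for r in ordered)
        coords = {a: [r["atoms"].get(a) for r in ordered] for a in BACKBONE}
        parsed[chain_id] = {"seq": seq, "coords": coords}
    return parsed
```

Non-canonical residues are written as "X" and missing backbone atoms as `None`. Whether such residues should instead be dropped or masked depends on what the downstream loader expects.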

adrienchaton commented 1 week ago

Thanks! I will look into that and see whether I can customize the training script to load directly from .pdb files.