dauparas / ProteinMPNN

Code for the ProteinMPNN paper
MIT License
934 stars 284 forks

training with a custom dataset #54

Open adrienchaton opened 1 year ago

adrienchaton commented 1 year ago

Hi @dauparas

Thanks for the great work. I am already using ProteinMPNN and would like to train or fine-tune it on a custom dataset. Are there any scripts for processing PDB files into the format expected by your training script?

For PDBID_CHAINID.pt, I am not sure how to create the values for mask, bfac and occ.

Also, some metadata will not always be available, since I would like to use a mixture of experimental and predicted structures as input. Can some metadata be missing from the training files?

Thanks for any hints!

lcesaire3 commented 2 months ago

https://github.com/dauparas/ProteinMPNN/blob/main/training/parse_cif_noX.py could help?

MarjanHJ commented 1 month ago

@adrienchaton I was wondering if you found an answer to this? I also want to train the network on custom data and am having trouble preparing the data for training. My main issue is the shape of xyz, the atomic coordinates [L,14,3]: I cannot figure out where the 14 comes from or how to populate it. I looked at https://github.com/dauparas/ProteinMPNN/blob/main/training/parse_cif_noX.py but it was not helpful.
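For anyone landing here with the same question, here is a minimal sketch of one way the per-residue arrays could be filled from fixed-column PDB ATOM records. It is not the repository's parser. Several things in it are assumptions: that mask, bfac and occ are [L,14] arrays aligned with xyz, that missing atoms are filled with 0.0, and that atoms go into slots in the order listed in `ATOM_SLOTS`. That table is an illustrative three-residue subset; the real order should be taken from the table in training/parse_cif_noX.py. The names `featurize_chain` and `atom_line` are hypothetical, and converting the nested lists to tensors and saving them as a .pt file is left out.

```python
MAX_ATOMS = 14  # per-residue slot count, matching the [L,14,3] shape

# Illustrative subset only (assumption): check the real slot order in
# training/parse_cif_noX.py before using this for training data.
ATOM_SLOTS = {
    "GLY": ["N", "CA", "C", "O"],
    "ALA": ["N", "CA", "C", "O", "CB"],
    "SER": ["N", "CA", "C", "O", "CB", "OG"],
}
THREE_TO_ONE = {"GLY": "G", "ALA": "A", "SER": "S"}


def featurize_chain(pdb_lines, chain_id):
    """Build seq, xyz [L][14][3] and mask, bfac, occ [L][14] for one chain."""
    residues = {}  # (resSeq, iCode) -> (resName, {atom name: (x, y, z, occ, b)})
    for line in pdb_lines:
        if not line.startswith("ATOM") or line[21] != chain_id:
            continue
        if line[16] not in (" ", "A"):  # keep a single alternate location
            continue
        key = (int(line[22:26]), line[26])
        fields = [line[30:38], line[38:46], line[46:54], line[54:60], line[60:66]]
        atoms = residues.setdefault(key, (line[17:20], {}))[1]
        atoms[line[12:16].strip()] = tuple(float(f) for f in fields)

    seq, xyz, mask, bfac, occ = "", [], [], [], []
    for resname, atoms in residues.values():  # dicts keep file order
        if resname not in ATOM_SLOTS:
            continue  # residue types outside the illustrative table are skipped
        row_xyz = [[0.0, 0.0, 0.0] for _ in range(MAX_ATOMS)]
        row_mask = [False] * MAX_ATOMS
        row_b = [0.0] * MAX_ATOMS
        row_occ = [0.0] * MAX_ATOMS
        for slot, name in enumerate(ATOM_SLOTS[resname]):
            if name in atoms:  # atoms absent from the file stay masked out
                x, y, z, o, b = atoms[name]
                row_xyz[slot] = [x, y, z]
                row_mask[slot] = True
                row_occ[slot] = o
                row_b[slot] = b
        seq += THREE_TO_ONE[resname]
        xyz.append(row_xyz)
        mask.append(row_mask)
        bfac.append(row_b)
        occ.append(row_occ)
    return {"seq": seq, "xyz": xyz, "mask": mask, "bfac": bfac, "occ": occ}


def atom_line(serial, name, resname, chain, resseq, x, y, z, occ, b):
    """Format one fixed-column PDB ATOM record (demo helper)."""
    return (f"ATOM  {serial:5d} {name:<4s} {resname:3s} {chain}{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}")


demo = [
    atom_line(1, "N", "GLY", "A", 1, 0.0, 0.0, 0.0, 1.0, 20.0),
    atom_line(2, "CA", "GLY", "A", 1, 1.5, 0.0, 0.0, 1.0, 21.0),
    atom_line(3, "N", "ALA", "A", 2, 3.0, 1.0, 0.0, 1.0, 22.0),
    atom_line(4, "CB", "ALA", "A", 2, 4.0, 2.0, 1.0, 0.5, 30.0),
]
entry = featurize_chain(demo, "A")
print(entry["seq"], entry["mask"][0][:4])
```

In this sketch the mask simply records which slots were found in the file, while occ and bfac are copied from the occupancy and temperature-factor columns of each ATOM record. Slots a residue type does not use stay masked out.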

anar-rzayev commented 3 weeks ago

Any updates?

adrienchaton commented 3 weeks ago

Hi all and thanks @lcesaire3 for the pointer.

@MarjanHJ if you look at https://github.com/dauparas/ProteinMPNN/blob/main/training/parse_cif_noX.py#L45, you can see that among the 20 natural amino acids, TRP has the most heavy atoms, 14, and I assume that is why the input features allow up to (14,3) coordinates per residue.
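The heavy-atom count is easy to double-check. The sketch below lists the standard PDB heavy-atom names for the 20 natural amino acids and finds the largest residue. The table is written out from the standard naming convention, not copied from parse_cif_noX.py, so the atom order within each residue is an assumption. Hydrogens and the terminal OXT are excluded.

```python
# Standard PDB heavy-atom names per residue (no hydrogens, no terminal OXT).
# The order within each list is an assumption, not taken from the repository.
BACKBONE = ["N", "CA", "C", "O"]
SIDE_CHAINS = {
    "ALA": ["CB"],
    "ARG": ["CB", "CG", "CD", "NE", "CZ", "NH1", "NH2"],
    "ASN": ["CB", "CG", "OD1", "ND2"],
    "ASP": ["CB", "CG", "OD1", "OD2"],
    "CYS": ["CB", "SG"],
    "GLN": ["CB", "CG", "CD", "OE1", "NE2"],
    "GLU": ["CB", "CG", "CD", "OE1", "OE2"],
    "GLY": [],
    "HIS": ["CB", "CG", "ND1", "CD2", "CE1", "NE2"],
    "ILE": ["CB", "CG1", "CG2", "CD1"],
    "LEU": ["CB", "CG", "CD1", "CD2"],
    "LYS": ["CB", "CG", "CD", "CE", "NZ"],
    "MET": ["CB", "CG", "SD", "CE"],
    "PHE": ["CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ"],
    "PRO": ["CB", "CG", "CD"],
    "SER": ["CB", "OG"],
    "THR": ["CB", "OG1", "CG2"],
    "TRP": ["CB", "CG", "CD1", "CD2", "NE1", "CE2", "CE3", "CZ2", "CZ3", "CH2"],
    "TYR": ["CB", "CG", "CD1", "CD2", "CE1", "CE2", "CZ", "OH"],
    "VAL": ["CB", "CG1", "CG2"],
}

HEAVY_ATOMS = {res: BACKBONE + side for res, side in SIDE_CHAINS.items()}
counts = {res: len(atoms) for res, atoms in HEAVY_ATOMS.items()}
largest = max(counts, key=counts.get)
print(largest, counts[largest])  # TRP 14
```

Every other residue has fewer heavy atoms (GLY has only the 4 backbone atoms), so the remaining slots of a fixed 14-wide array would be unused for those residues.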

I am actually getting interested in fine-tuning pMPNN again, so I will try to work out the data preparation for my PDBs of interest and can hopefully run the training code after that.

Thanks for sharing. I would be interested to hear if anyone else has made progress on this topic; right now I don't have more to share.

hjistb commented 1 week ago

I think we can use this function to parse the PDB files, and its output could be passed to the StructureDataset function here.