adaptyvbio / ProteinFlow

Versatile computational pipeline for processing protein structure data for deep learning applications.
https://adaptyvbio.github.io/ProteinFlow/
BSD 3-Clause "New" or "Revised" License

Calling ProteinEntry.from_pickle(<path>).to_pdb(<target_path>) on the entire dataset reveals errors #132

Open ardagoreci opened 8 months ago

ardagoreci commented 8 months ago

Hi Liza,

While building a W&B table visualization for the entire dataset, I noticed that converting the pickle files to PDB files reveals multiple bugs.

First, I got an "UnpicklingError: unpickling stack underflow" from the line "protein_entry = ProteinEntry.from_pickle(pickle_path)". It did not happen with every protein. After I handled that exception, I found that PDBParser could not properly parse a few of the generated PDB files and raised an error at the line "structure = parser.get_structure(pdb_id, target_path)".

[Two screenshots attached: "Screen Shot 2024-01-27 at 23 26 16" and "Screen Shot 2024-01-27 at 23 26 55"]

elkoz commented 7 months ago

Could you please attach an example file that this fails on, or give its PDB ID, @ardagoreci?
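
One way to narrow down which files trigger the unpickling error is to try loading every file in the folder with the standard library and list the ones that fail. This is only a minimal sketch: the helper name find_unreadable_pickles is hypothetical, and it assumes the dataset files can be read with plain pickle.load, which this thread does not confirm. Since unpickling can execute arbitrary code, it should only be run on files from a trusted source.

```python
import os
import pickle


def find_unreadable_pickles(folder):
    """Return (filename, error message) pairs for files that fail to unpickle.

    Hypothetical helper: assumes every regular file in `folder` is meant to
    be a standard pickle file readable with pickle.load.
    """
    failures = []
    for filename in sorted(os.listdir(folder)):
        path = os.path.join(folder, filename)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        try:
            with open(path, "rb") as handle:
                pickle.load(handle)
        except Exception as error:
            # Record the file name and keep scanning instead of stopping
            failures.append((filename, f"{type(error).__name__}: {error}"))
    return failures
```

The returned file names could then be attached to the issue as examples of failing entries.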

elkoz commented 7 months ago

This code ran for me without any errors.

from proteinflow.data import ProteinEntry
from tqdm import tqdm
import os

folder = "data/proteinflow_20230102_stable/train"

for filename in tqdm(os.listdir(folder)):
    ProteinEntry.from_pickle(os.path.join(folder, filename)).to_pdb("tmp.pdb")
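
If the loop above does raise on some other copy of the dataset, a variant that catches the exception per file and keeps going makes it easier to report exactly which entries fail. The sketch below is an illustration, not part of ProteinFlow: convert_all is a hypothetical helper, and the conversion step is passed in as a callable so the same loop works for any per-file operation.

```python
import os


def convert_all(folder, convert):
    """Call convert(path) on every file in folder and collect failures.

    Hypothetical helper: returns a list of (filename, error description)
    pairs for the files on which `convert` raised an exception.
    """
    failures = []
    for filename in sorted(os.listdir(folder)):
        path = os.path.join(folder, filename)
        try:
            convert(path)
        except Exception as error:
            # Keep going so that one bad file does not hide the others
            failures.append((filename, repr(error)))
    return failures


# Possible usage with the calls from the loop above (assumes proteinflow
# is installed and `folder` points at the dataset split):
#
# from proteinflow.data import ProteinEntry
# failures = convert_all(
#     folder,
#     lambda path: ProteinEntry.from_pickle(path).to_pdb("tmp.pdb"),
# )
# for filename, error in failures:
#     print(filename, error)
```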