awfderry / COLLAPSE

Representation learning for protein functional site analysis
MIT License

Prepare the pickle file using a subset of PDB files? #11

Open BinhongLiu opened 1 year ago

BinhongLiu commented 1 year ago

Hi! Could I prepare the pickle file using a subset of PDB files, so that I can search for a functional site against that subset of the PDB database? I tried the embed_pdb_dataset.py script, but it only produced an LMDB format file. Thanks!

BinhongLiu commented 1 year ago

Hi, sorry to bother you again. I'd like to prepare a pickle file containing a residue embedding database built from a pocket database (http://bioinfo-pharma.u-strasbg.fr/scPDB/), and then annotate my protein structures against this pocket database using annotate_pdb.py. It seems that I need to prepare both the pickle file and background_stats.tar.gz, right? Could you help me with this? Thanks

awfderry commented 1 year ago

Hi, you can use the script lmdb_to_pkl.py to convert the LMDB format to pickle format. For large datasets, it may help to process this in multiple splits and combine the resulting pickle files (using the --split_id and --num_splits arguments).
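Combining the per-split pickle files could look roughly like the sketch below. It assumes each split pickle holds a dict that maps the same keys to lists (or arrays) of per-residue entries; that layout is a guess and is not confirmed by the repository, so inspect one split file first and adapt the merge accordingly. The function name merge_pickles and the file names are hypothetical.

```python
import pickle


def merge_pickles(split_paths, out_path):
    """Concatenate per-split pickle files into a single pickle.

    Assumption (not confirmed by the COLLAPSE repo): every split file
    contains a dict whose values are list-like, with the same keys in
    every split.
    """
    merged = {}
    for path in split_paths:
        with open(path, "rb") as f:
            part = pickle.load(f)
        for key, values in part.items():
            # Append this split's entries, preserving split order.
            merged.setdefault(key, []).extend(values)
    with open(out_path, "wb") as f:
        pickle.dump(merged, f)
    return merged


# Hypothetical usage, with one output file per --split_id value:
# merge_pickles(["split_0.pkl", "split_1.pkl"], "combined.pkl")
```

If the split files store something other than a dict of lists (for example a single array plus metadata), only the inner loop needs to change.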

awfderry commented 1 year ago

For a custom database such as scPDB, you can create the dataset using scripts/functional_database.py (you may have to update the dataset class in line 23 from SiteDataset to accommodate your specific data format). For the background, you can either use the pre-computed background embeddings from PDB100 (recommended as a starting point) or compute your own background distributions from a dataset of PDB files using scripts/compute_background.py.