BioinfoMachineLearning / DIPS-Plus

The Enhanced Database of Interacting Protein Structures for Interface Prediction
https://zenodo.org/record/5134732
GNU General Public License v3.0

About original PDB files #13

Closed by onlyonewater 2 years ago

onlyonewater commented 2 years ago

Can you provide the original PDB files for the DIPS-Plus dataset?

amorehead commented 2 years ago

Hi, @onlyonewater.

Since we derive structures for DIPS-Plus from the RCSB PDB using its biounit FTP server directory, the 'original' PDB files come from the PDB as tar.gz archives. Each of these archives contains PDB files that represent at least one protein chain, which are subsequently combined to form 'binary' (two-chain) complexes. The most direct way to get these 'original' PDB files for DIPS-Plus would most likely be to download the biounit PDB files directly from the RCSB and then filter them down to only those contained in DIPS-Plus. Our from-scratch build instructions for DIPS-Plus will walk you through how to do this with rsync and our extract_raw_pdb_gz_archives.py and make_dataset.py scripts (https://github.com/BioinfoMachineLearning/DIPS-Plus#how-to-compile-dips-plus-from-scratch).
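As a rough illustration of the "filter them down" step, here is a minimal Python sketch. It is not the repository's make_dataset.py logic. It assumes the downloaded biounit archives are named like `<pdb code>.pdb<N>.gz` (for example `10gs.pdb1.gz`) and that you already have a collection of the PDB codes used in DIPS-Plus. The function name `filter_biounit_files` and the `pdb_ids` argument are hypothetical.

```python
from pathlib import Path


def filter_biounit_files(raw_dir, pdb_ids):
    """Yield .gz paths under raw_dir whose leading PDB code is in pdb_ids."""
    # Compare case-insensitively, since PDB codes are often written in upper case.
    wanted = {code.lower() for code in pdb_ids}
    for path in sorted(Path(raw_dir).rglob("*.gz")):
        # Assumed naming scheme: "<pdb code>.pdb<N>.gz", e.g. "10gs.pdb1.gz".
        pdb_code = path.name.split(".")[0].lower()
        if pdb_code in wanted:
            yield path


# Example usage (paths and codes are placeholders):
# for archive in filter_biounit_files("./raw", {"10gs", "1a02"}):
#     print(archive)
```

If your files follow a different naming scheme, only the line that extracts `pdb_code` should need to change.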

Alternatively, if it suits your use case, you may also consider downloading our provided raw .dill files, which contain the atom-level structural details for each DIPS-Plus complex. These .dill files can be downloaded from our Zenodo link for DIPS-Plus (https://zenodo.org/record/5134732/files/final_raw_dips.tar.gz?download=1). If you choose this latter route, for reference, below is an example of what the contents of one of these .dill files look like.

[Screenshot: example contents of a DIPS-Plus .dill file]

I hope this information helps!

onlyonewater commented 2 years ago

What information does the .dill file contain?

amorehead commented 2 years ago

@onlyonewater,

As you can see in my screenshot above, these .dill files contain metadata describing each protein complex in the dataset. For example, each .dill file contains df0 and df1 keys that respectively store Pandas DataFrames representing Chain 1 and Chain 2 in a binary protein complex. Each of these Pandas DataFrames holds important structural information such as each chain's ATOM entries from their original PDB files. In this way, one can adapt these .dill files for other related tasks they would like to perform.
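For anyone who wants to inspect these files programmatically, a minimal sketch follows. It assumes the third-party `dill` package is installed and that the loaded object exposes `df0` and `df1` either as dictionary keys or as attributes. The thread does not say which, so the helper accepts both. The names `load_complex` and `get_chain_tables` and the example path are hypothetical.

```python
from collections.abc import Mapping


def load_complex(path):
    """Load one DIPS-Plus .dill file (needs the third-party `dill` package)."""
    import dill  # imported lazily so get_chain_tables works without it

    with open(path, "rb") as handle:
        return dill.load(handle)


def get_chain_tables(complex_obj):
    """Return (df0, df1), whether stored as dict keys or as attributes."""
    if isinstance(complex_obj, Mapping):
        return complex_obj["df0"], complex_obj["df1"]
    return complex_obj.df0, complex_obj.df1


# Example usage (the path is a placeholder):
# complex_obj = load_complex("path/to/some_complex.dill")
# chain_1, chain_2 = get_chain_tables(complex_obj)
# print(chain_1.head())  # first few ATOM rows of Chain 1
```

Only unpickle .dill files from a source you trust, since loading them can execute arbitrary code.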

onlyonewater commented 2 years ago

ok, I get it, thanks!!

onlyonewater commented 2 years ago

When I try to download the raw PDB files like this:

rsync -rlpt -v -z --delete --port=33444 --include='*.gz' --include='*.xz' --include='*/' --exclude '*' \
rsync.rcsb.org::ftp_data/biounit/coordinates/divided/ ./raw

I get this error:

rsync: getaddrinfo: rsync.rcsb.org 33444: Name or service not known
rsync error: error in socket IO (code 10) at clientserver.c(126) [Receiver=3.1.2]
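The `getaddrinfo ... Name or service not known` message means the hostname could not be resolved to an address, so rsync never got as far as connecting. A small Python check like the sketch below can confirm whether name resolution is the problem on a given machine. The helper name `can_resolve` is hypothetical, and the check cannot tell whether a failure comes from local DNS or network settings or from the hostname itself.

```python
import socket


def can_resolve(host, port):
    """Return True if the system resolver maps host:port to at least one address."""
    try:
        return len(socket.getaddrinfo(host, port)) > 0
    except socket.gaierror:
        # Same class of failure that rsync reports as "Name or service not known".
        return False


# Example usage, with the host and port from the rsync command above:
# print(can_resolve("rsync.rcsb.org", 33444))
```

If this prints False, the rsync command will fail in the same way regardless of its other options.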

onlyonewater commented 2 years ago

@amorehead

onlyonewater commented 2 years ago

ok i get it.