onlyonewater closed 2 years ago
Hi, @onlyonewater.
Since we derive structures for DIPS-Plus from the RCSB PDB using its biounit FTP server directory, the 'original' PDB files come from the PDB as tar.gz archives. Each of these archives contains PDB files that represent at least one protein chain, which are subsequently combined to form 'binary' (two-chain) complexes. The most direct way to get these 'original' PDB files for DIPS-Plus would most likely be to download the biounit PDB files directly from the RCSB and then filter them down to only those contained in DIPS-Plus. Our from-scratch build instructions for DIPS-Plus walk you through how to do this with rsync and our extract_raw_pdb_gz_archives.py and make_dataset.py scripts (https://github.com/BioinfoMachineLearning/DIPS-Plus#how-to-compile-dips-plus-from-scratch).
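To illustrate the "filter them down" step, here is a minimal Python sketch. It is not the logic of the repository's scripts, which remain the supported route. It assumes the downloaded biounit files are named like `1abc.pdb1.gz` and that you already have a collection of the PDB IDs used in DIPS-Plus; the function name `filter_biounit_files` and the file-name pattern are hypothetical.

```python
import re

# Assumed file-name pattern for biounit files, e.g. "1abc.pdb1.gz".
# Adjust this if your downloaded files are named differently.
BIOUNIT_NAME = re.compile(r"^(?P<pdb_id>[0-9a-z]{4})\.pdb\d+\.gz$", re.IGNORECASE)


def filter_biounit_files(filenames, wanted_ids):
    """Keep only the file names whose PDB ID appears in wanted_ids."""
    wanted = {pdb_id.lower() for pdb_id in wanted_ids}
    kept = []
    for name in filenames:
        match = BIOUNIT_NAME.match(name)
        # Skip anything that does not look like a biounit file.
        if match and match.group("pdb_id").lower() in wanted:
            kept.append(name)
    return kept
```

The comparison is case-insensitive, so IDs can be supplied in either upper or lower case.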
Alternatively, if it suits your use case, you may also consider downloading our provided raw .dill files, which contain the atom-structural details for each DIPS-Plus complex. These .dill files can be downloaded from our Zenodo link for DIPS-Plus (https://zenodo.org/record/5134732/files/final_raw_dips.tar.gz?download=1). If you choose this latter route, for reference, below is an example of what the inner contents of one of these .dill files look like.
I hope this information helps!
I would like to know what information the .dill file contains.
@onlyonewater,
As you can see in my screenshot above, these .dill files contain metadata describing each protein complex in the dataset. For example, each .dill file contains df0 and df1 keys that respectively store Pandas DataFrames representing Chain 1 and Chain 2 in a binary protein complex. Each of these Pandas DataFrames holds important structural information, such as each chain's ATOM entries from its original PDB file. In this way, one can adapt these .dill files for other related tasks.
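As a rough sketch of how one of these files could be inspected in Python: the code below assumes the third-party `dill` package is installed and that the loaded object exposes `df0` and `df1` either as dictionary keys or as attributes. The helper names `load_complex` and `get_chain_frames` are hypothetical. As with any pickle-style format, loading may also require the packages whose classes were serialized to be importable, and only files from a trusted source should be loaded.

```python
def load_complex(path):
    """Load one DIPS-Plus .dill file from disk."""
    import dill  # third-party; imported lazily so the helper below works without it

    with open(path, "rb") as handle:
        return dill.load(handle)


def get_chain_frames(obj):
    """Return (df0, df1), i.e. the DataFrames for Chain 1 and Chain 2.

    Assumption: obj is either a mapping with "df0"/"df1" keys
    or an object with df0/df1 attributes.
    """
    if isinstance(obj, dict):
        return obj["df0"], obj["df1"]
    return obj.df0, obj.df1


# Example usage (the path is a placeholder, not a real file name):
# pair = load_complex("path/to/some_complex.dill")
# chain1, chain2 = get_chain_frames(pair)
# print(chain1.head())
```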
ok, I get it, thanks!!
When I try to download the raw PDB files like this:
rsync -rlpt -v -z --delete --port=33444 --include='*.gz' --include='*.xz' --include='*/' --exclude '*' \
rsync.rcsb.org::ftp_data/biounit/coordinates/divided/ ./raw
an error shows:
rsync: getaddrinfo: rsync.rcsb.org 33444: Name or service not known
rsync error: error in socket IO (code 10) at clientserver.c(126) [Receiver=3.1.2]
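The "Name or service not known" message from getaddrinfo means the host name could not be resolved on the machine running rsync. A small sketch for checking name resolution from Python with only the standard library is below; the function name `can_resolve` is hypothetical, and the check says nothing about whether the rsync service itself is reachable on that port.

```python
import socket


def can_resolve(host, port):
    """Return True if the host name resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(host, port)) > 0
    except socket.gaierror:
        # Raised when name resolution fails, e.g. "Name or service not known".
        return False


# Example usage with the host and port from the command above:
# print(can_resolve("rsync.rcsb.org", 33444))
```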
@amorehead
ok i get it.
Can you provide the original PDB files for the DIPS-Plus dataset?