collinarnett / protein_gan

Implementation of "Generative Modeling for Protein Structures" by Namrata Anand and Po-Ssu Huang
GNU General Public License v3.0

Downloading Data #1

Closed collinarnett closed 4 years ago

collinarnett commented 4 years ago

Figuring out how to download the data is the first issue faced when trying to implement this paper. The original authors note in section 3.1 that the dataset was obtained from the Protein Data Bank and that the downloaded data was the 3D structure of each protein, which leads me to believe that they used the download option on the PDB website.

Although they haven't specified which download option they chose, we can infer from their wording that they used the Coordinates & Experimental Data option with the Structural Factors box ticked:

We chose to encode 3D structure as 2D pairwise distances between α-carbons on the protein backbone. This representation does not preserve information about the protein sequence (side chains) or the torsion angles of the polypeptide backbone, but preserves enough information to allow for structure recovery.

However, they do include the train and test data in their supplementary files, listed here.
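
For reference, the pairwise-distance encoding from the quoted passage can be sketched roughly as follows. This is a minimal sketch, not the authors' code: it assumes the structures are plain-text PDB files in the standard fixed-column layout, and the helper names `ca_coordinates` and `pairwise_distances` are made up for illustration. It ignores details such as multiple models, chains, and alternate locations.

```python
import math


def ca_coordinates(pdb_text):
    """Collect (x, y, z) for every alpha-carbon ATOM record in PDB-format text."""
    coords = []
    for line in pdb_text.splitlines():
        # Fixed-column PDB layout: atom name in columns 13-16,
        # x, y, z in columns 31-38, 39-46, 47-54 (1-indexed).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            coords.append((x, y, z))
    return coords


def pairwise_distances(coords):
    """Return the symmetric matrix of Euclidean distances between all points."""
    return [[math.dist(a, b) for b in coords] for a in coords]
```

Restricting to `ATOM` records keeps calcium ions (also named `CA`, but stored as `HETATM`) out of the matrix.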

collinarnett commented 4 years ago

The supplemental files are as follows:

model_architectures.txt
supplement.pdf
test_ids.txt
train_ids.txt

Now we can focus on automating the download process using the provided IDs before figuring out the model architecture or the supplement PDF.
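
A rough sketch of what that automation could look like is below. Several things here are assumptions rather than facts from the supplement: that `train_ids.txt` and `test_ids.txt` hold one entry per line, that the first four characters of each entry are the PDB ID, and that files are fetched from the RCSB file server using the URL pattern shown. The function names are hypothetical.

```python
import urllib.request
from pathlib import Path

# Assumption: RCSB serves each entry's PDB file at this URL pattern.
RCSB_URL = "https://files.rcsb.org/download/{pdb_id}.pdb"


def read_ids(text):
    """Parse an ID list, assuming one entry per line.

    Assumption: the first four characters of each entry are the PDB ID;
    anything after that (e.g. a chain suffix) is dropped.
    """
    ids = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            ids.append(line[:4].upper())
    return ids


def pdb_url(pdb_id):
    """Build the download URL for a single PDB ID."""
    return RCSB_URL.format(pdb_id=pdb_id)


def download_all(ids, out_dir):
    """Download every ID into out_dir, skipping files that already exist."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdb_id in ids:
        target = out / f"{pdb_id}.pdb"
        if not target.exists():
            urllib.request.urlretrieve(pdb_url(pdb_id), str(target))
```

Usage would be along the lines of `download_all(read_ids(Path("train_ids.txt").read_text()), "data/train")`. Skipping existing files makes the script safe to re-run after a failed download.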

collinarnett commented 4 years ago

Here is a very helpful link for downloading these files programmatically.

collinarnett commented 4 years ago

Finished downloading the data and uploaded it to my personal AWS for easy downloads later. TODO: find a place to publicly host the compressed train and test sets.
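
For anyone reproducing the compression step, a minimal sketch using only the standard library is below. The directory layout and the `compress_dir` helper are hypothetical, not what was actually used for the upload.

```python
import tarfile
from pathlib import Path


def compress_dir(src_dir, archive_path):
    """Pack src_dir into a gzip-compressed tar archive and return its path."""
    src = Path(src_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps only the directory name, not the full local path.
        tar.add(str(src), arcname=src.name)
    return str(archive_path)
```

For example, `compress_dir("data/train", "train.tar.gz")` would produce an archive whose top-level folder is `train`.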