collinarnett / protein_gan

Implementation of "Generative Modeling for Protein Structures" by Namrata Anand and Po-Ssu Huang
GNU General Public License v3.0

Format Data #2

Closed collinarnett closed 4 years ago

collinarnett commented 4 years ago

Now that we have all the data, the next step is to ingest it into our model. Since the dataset is very large (98 GB), this will be considerably challenging. The data must also be transformed into "maps", as the authors call them:

We chose to encode 3D structure as 2D pairwise distances between α-carbons on the protein backbone. This representation does not preserve information about the protein sequence (side chains) or the torsion angles of the polypeptide backbone, but preserves enough information to allow for structure recovery. We refer henceforth to these pairwise α-carbon distance matrices as "maps."

This will also be challenging, as this is the only information provided by the original authors on how they preprocessed their data.
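For my own reference, a minimal sketch of how a map could be computed from an (N, 3) array of α-carbon coordinates (the function and variable names below are my own, not from the paper):

```python
import numpy as np

def distance_map(ca_coords):
    """Pairwise alpha-carbon distance matrix ("map").

    ca_coords: (N, 3) array of alpha-carbon coordinates in angstroms.
    Returns an (N, N) symmetric matrix of Euclidean distances.
    """
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]  # (N, N, 3) coordinate differences
    return np.sqrt((diff ** 2).sum(axis=-1))               # (N, N) Euclidean distances
```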

collinarnett commented 4 years ago

Useful links: PDB Data Structure http://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/introduction

Biopython looks promising for parsing PDB files https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ

Pretty sure this will create our maps. Needs to be tested https://github.com/wes-kosater/PDB-Distance-Finder/blob/master/PDB-Distance-Finder.py
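If the Biopython route works out, extracting the α-carbon coordinates should look roughly like this (untested sketch; the function name is mine):

```python
import numpy as np
from Bio.PDB import PDBParser

def ca_coordinates(pdb_path):
    """Extract alpha-carbon coordinates from a PDB file with Biopython."""
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure("protein", pdb_path)
    coords = []
    for residue in structure[0].get_residues():  # first model only
        if "CA" in residue:                      # skip residues with no alpha-carbon
            coords.append(residue["CA"].get_coord())
    return np.array(coords)
```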

collinarnett commented 4 years ago

Downloaded distance finder and explored it for a bit. Seems to extract the data we need correctly. Still need to figure out how to feed the data to the GAN.
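One idea for feeding the maps in, assuming a PyTorch-style training loop (the class below is just an illustration, not something in the repo yet):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class MapDataset(Dataset):
    """Wraps an array of pairwise-distance maps for a GAN training loop."""

    def __init__(self, maps):
        # maps: (num_samples, L, L) numpy array of distance matrices
        self.maps = maps

    def __len__(self):
        return len(self.maps)

    def __getitem__(self, idx):
        # add a channel axis so each map looks like a 1-channel image
        return torch.from_numpy(self.maps[idx]).float().unsqueeze(0)

# e.g. loader = DataLoader(MapDataset(train_maps), batch_size=64, shuffle=True)
```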

collinarnett commented 4 years ago

I think the current Jupyter notebook generates what we need but the paper also mentions:

non-overlapping fragments of lengths 16, 64, and 128

So now I have to figure out how to make these.
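My current reading is that these are windows of the chain taken along the diagonal of the full map, so something like this might do it (the interpretation and names are mine; the paper doesn't spell it out):

```python
import numpy as np

def fragment_maps(full_map, fragment_length):
    """Cut an (N, N) distance map into non-overlapping
    (fragment_length, fragment_length) blocks along the diagonal."""
    n = full_map.shape[0]
    fragments = []
    for start in range(0, n - fragment_length + 1, fragment_length):
        stop = start + fragment_length
        fragments.append(full_map[start:stop, start:stop])
    return np.array(fragments)

# e.g. fragments_64 = fragment_maps(full_map, 64)
```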

collinarnett commented 4 years ago

Trying to convert the train and test numpy arrays to an HDF5 file for easier reads when feeding the model. The current barrier is that the train set takes a long time to generate, since there are 100,000-plus files to iterate through. I've used Python's multiprocessing Pool to speed things up, but that doesn't seem to have helped much. See screenshot.
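For reference, the rough shape of what the conversion script is doing (the paths, dataset name, and per-file map function below are placeholders; all maps have to share one shape for the stack to work):

```python
import h5py
import numpy as np
from multiprocessing import Pool

def build_hdf5(pdb_paths, out_path, map_fn, workers=8):
    """Compute a map for every PDB file in parallel and write the
    stacked result to a single HDF5 dataset.

    map_fn: callable taking a PDB path and returning an (L, L) map.
    """
    with Pool(workers) as pool:
        maps = pool.map(map_fn, pdb_paths)
    with h5py.File(out_path, "w") as f:
        f.create_dataset("train", data=np.stack(maps))
```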

The other problem is that I've noticed the number of samples differs depending on the map resolution, which means there is an error in the function written for map investigation. I'll have to look into that after building the initial HDF5 file.

collinarnett commented 4 years ago

Fixed the issue with Pool not working, and now all cores are used. When running in a Jupyter notebook, the child processes of the main function don't get garbage collected properly; I don't know how to fix that yet, but I will at some point.
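One thing I still want to try for the notebook cleanup is closing and joining the pool explicitly (or using it as a context manager, as in the sketch above) so the child processes are released when the cell finishes. A minimal, untested sketch with a stand-in worker:

```python
from multiprocessing import Pool

def square(x):
    # stand-in worker; the real job is the per-PDB map computation
    return x * x

pool = Pool(processes=8)
try:
    results = pool.map(square, range(100))
finally:
    pool.close()  # stop accepting new work
    pool.join()   # wait for workers to exit so they don't linger in the notebook
```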

Completed dataset creation, although there are a few missing maps and I'm not sure what to do about them. I think my script is complete in its functionality, though it could be better in terms of memory usage, so this issue is done for now.