Closed collinarnett closed 4 years ago
Useful links: PDB Data Structure http://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/introduction
Biopython looks promising for parsing PDB files https://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ
Pretty sure this will create our maps. Needs to be tested https://github.com/wes-kosater/PDB-Distance-Finder/blob/master/PDB-Distance-Finder.py
Downloaded distance finder and explored it for a bit. Seems to extract the data we need correctly. Still need to figure out how to feed the data to the GAN.
I think the current Jupyter notebook generates what we need but the paper also mentions:
"non-overlapping fragments of lengths 16, 64, and 128"
So now I have to figure out how to make these.
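A rough sketch of one way the fragments could be made, assuming a chain is available as an (N, 3) NumPy array of per-residue coordinates and that a "map" is the pairwise distance matrix of a fragment. The paper quote above does not spell either of these out, so both are assumptions, and `split_into_fragments` and `pairwise_distance_map` are hypothetical helper names, not anything from the notebook or from PDB-Distance-Finder. Residues left over at the end of a chain are simply dropped here.

```python
import numpy as np


def split_into_fragments(coords, length):
    """Cut an (N, 3) coordinate array into non-overlapping windows of `length` rows.

    Assumption: a trailing remainder shorter than `length` is discarded.
    """
    n_full = len(coords) // length
    return [coords[i * length:(i + 1) * length] for i in range(n_full)]


def pairwise_distance_map(coords):
    """Return the (L, L) matrix of Euclidean distances between all rows of `coords`."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


if __name__ == "__main__":
    # Fake 150-residue chain, used only to show the shapes involved.
    rng = np.random.default_rng(0)
    chain = rng.normal(size=(150, 3))
    for length in (16, 64, 128):
        frags = split_into_fragments(chain, length)
        maps = [pairwise_distance_map(f) for f in frags]
        print(length, len(maps), maps[0].shape)
```

With this scheme a chain shorter than the fragment length contributes no fragments at all, so whether to skip, pad, or otherwise handle short chains is still an open choice.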
Trying to convert the train and test numpy arrays to an hdf5 file so they are easier to read when feeding the model. The current barrier is that the train set takes very long to generate, since there are more than 100,000 files to iterate through. I've used Python's pool library to make the process faster, but that doesn't seem to have helped much. See screenshot.
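For reference, a minimal sketch of spreading per-file work over worker processes with the standard library's `multiprocessing.Pool`. I'm assuming that is the pool meant above. `count_atom_records` is a hypothetical stand-in for the real per-file parsing, and the `workers` and `chunksize` values are placeholders, not tuned numbers.

```python
from multiprocessing import Pool


def count_atom_records(pdb_text):
    """Count ATOM records in the text of one PDB file (stand-in for the real work)."""
    return sum(1 for line in pdb_text.splitlines() if line.startswith("ATOM"))


def count_all(pdb_texts, workers=4, chunksize=256):
    """Run count_atom_records over many inputs using a pool of worker processes."""
    # chunksize hands each worker a batch of tasks at a time, which cuts
    # inter-process overhead when there are many small tasks.
    with Pool(processes=workers) as pool:
        return list(pool.imap(count_atom_records, pdb_texts, chunksize=chunksize))


if __name__ == "__main__":
    sample = (
        "ATOM      1  N   MET A   1\n"
        "ATOM      2  CA  MET A   1\n"
        "HETATM    3  O   HOH A   2\n"
    )
    counts = count_all([sample] * 1000, workers=2)
    print(len(counts), sum(counts))
```

The worker function is defined at module level because the pool has to be able to pickle it. Keeping it in a separate `.py` file and importing it is one common way to sidestep pool trouble inside notebooks.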
The other problem I've noticed is that the number of samples differs depending on the map resolution, which means there is an error in the function written for map investigation. Will have to look into that after building the initial hdf5 file.
Fixed the issue with pool not working, and now all cores are used. When using a Jupyter notebook, the child processes of the main function don't get garbage collected properly. I don't know how to fix that yet, but I will at some point.
Completed dataset creation, although there are a few missing maps and I'm not sure what to do about them. I think my script is functionally complete, although its memory usage could be better, so this issue is done for now.
Now that we have all the data, the next thing to do is ingest it into our model. Since the dataset is very large (98 GB), this will be challenging. The data must also be transformed into a "map", as the authors refer to it:
This will also be challenging, as this is the only information provided by the original authors on how they preprocessed their data.