joelselvaraj commented 7 years ago

I have been working on few enhancements over handling PDB files based on BioPython library to be true. Such as:

Downloading PDB file to a specific directory instead of only current directory
Skip or Overwrite if the PDB file already exists in the directory
Downloading Entire PDB database
Downloading multiple PDB files by passing a list of PDB IDs
Downloading all obsolete PDB entries from RCSB Server
Updating a PDB directory based on weekly status file from RCSB Server
Retrieve a PDB file (i.e. it will download the PDB file if it does not exists and read them)
Obsolete PDB file handling (A directory named "obsolete" is automatically created inside the specified "pdb_dir" to store and handle them)

jgreener64 commented 7 years ago

Great, thanks for making this PR. I will have a look at this tomorrow.

joelselvaraj commented 7 years ago

We have few things to discuss.

1. Whether the out_filepath argument in downloadpdb() is required? Because removing it will give an uniformity in the code and will be easy for users. It would be better to keep the PDB ID as file name so that it will be easy to handle it. The user can specify a different structure_name when reading the file to differentiate the file as he wishes.

2. Should we write test cases? The code has grown little complex than before. Especially there are lots of options and different combinations. This is may lead to unexpected bugs.

3. Should getstatuslist() function be exported in the module? As of now its just a helper function for getrecentchanges() function.

jgreener64 commented 7 years ago

So overall looks good, thanks for adding this useful feature. There are a couple more things but the above is most of my issues. For reference the Biopython implementation is here: https://github.com/biopython/biopython/blob/master/Bio/PDB/PDBList.py

In answer to your specific questions:

jgreener64 commented 7 years ago

out_filepath in my initial implementation was generally meant for the user to specify a directory rather than filename. Since you are adding the extra directory argument you can remove out_filepath for uniformity of filenames.
Yes, there should be some tests. These will rely on an internet connection but I think that is okay. Obviously we cannot download the whole PDB in the tests but certainly on the other functions.
Maybe don't export it, if we do then there should be information on its use in the docstring.

jgreener64 commented 7 years ago

There is one more thing that could be added now you are looking at this code, namely download of mmCIF and MMTF files from the PDB. Since mmCIF is now the standard, this is an important feature (I am writing an mmCIF parser for Bio.Structure now too). Would you be okay to implement this as a file_format (or similar) keyword argument to downloadpdb?

joelselvaraj commented 7 years ago

@jgreener64 Thank you so much for your review. I will update the code accordingly in the future commits. Adding option file_format in downloadpdb will be useful. Nice that you are working on mmCIF parser. We can discuss further changes as the code grows.

joelselvaraj commented 7 years ago

As discussed I have made all the recommended changes. Also added option to download and update files in PDB, XML and mmCIF formats. Let me know if there is something I missed or to be modified or added. I didn't still write test cases. I thought of refining the code before writing test cases.

jgreener64 commented 7 years ago

Thanks @joels94 , I will look at this tomorrow or Monday.

joelselvaraj commented 7 years ago

PDB, XML, mmCIF files are now downloaded in compressed format to reduce internet usage.
PDB files can now be downloaded and updated in MMTF format also.

codecov-io commented 7 years ago

Codecov Report

Merging #483 into master will increase coverage by 0.6%. The diff coverage is 80%.

@@            Coverage Diff            @@
##           master     #483     +/-   ##
=========================================
+ Coverage   70.34%   70.94%   +0.6%     
=========================================
  Files          34       34             
  Lines        2421     2537    +116     
=========================================
+ Hits         1703     1800     +97     
- Misses        718      737     +19

Impacted Files	Coverage Δ
src/structure/pdb.jl	`89.22% <80%> (-5.61%)`	:arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update e08f2a3...22231b3. Read the comment docs.

joelselvaraj commented 7 years ago

I have written the test cases. I have also made few changes to the code. Kindly take a look at it and let me know if any changes are required.

joelselvaraj commented 7 years ago

TO DO

[x] Update docstrings
[x] Write test cases
[x] Add documentation

Finally, completed writing the documentation. Let me if any changes are required before merging the code.

jgreener64 commented 7 years ago

On further thought, should the mmCIF type be called MMCIF? I think the Julia convention of capital types might supersede the technical name, and also this would bring it in line with MMTF where the MMs mean the same thing.

joelselvaraj commented 7 years ago

Updated docs and code as discussed. Let me if further changes are required.

joelselvaraj commented 7 years ago

I have updated the docs. Let me know if any changes are required.

jgreener64 commented 7 years ago

Great, I'm happy with this to go in. Thanks for all the work. I will wait until tomorrow in case anyone else has any comments, then I'll merge.

BioJulia / Bio.jl

[WIP] PDB file handling enhancements #483

Codecov Report