a-r-j / graphein

Protein Graph Library
https://graphein.ai/
MIT License
1.03k stars 131 forks

[TST] Add tests for protein graphs #45

Closed ericmjl closed 3 years ago

ericmjl commented 3 years ago

As per PR title.

Tagging @a-r-j and @rvinas.

I'm going to be developing a few starter tests here for you all to take a look at and mimic in testing the rest of the codebase. As with all code, keeping tests simple, readable, and explicit should take priority over having complex test suites. Test coverage can help identify which blocks of code are not explicitly tested, and I have enabled that in the test configuration file.

When testing, keep in mind that tests should be easy to map back to their original functions. As such, the test suite organization should largely mimic the source code organization.
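As a sketch of that convention (the module paths here are illustrative, not prescriptive), a function defined in graphein/protein/graphs.py would be exercised by a similarly named test in tests/protein/test_graphs.py:

# tests/protein/test_graphs.py -- mirrors graphein/protein/graphs.py
from graphein.protein.graphs import construct_graph


def test_construct_graph():
    """Covers graphein.protein.graphs.construct_graph."""
    ...  # assertions go here; see the worked example further down the thread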

ericmjl commented 3 years ago

Okay, @a-r-j and @rvinas! I went back and forth this evening on what else I could demonstrate for testing, but I think I'm hitting a mental wall right now.

Testing generally follows the pattern of providing expected inputs to a function and testing that the output has certain correctness properties. We deliberately ignore the function's implementation; as long as the function returns the correct outputs for a given input, we're gold.

Testing a graph package is difficult because graphs are high-dimensional objects that can be constructed in combinatorially many ways. As such, the compromise we can settle on is to use example-based tests. I adopted this approach in test_construct_graph, which exercises the function I'm most familiar with at the moment.
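To make the pattern concrete, here is a minimal sketch in the spirit of that test. It assumes construct_graph accepts a pdb_code keyword and returns a NetworkX graph with per-residue node metadata; the PDB code and attribute names are illustrative, not the actual test from this PR.

import networkx as nx

from graphein.protein.graphs import construct_graph


def test_construct_graph():
    # Example-based test: one known input, assertions on output properties.
    g = construct_graph(pdb_code="4hhb")

    # Check correctness properties of the result, not implementation details.
    assert isinstance(g, nx.Graph)
    assert g.number_of_nodes() > 0
    assert g.number_of_edges() > 0
    # Every node should carry the metadata downstream code relies on
    # ("chain_id" is an assumed attribute name).
    for _, d in g.nodes(data=True):
        assert "chain_id" in d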

You'll notice in the build logs that there is test coverage information provided as well. That will tell you exactly which lines of code have not been executed by a test. At a minimum, one should strive to cover as close to 100% of the codebase as possible. With the one test I wrote, I was able to cover 69% of graphein.protein.graphs and a few of the other modules it calls into. The challenge now is to fill in the rest of the tests. I'm not quite familiar with some of the uncovered functions, so @a-r-j, I might have to rely on your intuition here to fill them out. Btw, some of the example snippets you have at the bottom of your files are prime candidates for inclusion in the test suite. You could leverage those examples as starting points for writing tests; just remember to add assertions about properties that ought to be guaranteed! (See the sketch below for what that conversion could look like.)
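As a hypothetical illustration of that conversion (this snippet is invented for the sake of example, not lifted from the codebase):

# Before: a demo at the bottom of a module, which runs but asserts nothing:
#
#     if __name__ == "__main__":
#         g = construct_graph(pdb_code="3eiy")
#         print(g)
#
# After: the same example promoted into the test suite. The PDB code and
# the "coords" attribute are assumptions; adapt to the real examples.
from graphein.protein.graphs import construct_graph


def test_construct_graph_from_example():
    g = construct_graph(pdb_code="3eiy")
    # Printing proved nothing; assertions pin down what must hold.
    assert g.number_of_nodes() > 0
    assert all("coords" in d for _, d in g.nodes(data=True))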

If you test the library, then others will have confidence in adopting it. Your impact will go far beyond what a paper alone could achieve! I'll work on making tests run automatically in the rest of this PR, and then it'll be ready to merge -- the infrastructure for automatic testing will be up and running.

ericmjl commented 3 years ago

@a-r-j ready to review!

codecov-io commented 3 years ago

Codecov Report

:exclamation: No coverage uploaded for pull request base (graphein-api@a57f5cd). The diff coverage is n/a.


@@               Coverage Diff               @@
##             graphein-api      #45   +/-   ##
===============================================
  Coverage                ?   19.33%           
===============================================
  Files                   ?       25           
  Lines                   ?     1583           
  Branches                ?        0           
===============================================
  Hits                    ?      306           
  Misses                  ?     1277           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend: Δ = absolute, <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update a57f5cd...6c946d9.

ericmjl commented 3 years ago

We're at docstring coverage now! @a-r-j, are there any functions that could use some additional docstrings? (For reference, here's the summary.)

a-r-j commented 3 years ago

Hey @ericmjl, yep we’re getting there 😁 ! Docs are on the horizon.

The plan for tomorrow is:

  1. Docs coverage (as close to 100% as sanity allows)
  2. Feature Pre-processing

Re: pre-processing, I was thinking of a setup where users can pass a few standard functions (one-hot, mean, etc.) to a dictionary/config, as well as fitted sklearn scalers for normalising across graphs in a dataset (with some helper functions for creating them from a list of graphs), or unfitted sklearn scalers for single-graph normalisation. A config object seems nice and consistent, but a dictionary might be better for users who create their own features, as a config object would be a bit inflexible there.

Would be super keen to hear any thoughts/suggestions on this.

from functools import partial

from sklearn.preprocessing import StandardScaler

# one_hot, SS_ELEMENTS, and process_graph are the proposed API, sketched below.
pre_processing_dict = {
    "molecular_weight": StandardScaler,
    "secondary_structure": partial(one_hot, vocab=SS_ELEMENTS),
}

G = process_graph(G, pre_processing_dict)
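For discussion, here's a rough sketch of how process_graph might dispatch on the three cases (unfitted scaler classes, fitted scaler instances, and plain callables). process_graph, one_hot, and SS_ELEMENTS are the proposed names from above, not existing graphein API, and the vocabulary is a stand-in:

import numpy as np

SS_ELEMENTS = ["H", "E", "C"]  # stand-in secondary-structure vocabulary


def one_hot(value, vocab):
    # Encode a categorical value as a one-hot vector over the vocabulary.
    vec = np.zeros(len(vocab))
    vec[vocab.index(value)] = 1.0
    return vec


def _feature_column(G, nodes, feature):
    # Gather one feature across all nodes as a column vector for sklearn.
    return np.array([G.nodes[n][feature] for n in nodes], dtype=float).reshape(-1, 1)


def process_graph(G, transforms):
    """Apply each transform to the matching node feature across the graph."""
    for feature, transform in transforms.items():
        nodes = list(G.nodes)
        if isinstance(transform, type):
            # Unfitted scaler class: instantiate and fit on this graph alone.
            col = transform().fit_transform(_feature_column(G, nodes, feature))
        elif hasattr(transform, "transform"):
            # Fitted scaler: normalise with statistics learned across a dataset.
            col = transform.transform(_feature_column(G, nodes, feature))
        else:
            # Plain callable (e.g. the one-hot encoder) applied node-wise.
            for n in nodes:
                G.nodes[n][feature] = transform(G.nodes[n][feature])
            continue
        for n, v in zip(nodes, col.ravel()):
            G.nodes[n][feature] = float(v)
    return G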

After this, I think the only outstanding task is a conversion toolkit to support various frameworks. Then it’s cleaning up & polishing before I think a V2.0 release is in order :D EDIT: and tests! Will crack on with them this week.