Chemellia / ChemistryFeaturization.jl

Interface package for featurizing atomic structures
https://chemistryfeaturization.chemellia.org/dev/
MIT License
41 stars 14 forks source link

Restructure #55

Closed rkurchin closed 3 years ago

rkurchin commented 3 years ago

Pretty drastic restructure of how the package works to enable more sharing of core elements between featurization schemes, data types, etc...

codecov-io commented 3 years ago

Codecov Report

Merging #55 (97e2a8a) into main (d0796d3) will decrease coverage by 3.42%. The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #55      +/-   ##
==========================================
- Coverage   93.44%   90.02%   -3.43%     
==========================================
  Files           6        6              
  Lines         412      421       +9     
==========================================
- Hits          385      379       -6     
- Misses         27       42      +15     
Impacted Files Coverage Δ
src/api_sketch.jl 0.00% <0.00%> (ø)
src/atomfeat.jl 92.50% <0.00%> (-0.30%) :arrow_down:
src/ChemistryFeaturization.jl

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update d0796d3...97e2a8a. Read the comment docs.

rkurchin commented 3 years ago

Thanks for all this great feedback! I'll get to work on starting to make all this happen this afternoon. Given the realities of Zoom life, I anticipate it will take at least a week until there's anything close to ready to merge in... 🙄

Two more points:

1) somehow committing @DhairyaLGandhi 's suggestion about needing the where bit on the magical encoding/decoding seems to have made his other comments there disappear (edit: I later found them in the little collapsed "show resolved" thing). The point that the same syntax could lead to surprisingly different results is well-taken. My inclination is to have that syntax be used only for encoding since that will be done a lot more often than decoding in practice. Thoughts?

2) Another thing I haven't addressed here yet and would love input on how to address is the idea of letting users add features to the built-in lookup table atom_data_df (and have automatically-generated encode_f and decode_f accordingly). Any thoughts on what a good structure/syntax for something like that would be? I think it's fine if this capability is not added in the initial restructure and can be kept as an issue to implement in the future, but wanted to at least seed the idea in case anybody had thoughts.

DhairyaLGandhi commented 3 years ago

I'm thinking we can get this merged and figure out the details as we move. We have a lot of work set out to do anyway 😃

So if we standardise encode_f as being a lookup in the general case, which can dispatch on the feature type, users can dispatch on that to write a function which can do arbitrary things and that should be plenty imo.

encode_f(..., f::feat) = atom_data_df[!, f]

encode_f(..., f::myfeat) = ... # user defined arbitrary code which returns a vector
rkurchin commented 3 years ago

Well it's can't always be a lookup because for contextual features, they have to be calculated on a structure-by-structure basis. I suppose it could be a lookup if we make it so that every predefined feature is automatically precalculated when the structure is initialized and so the encode just looks up from that internally stored info. That might actually make sense, even though it means unnecessary data gets passed around, because it helps to "quarantine" the Python dependencies into the generation of the atoms objects...

DhairyaLGandhi commented 3 years ago

Well if it's not in the DataFrame it would simply error would it not?

DhairyaLGandhi commented 3 years ago

I agree reserving the callable syntax for encoding makes sense. I suppose one could also add a second argument/ kwarg to the function that controls what is happening. kind of like flag ? encode : decode and users could call it as a(args..., true/ false) in case it is actually desired that the same syntax do the different things. I don't have strong opinions, but thought I'd say it to help disambiguate the user side code.

rkurchin commented 3 years ago

Well if it's not in the DataFrame it would simply error would it not?

Yes, but what I'm envisioning is a way that the users could add to the DataFrame if they had noncontextual features that they wanted to add. So suppose there's some other feature that's just a property of elements, that isn't part of atom_data_df, but if it were, then all the lookup stuff for generating encode_f/decode_f would work. So the user could somehow edit atom_data_df or perhaps define an extra DataFrame that the code would also check when trying to do a lookup, or something like that.

thazhemadam commented 3 years ago

I agree reserving the callable syntax for encoding makes sense. I suppose one could also add a second argument/ kwarg to the function that controls what is happening. kind of like flag ? encode : decode and users could call it as a(args..., true/ false) in case it is actually desired that the same syntax do the different things. I don't have strong opinions, but thought I'd say it to help disambiguate the user side code.

Just a quick query here. Wouldn't having the same syntax do things this different from each other violate (or at least subtly overstep) the single responsibility principle?

DhairyaLGandhi commented 3 years ago

Well that was the motivation for https://github.com/aced-differentiate/ChemistryFeaturization.jl/pull/55#discussion_r608555225 :p

But if there is an argument that explicitly handles the "mode" then there's no ambiguity. This comment is largely to offer a way out in case it is desired to have that syntax for both encode and decode.

DhairyaLGandhi commented 3 years ago

If users add to this DataFrame, wouldn't we also have to write it to disk? Seems a bit fiddly but maybe I'm mistaking something?

rkurchin commented 3 years ago

maybe the DataFrame is just stored as part of the encode_f/decode_f then in that case?

rkurchin commented 3 years ago
function decode(f::AbstractFeature{Tn,Te}, encoded_f::Te) where {Te,Tn}

We should switch around the arguments here. Just seems more "natural" in Julia. I've also added the where statement

Can you explain what you mean by "natural" here? To me it's more natural to give the thing you're decoding first, and then the thing that tell you how to encode it. I'm open to doing it the other way though.

DhairyaLGandhi commented 3 years ago

It's not a hard and fast rule of course, so it's not something we need to stick to, just some conventions. Here, since I am decoding a thing, I think back to convert or reinterpret, in which the conversion specification (the type) comes first and the thing I am converting comes later.

thazhemadam commented 3 years ago

I'm with @DhairyaLGandhi on this one. Also, if you're applying an encoding scheme to different features (which is something I think is more likely than applying multiple encoding schemes to a feature), the code starts to look cleaner too, imo.

rkurchin commented 3 years ago

I've been pondering ways for users to add custom lookup tables for noncontextual atom features...what do people think about having (in utils) a way to build a function as encode_f that has a Ref to a DataFrame, which by default would be the built-in atom_data_df but the user could supply an alternative one? That seems to be to be a relatively clean way around the problem of trying to have a user modify the built-in DataFrame, and also be a bit easier for this simple case than asking the user to write their own encode_f from scratch...

@DhairyaLGandhi, @thazhemadam, thoughts?

codecov-commenter commented 3 years ago

Codecov Report

Merging #55 (88200e8) into main (e9c3a14) will decrease coverage by 35.94%. The diff coverage is 75.81%.

Impacted file tree graph

@@             Coverage Diff             @@
##             main      #55       +/-   ##
===========================================
- Coverage   93.36%   57.42%   -35.95%     
===========================================
  Files           5        9        +4     
  Lines         407      404        -3     
===========================================
- Hits          380      232      -148     
- Misses         27      172      +145     
Impacted Files Coverage Δ
src/featurizations/weavefeaturization.jl 0.00% <0.00%> (ø)
src/old/featurize.jl 0.00% <ø> (ø)
src/old/weave_fcns.jl 0.00% <ø> (ø)
src/ChemistryFeaturization.jl 32.00% <32.00%> (ø)
src/atoms/atomgraph.jl 60.75% <60.75%> (ø)
src/featurizations/graphnodefeaturization.jl 62.50% <62.50%> (ø)
src/features/atomfeature.jl 80.76% <80.76%> (ø)
src/utils/atomfeature_utils.jl 96.55% <96.55%> (ø)
src/utils/graph_building.jl 97.87% <97.87%> (ø)
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update e9c3a14...88200e8. Read the comment docs.

rkurchin commented 3 years ago

For the record, the tests pass locally for me on my Mac, so I think the macOS failure is something to do with the runner and not an actual problem with the package on macOS.

🥁 I've filed issues for various remaining TODO's, so I'm going to merge this! 😬 It's been a saga!