Closed rkurchin closed 3 years ago
Merging #55 (97e2a8a) into main (d0796d3) will decrease coverage by 3.42%. The diff coverage is 0.00%.
```diff
@@            Coverage Diff             @@
##             main      #55      +/-   ##
==========================================
- Coverage   93.44%   90.02%    -3.43%
==========================================
  Files           6        6
  Lines         412      421        +9
==========================================
- Hits          385      379        -6
- Misses         27       42       +15
==========================================
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/api_sketch.jl | 0.00% <0.00%> (ø) | |
| src/atomfeat.jl | 92.50% <0.00%> (-0.30%) | :arrow_down: |
| src/ChemistryFeaturization.jl | | |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update d0796d3...97e2a8a. Read the comment docs.
Thanks for all this great feedback! I'll get to work on starting to make all this happen this afternoon. Given the realities of Zoom life, I anticipate it will take at least a week until there's anything close to ready to merge in... 🙄
Two more points:
1) Somehow, committing @DhairyaLGandhi's suggestion about needing the `where` bit on the magical encoding/decoding seems to have made his other comments there disappear (edit: I later found them in the little collapsed "show resolved" thing). The point that the same syntax could lead to surprisingly different results is well-taken. My inclination is to have that syntax be used only for encoding, since that will be done a lot more often than decoding in practice. Thoughts?
2) Another thing I haven't addressed here yet, and would love input on how to address, is the idea of letting users add features to the built-in lookup table `atom_data_df` (and have automatically generated `encode_f` and `decode_f` accordingly). Any thoughts on what a good structure/syntax for something like that would be? I think it's fine if this capability is not added in the initial restructure and can be kept as an issue to implement in the future, but I wanted to at least seed the idea in case anybody had thoughts.
I'm thinking we can get this merged and figure out the details as we move. We have a lot of work set out to do anyway 😃
So if we standardise `encode_f` as being a lookup in the general case, which can dispatch on the feature type, users can dispatch on that to write a function which can do arbitrary things, and that should be plenty IMO.

```julia
encode_f(..., f::feat) = atom_data_df[!, f]
encode_f(..., f::myfeat) = ... # user-defined arbitrary code which returns a vector
```
Well, it can't always be a lookup, because contextual features have to be calculated on a structure-by-structure basis. I suppose it could be a lookup if we make it so that every predefined feature is automatically precalculated when the structure is initialized, and the `encode` just looks up from that internally stored info. That might actually make sense, even though it means unnecessary data gets passed around, because it helps to "quarantine" the Python dependencies into the generation of the atoms objects...
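A rough sketch of what I mean (all names here are hypothetical, not existing package API): the structure computes every registered feature once at construction time, and encoding afterwards is a pure lookup into that stored data.

```julia
# Hypothetical sketch: precompute contextual features when the structure
# is built, so `encode` becomes a lookup afterwards.
struct AtomGraphSketch
    elements::Vector{String}
    features::Dict{Symbol, Vector{Float64}}  # precomputed per-atom feature values
end

function AtomGraphSketch(elements, feature_fcns::Dict{Symbol, <:Function})
    # each feature function sees the whole structure, so contextual
    # features (e.g. coordination number) can be computed here
    features = Dict(name => f(elements) for (name, f) in feature_fcns)
    return AtomGraphSketch(elements, features)
end

# encoding is now just a lookup into the internally stored values
encode_sketch(ag::AtomGraphSketch, name::Symbol) = ag.features[name]
```

The tradeoff is exactly the one noted above: every structure carries all precomputed feature vectors around, but the feature-generating (possibly Python-dependent) code only ever runs inside the constructor.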
Well if it's not in the DataFrame it would simply error would it not?
I agree reserving the callable syntax for encoding makes sense. I suppose one could also add a second argument/kwarg to the function that controls what is happening, kind of like `flag ? encode : decode`, and users could call it as `a(args..., true/false)` in case it is actually desired that the same syntax do the different things. I don't have strong opinions, but thought I'd say it to help disambiguate the user-side code.
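For concreteness, a minimal sketch of the mode-flag idea as a callable struct (names and the one-hot example are illustrative only):

```julia
# Hypothetical sketch: a feature object callable with a boolean flag
# selecting encode vs. decode, i.e. `flag ? encode : decode`.
struct FlagFeature
    encode_f::Function
    decode_f::Function
end

# encoding is the common case, so the flag defaults to `true`
(f::FlagFeature)(x, encode::Bool = true) = encode ? f.encode_f(x) : f.decode_f(x)

onehot_ab = FlagFeature(x -> x == "A" ? [1, 0] : [0, 1],
                        v -> v[1] == 1 ? "A" : "B")

onehot_ab("A")           # encodes: [1, 0]
onehot_ab([0, 1], false) # decodes: "B"
```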
> Well if it's not in the DataFrame it would simply error would it not?
Yes, but what I'm envisioning is a way that the users could add to the DataFrame if they had noncontextual features that they wanted to add. So suppose there's some other feature that's just a property of elements, that isn't part of `atom_data_df`, but if it were, then all the lookup stuff for generating `encode_f`/`decode_f` would work. So the user could somehow edit `atom_data_df`, or perhaps define an extra DataFrame that the code would also check when trying to do a lookup, or something like that.
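To illustrate the "extra DataFrame the code would also check" option, here's a hypothetical lookup that tries the built-in table first and then any user-supplied tables, without ever mutating the built-in one (`lookup_feature` and the toy tables are made-up names, not package API):

```julia
using DataFrames

# Hypothetical sketch: fall through a chain of lookup tables.
builtin_df = DataFrame(Symbol = ["H", "C"], Mass = [1.008, 12.011])

function lookup_feature(feature::Symbol, tables::DataFrame...)
    for df in tables
        feature in propertynames(df) && return df[!, feature]
    end
    error("feature $feature not found in any lookup table")
end

# a user-defined table adding a noncontextual elemental property
user_df = DataFrame(Symbol = ["H", "C"], MyProp = [0.1, 0.2])

lookup_feature(:Mass, builtin_df, user_df)    # found in the built-in table
lookup_feature(:MyProp, builtin_df, user_df)  # found in the user table
```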
> I agree reserving the callable syntax for encoding makes sense. I suppose one could also add a second argument/kwarg to the function that controls what is happening, kind of like `flag ? encode : decode`, and users could call it as `a(args..., true/false)` in case it is actually desired that the same syntax do the different things. I don't have strong opinions, but thought I'd say it to help disambiguate the user-side code.
Just a quick query here. Wouldn't having the same syntax do things this different from each other violate (or at least subtly overstep) the single responsibility principle?
Well that was the motivation for https://github.com/aced-differentiate/ChemistryFeaturization.jl/pull/55#discussion_r608555225 :p
But if there is an argument that explicitly handles the "mode" then there's no ambiguity. This comment is largely to offer a way out in case it is desired to have that syntax for both encode and decode.
If users add to this DataFrame, wouldn't we also have to write it to disk? Seems a bit fiddly but maybe I'm mistaking something?
Maybe the DataFrame is just stored as part of the `encode_f`/`decode_f` then, in that case?
```julia
function decode(f::AbstractFeature{Tn,Te}, encoded_f::Te) where {Te,Tn}
```

We should switch around the arguments here. It just seems more "natural" in Julia. I've also added the `where` statement.
Can you explain what you mean by "natural" here? To me it's more natural to give the thing you're decoding first, and then the thing that tells you how to encode it. I'm open to doing it the other way, though.
It's not a hard and fast rule of course, so it's not something we need to stick to, just some conventions. Here, since I am decoding a thing, I think back to `convert` or `reinterpret`, in which the conversion specification (the type) comes first and the thing I am converting comes later.
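That is, the Base convention looks like this, and the proposed `decode` signature mirrors it:

```julia
# The "specification first, value second" convention in Base:
convert(Float64, 1)         # the target type, then the value: 1.0
reinterpret(UInt32, 1.0f0)  # the target type, then the value: 0x3f800000

# by analogy, the feature (the spec) first, the encoded value second:
# decode(f::AbstractFeature, encoded_f)
```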
I'm with @DhairyaLGandhi on this one. Also, if you're applying an encoding scheme to different features (which is something I think is more likely than applying multiple encoding schemes to a feature), the code starts to look cleaner too, imo.
I've been pondering ways for users to add custom lookup tables for noncontextual atom features... what do people think about having (in utils) a way to build a function as `encode_f` that has a `Ref` to a `DataFrame`, which by default would be the built-in `atom_data_df` but the user could supply an alternative one? That seems to me to be a relatively clean way around the problem of trying to have a user modify the built-in DataFrame, and also a bit easier for this simple case than asking the user to write their own `encode_f` from scratch...
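Something like this hypothetical sketch (`make_encode_f` and the toy tables are illustrative names, not existing package API):

```julia
using DataFrames

# Hypothetical builder: returns an `encode_f` closure holding a Ref to a
# lookup table, defaulting to the built-in one.
const default_df = DataFrame(Symbol = ["H", "C"], Mass = [1.008, 12.011])

function make_encode_f(feature::Symbol; df::Ref{DataFrame} = Ref(default_df))
    # the closure holds a Ref, so the table can be swapped later
    # without rebuilding the function
    return atoms -> [df[][df[].Symbol .== a, feature][1] for a in atoms]
end

encode_mass = make_encode_f(:Mass)
encode_mass(["C", "H"])  # [12.011, 1.008]

# user-supplied alternative table with a custom noncontextual feature
my_df = Ref(DataFrame(Symbol = ["H", "C"], MyProp = [0.1, 0.2]))
encode_myprop = make_encode_f(:MyProp; df = my_df)
```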
@DhairyaLGandhi, @thazhemadam, thoughts?
Merging #55 (88200e8) into main (e9c3a14) will decrease coverage by 35.94%. The diff coverage is 75.81%.
```diff
@@             Coverage Diff             @@
##             main      #55       +/-   ##
===========================================
- Coverage   93.36%   57.42%   -35.95%
===========================================
  Files           5        9        +4
  Lines         407      404        -3
===========================================
- Hits          380      232      -148
- Misses         27      172      +145
===========================================
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/featurizations/weavefeaturization.jl | 0.00% <0.00%> (ø) | |
| src/old/featurize.jl | 0.00% <ø> (ø) | |
| src/old/weave_fcns.jl | 0.00% <ø> (ø) | |
| src/ChemistryFeaturization.jl | 32.00% <32.00%> (ø) | |
| src/atoms/atomgraph.jl | 60.75% <60.75%> (ø) | |
| src/featurizations/graphnodefeaturization.jl | 62.50% <62.50%> (ø) | |
| src/features/atomfeature.jl | 80.76% <80.76%> (ø) | |
| src/utils/atomfeature_utils.jl | 96.55% <96.55%> (ø) | |
| src/utils/graph_building.jl | 97.87% <97.87%> (ø) | |
| ... and 8 more | | |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update e9c3a14...88200e8. Read the comment docs.
For the record, the tests pass locally for me on my Mac, so I think the macOS failure is something to do with the runner and not an actual problem with the package on macOS.
🥁 I've filed issues for various remaining TODO's, so I'm going to merge this! 😬 It's been a saga!
Pretty drastic restructure of how the package works to enable more sharing of core elements between featurization schemes, data types, etc...