Chemellia / ChemistryFeaturization.jl

Interface package for featurizing atomic structures
https://chemistryfeaturization.chemellia.org/dev/
MIT License
41 stars 14 forks source link

pair and bond features #127

Closed rkurchin closed 2 years ago

rkurchin commented 2 years ago

This is the other major piece needed for WeaveModel (and a variety of other featurizers too). The idea here in the short-term will be to set up PairFeatureDescriptor and BondFeatureDescriptor very similarly to the way SpeciesFeatureDescriptor is, with fleshed-out versions just for MolecularGraph to start with.

A lot of parts of this functionality will be able to be cleaner once we're using AtomsBase for structure representations since we'll be able to have some guarantees on the way atom indexing works, etc., but those cleanups will have to come later since I doubt that interface will be stable for some time yet.

Another thing I'd like to add is a check in the BFD for whether two atoms are actually bonded, but again, writing that machinery in a generic way will be much easier once we have a consistent structure interface, so may just skip it for now and assume that MolecularGraph's errors will be clear enough.

For data structure, probably a return type from get_value of Symmetric would make sense to me (one value for every pair of atoms). We might want to make it sparse for the BFD case.

codecov[bot] commented 2 years ago

Codecov Report

Merging #127 (1ffd6a8) into main (0dd3b61) will decrease coverage by 5.45%. The diff coverage is 40.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #127      +/-   ##
==========================================
- Coverage   86.19%   80.73%   -5.46%     
==========================================
  Files          18       20       +2     
  Lines         507      571      +64     
==========================================
+ Hits          437      461      +24     
- Misses         70      110      +40     
Impacted Files Coverage Δ
src/ChemistryFeaturization.jl 80.00% <ø> (ø)
src/data.jl 0.00% <0.00%> (ø)
src/features/bondfeature.jl 0.00% <0.00%> (ø)
src/features/features.jl 88.88% <0.00%> (-11.12%) :arrow_down:
src/features/pairfeature.jl 0.00% <0.00%> (ø)
src/features/speciesfeature.jl 76.19% <60.00%> (+3.46%) :arrow_up:
src/codecs/onehotonecold.jl 84.21% <76.47%> (-11.79%) :arrow_down:
src/features/elementfeature.jl 85.00% <100.00%> (+4.04%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 0dd3b61...1ffd6a8. Read the comment docs.

rkurchin commented 2 years ago

Okay, that makes sense, but given that featurization won’t generally have to be performant, is it worth adding more type parameters to every FD object? Seems like cluttering things up a lot…also, presumably if it’s an abstract type from within this package, the code generation isn’t that bad since the number of subtypes is pretty small…

————————————— Rachel C. Kurchin, PhD Postdoctoral Researcher (formerly MolSSI and MFI Fellow) Carnegie Mellon Mechanical Engineering and Materials Science https://rkurchin.github.io On Nov 3, 2021, 10:57 PM -0400, Dhairya Gandhi @.***>, wrote:

@DhairyaLGandhi commented on this pull request. In src/features/bondfeature.jl:

+ +A feature object that encodes features associated with bonded pairs of atoms in a structure. For pair features, see PairFeatureDescriptor. + +Type parameter represents the structure representation(s) from which this feature descriptor is able to compute features. + +## Fields +- name::String: the name of the feature +- compute_f: function that takes in a structure of type A and returns values of this feature for every atom in the structure +- encoder_decoder::AbstractCodec: codec that encodes/decodes values of this feature +- categorical::Bool: flag for whether the feature is categorical or continuous-valued +- encodable_elements::Vector{String}: list of elements (by symbol) that can be encoded by this feature +""" +struct BondFeatureDescriptor{A} <: AbstractPairFeatureDescriptor

rkurchin commented 2 years ago

Just realized bond features are now duplicated with somewhat different approaches to implementation. @thazhemadam, would appreciate your thoughts here. While I understand the trait-like approach you had sketched out, I tend to think doing it more analogously to EFD, SFD, etc. makes more sense because it's a bit more generalizable. e.g. if we consider a "bond" to be any nodes that share a nonzero edge (at least in the AtomGraph context) then that may not be compatible with specifying if it's specifically a single, double, etc. as those things are typically only well-defined in the molecular context. I would rather have convenience constructors for BFD's (also, LOLOL that that becomes the abbreviation for this) for bond type that accomplish this, perhaps even with aliases of the types to make the functionality look similar to what you already have. Anyway, LMK what you think.

thazhemadam commented 2 years ago

While I understand the trait-like approach you had sketched out, I tend to think doing it more analogously to EFD, SFD, etc. makes more sense because it's a bit more generalizable

Only "universal and categorical" features need to be expressed using traits. I say "categorical" since it is essentially boolean in this case - you're either a double bond or you're not. I say "universal" since every bond has to have a BondType. Since there aren't a whole lot of options, and since these traits are generally universal features I feel like this might be a cleaner approach here.

I would rather have convenience constructors for BFD's (also, LOLOL that that becomes the abbreviation for this) for bond type that accomplish this.

How would you know what type would apply here in the convenience constructor? If you already know what bond type is applicable beforehand, couldn't you just as easily initialize it using traits and dispatch on it using BondTypes?

rkurchin commented 2 years ago

Only "universal and categorical" features need to be expressed using traits. I say "categorical" since it is essentially boolean in this case - you're either a double bond or you're not. I say "universal" since every bond has to have a BondType. Since there aren't a whole lot of options, and since these traits are generally universal features I feel like this might be a cleaner approach here.

My point prior was that if we take the notion of bond that I see as most compatible with graph representations, it's not actually universal in the way you've defined it. For molecules, perhaps, since it's generally well-posed to infer bond order with some fairly simple heuristics (and probably graph weights would be best represented as integers in that case anyway), but in a solid, not so.

I also think overall there's better cognitive fluency if the syntax and structure looks similar, so e.g. rather than having to initialize some other type like BondOrder, you'd just do a convenience constructor like BondFeatureDescriptor("bond_order"), analogous to what I did for SFD's and EFD's. Does that make sense?

DhairyaLGandhi commented 2 years ago

Could you sketch out in terms of current syntax or even pseudo-syntax what you think would make for intuitive syntax for constructing molecules? Maybe something that involves building smaller structures (an analog to functional groups, perhaps) and sticking them together to form the full molecule? I imagine if we can describe sites of "linkages" in the substructure, we could also represent crystals with it.

rkurchin commented 2 years ago

I’m not sure what that has to do with this PR? From my perspective the “molecule construction” is done when MolecularGraph reads in a SMILES string. The code here is for representing features of that molecule (and potentially solids etc. as well)

————————————— Rachel C. Kurchin, PhD Postdoctoral Researcher (formerly MolSSI and MFI Fellow) Carnegie Mellon Mechanical Engineering and Materials Science https://rkurchin.github.io On Nov 9, 2021, 7:50 AM -0500, Dhairya Gandhi @.***>, wrote:

Could you sketch out in terms of current syntax or even pseudo-syntax what you think would make for intuitive syntax for constructing molecules? Maybe something that involves building smaller structures (an analog to functional groups, perhaps) and sticking them together to form the full molecule? I imagine if we can describe sites of "linkages" in the substructure, we could also represent crystals with it. — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android.

DhairyaLGandhi commented 2 years ago

Ok oh, I should have said representing the features instead.

rkurchin commented 2 years ago

Sure, so as I imagine it, it would be basically identical to the EFD and SFD case, but you'd need to return matrices instead of vectors because it takes two atoms to specify these features:

# for the bond case
bfd_bo = BondFeatureDescriptor("bond_order")
get_value(bfd_bo, atomgraph) # this should return a (perhaps sparse) matrix of bond orders that is defined only at pairs of indices that share bonds
encode(bfd_bo, atomgraph) # this works similarly

# for the pair case
pfd_dist = PairFeatureDescriptor("distance")
get_value(pfd_dist, atomgraph) # this will return a distance matrix defined at every pair of indices

Does this help?

DhairyaLGandhi commented 2 years ago

Julia should be able to figure out the types just fine. We might be missing a couple of exports per CI.

rkurchin commented 2 years ago

Fixed the export thing, going to add a couple more OneHotOneCold tests to cover the higher-dimensional case and also make sure that the types work fine when I take out those annotations, then will merge this.