Chemellia / ChemistryFeaturization.jl

Interface package for featurizing atomic structures
https://chemistryfeaturization.chemellia.org/dev/
MIT License
41 stars 14 forks source link

actual SpeciesFeature support #118

Closed rkurchin closed 2 years ago

rkurchin commented 2 years ago

resolves #72

Just starting to sketch this out...

rkurchin commented 2 years ago

Going to drop some other thoughts here as they come up...

Fairly minor point, but a change that I think could help clarify things for some people: should we change the "callable" action of an FD on a structure to be raw rather than encoded value? Since it's very unlikely anybody will use that signature to actually encode things anyway, itโ€™s probably more useful/intuitive to be able to get a human-readable value out of that, especially since we will now have more separated notions of computing a value and encoding it since the computation can be less trivial than a lookup table.

DhairyaLGandhi commented 2 years ago

We also would need to featurize more than just atoms too.

rkurchin commented 2 years ago

For Weave yeah, we need BondFeatureDescriptor actually implemented too, but let's do one thing at a time. ;) Once we sort this stuff out, that one should fall into place rather cleanly.

rkurchin commented 2 years ago

Are we removing the AbstractAtoms at

https://github.com/Chemellia/ChemistryFeaturization.jl/blob/aa7d310bc427cb6a2f20e6e5d09ae0bfe37baea8/src/features/features.jl#L32

?

I guess that's a fair point; if we do what I suggested, we would have to remove that dispatch of encode...which I guess might actually get used internally sometimes...in either case, I think it would be useful to explicitly have a function to get the value of a feature vs. the encoded value since in many cases they're different. @thazhemadam, thoughts?

What is the API to add a new feature that you're envisioning? Currently something like struct MGDegree <: ChemistryFeaturization.FeatureDescriptor.AbstractAtomFeatureDescriptor end is needed, which can then be called via encode(fd, ::GraphMol) to return the matrix, which can then be used elsewhere. I'm not sure CF can control what can go inside the encode since the feature descriptor could come from anywhere and only work with types not in control of CF. MolecularGraph is only the first, and might be one of the more forgiving ones.

What I'm envisioning is that we hopefully avoid defining lots of new structs (maybe this is personal preference, but I kind of hate reading code with lots of narrow-use-case types, especially if I have to go to some other file to track down what they actually are), so the goal would be to do all of these as instances of SpeciesFeatureDescriptor...I'm thinking to start with I'll create a bunch of "built-in" things along the lines of what we did for EFD's, but rather than a lookup table of values, it's a lookup table of functions. This hopefully will also give a template to users/developers of how they could implement another one themselves if needed by providing the relevant function.

DhairyaLGandhi commented 2 years ago

Removing special dispatches as much as possible is a good thing. If the method exists, it will get called anyway, if it doesn't it will clearly say why via a MethodError (and also indicate what needs to be defined), and Julia will always specialise for the type, whether we type it or not.

hate reading code with lots of narrow-use-case types

I tend to agree. We can also dispatch on functions directly, as I had mentioned. I think there might be a case for cleaning up some of the types in CF too.

it's a lookup table of functions

so... dispatch? It would be much faster considering dispatch is resolved at compile time (as far as possible, with runtime dispatch being pretty fast anyway), and lookup will happen at runtime and will always have additional overhead (which is much more than the runtime dispatch).

rkurchin commented 2 years ago

it's a lookup table of functions

so... dispatch? It would be much faster considering dispatch is resolved at compile time (as far as possible, with runtime dispatch being pretty fast anyway), and lookup will happen at runtime and will always have additional overhead (which is much more than the runtime dispatch).

Okay, haha, fair point, but if we do it that way then (SpeciesFeatureDescriptor becomes an abstract type, and also) the user has to define a new struct for any new feature they want to encode. I guess this is probably okay as long as we provide very clear instructions in the docs, I just feel like it could seem intimidating for some people.

rkurchin commented 2 years ago

Oh, I guess the function dispatch could be a plausible "intermediate" solution to creating a new type for every feature, as you said.

thazhemadam commented 2 years ago

Fairly minor point, but a change that I think could help clarify things for some people: should we change the "callable" action of an FD on a structure to be raw rather than encoded value? Since it's very unlikely anybody will use that signature to actually encode things anyway, itโ€™s probably more useful/intuitive to be able to get a human-readable value out of that, especially since we will now have more separated notions of computing a value and encoding it since the computation can be less trivial than a lookup table.

So instead of the encoded value, this would return the value that would be encoded? I think that makes more sense than the current behaviour tbh.

I guess that's a fair point; if we do what I suggested, we would have to remove that dispatch of encode...which I guess might actually get used internally sometimes...in either case, I think it would be useful to explicitly have a function to get the value of a feature vs. the encoded value since in many cases they're different.

I agree. It would be useful to use the callable syntax in a non-redundant way. I'd assume this would also encourage usage of the callable syntax and the encode function in a more well-defined and less confusing way. It separates the concerns nicely, so I like the idea.

(also, the fact that you can encode using the callable syntax, but there's nothing similar for decoding has annoyed me a little ๐Ÿ˜› )

thazhemadam commented 2 years ago

Okay, haha, fair point, but if we do it that way then (SpeciesFeatureDescriptor becomes an abstract type, and also) the user has to define a new struct for any new feature they want to encode. I guess this is probably okay as long as we provide very clear instructions in the docs, I just feel like it could seem intimidating for some people.

I think SpeciesFD should be an abstract type. If we need a lookup table of functions to decide what we're dispatching on, those types are probably better off as separate types, with SpeciesFD as their super-type. The "lookup table of functions" idea sounds to me like we're trying to do what the compiler should be doing, but very poorly in comparison.

DhairyaLGandhi commented 2 years ago

I think separating the original value vs the encoded value is a good thing. The way to handle it would usually be a function that looks like encode(f, x) where f can generically operate on the input and always return the encoded value (whose return type can be made a requirement for the time being), and maybe have a separate real_value(x::AbstractFD) = x.value (say).

We aren't "removing the dispatch" by getting rid of the type in that statement. Notice that it isn't doing anything special that any other feature wouldn't be expected to support anyway. Removing the type only means we are able to input things like GraphMol and output an AtomGraph or FeaturizedAtoms.

DhairyaLGandhi commented 2 years ago

Might help to see the type hierarchy we have for the FeatureDescriptors.

Julia> print_tree(ChemistryFeaturization.AbstractType.AbstractFeatureDescriptor)
AbstractFeatureDescriptor
โ”œโ”€ AbstractAtomFeatureDescriptor
โ”‚  โ”œโ”€ SpeciesFeatureDescriptor
โ”‚  โ”œโ”€ ElementFeatureDescriptor
โ”‚  โ””โ”€ MGDegree
โ”œโ”€ AbstractEnvironmentFeatureDescriptor
โ”‚  โ””โ”€ OrbitalFeatureDescriptor
โ””โ”€ AbstractPairFeatureDescriptor
   โ”œโ”€ BondFeatureDescriptor
   โ”‚  โ”œโ”€ BondType
   โ”‚  โ”œโ”€ InRing
   โ”‚  โ””โ”€ IsConjugated
   โ””โ”€ PairFeatureDescriptor

The question becomes, what interface would a new feature be expected to follow, and where in this tree would such a feature fit in with the right tradeoffs. Making new features as instances of existing FeatureDescriptors with a hook out for adding a completely separated system for encoding would be fine too I think. Notice that MGDegree could also be expressed as an instance of SpeciesFeatureDescriptor with a type param for dispatch. it would look like SpeciesFeatureDescriptor{tyepof(MolecularGraph.degree)} for typing purposes.

DhairyaLGandhi commented 2 years ago

We definitely need a way to keep the number of types getting out of hand.

DhairyaLGandhi commented 2 years ago

Another thing I thought was: If we need parameters, handling vectors, or anything of the sort, using a function can become somewhat illegible. Consider writing a feature as a function that requires additional arguments and parameters than the current system would allow without types.

function myfeature(x, extra_arg) ... end

To supply the additional arg in the FeatureDescriptor would look like

julia> SpeciesFeatureDescriptor(x -> myfeature(x, some_arg), ...)
SpeciesFeatureDescriptor{#47}(#47, ...)

losing the ability to "see" what the FD is.

Types don't have that issue, of course, but its a minor point I thought I'd share.

rkurchin commented 2 years ago

Okay, so it seems there's consensus on changing the "callable" syntax of FD's in general so I'll for sure do that.

In terms of how to implement species FD, I don't think we've come to a ๐Ÿ’ฏ clear consensus. There seem to be a few options:

  1. SFD becomes an abstract type, every feature is a concrete subtype --> my sense is we're leaning against this because it leads to a clunky proliferation of types, and also more work for a user who wants to add a new feature to support
  2. what I had originally suggested, wherein SFD has a field that stores the function, and also takes a type parameter describing what "origins" of atoms objects it can act on (this dispatch is fairly clean because you just match the type parameters)
  3. @DhairyaLGandhi's suggestion, where the function itself becomes a type parameter (possibly in addition to the type parameter I described above, which is also necessary IMO). Then we can dispatch on both the "matching" of underlying structure representation and also on the evaluation of the function. AFAICT, this is basically an entirely superior version of option 2 (unless, I suppose, you dislike types with multiple parameters).

So if this sounds reasonable, I'll push forward with option 3? In terms of actual implementation, how does that work? e.g., will it work to do (where A is the atoms type and F the function to compute the feature's value)

struct SpeciesFeatureDescriptor{A,F} <: AbstractAtomFeatureDescriptor
    # existing stuff that doesn't make explicit reference to A or F
end

and then would I be able to still access F to call it? Or do I need fields inside the struct that also store these? If not, it seems impossible to get away from the thing @DhairyaLGandhi had discouraged, namely, using type parameters in constructors...


Somewhat distinct discussion, relevant for docs and also for eventual names of fields, etc.: What is clear terminology for distinguishing values vs. encoded values of features? And similarly, the functions the compute those things, which might be even trickier to figure out, since I don't think something like raw_value vs. encoded_value is all that bad, and encode_f for the encoder is fine, but then what do you call the function that gets the raw_value in the first place?!? #ntih amirite @thazhemadam? ๐Ÿ˜ญ

codecov[bot] commented 2 years ago

Codecov Report

Merging #118 (16facc6) into main (aa7d310) will decrease coverage by 1.23%. The diff coverage is 88.34%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #118      +/-   ##
==========================================
- Coverage   87.42%   86.19%   -1.24%     
==========================================
  Files          17       18       +1     
  Lines         493      507      +14     
==========================================
+ Hits          431      437       +6     
- Misses         62       70       +8     
Impacted Files Coverage ฮ”
src/ChemistryFeaturization.jl 80.00% <0.00%> (-20.00%) :arrow_down:
src/features/speciesfeature.jl 72.72% <73.68%> (+72.72%) :arrow_up:
src/atoms/atomgraph.jl 75.94% <81.81%> (-1.33%) :arrow_down:
src/features/orbitalfeature.jl 78.12% <88.88%> (-2.53%) :arrow_down:
src/codecs/onehotonecold.jl 96.00% <95.00%> (-4.00%) :arrow_down:
src/codecs/directcodec.jl 100.00% <100.00%> (รธ)
src/codecs/simplecodec.jl 100.00% <100.00%> (รธ)
src/features/elementfeature.jl 80.95% <100.00%> (-7.29%) :arrow_down:
src/features/features.jl 100.00% <100.00%> (รธ)
src/featurizations/graphnodefeaturization.jl 75.47% <100.00%> (รธ)
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more ฮ” = absolute <relative> (impact), รธ = not affected, ? = missing data Powered by Codecov. Last update aa7d310...16facc6. Read the comment docs.

DhairyaLGandhi commented 2 years ago

What's the callable syntax now and the updated version?

Generally if you want a type param in the constructor, it's alright but couldn't that also be gotten via a constructor that accepts a type or can infer the type? If there isn't a preferred solution then I'd suggest with having the type-param-in-constructor so as to not halt progress

DhairyaLGandhi commented 2 years ago

Yes, concrete types can have type params that the abstract type doesn't have, and you can access them like usual. Storing it as a function would be okay too.

rkurchin commented 2 years ago

Okay so in 7d08383 I made a couple changes summarized in the commit message, but I wanted to make a couple more comments here...

First, a short code snippet demo of the changed syntax to distinguish get_value (which is now what is called by the callable syntax on an FD) vs. encode:

julia> ag = AtomGraph(Float32.([0 1; 1 0]), ["H", "O"])
AtomGraph{SimpleWeightedGraph{Int64, Float32}}  with 2 nodes, 1 edges
        atoms: ["H", "O"]

julia> fd(ag)
2-element Vector{String}:
 "s"
 "p"

julia> encode(fd, ag)
4ร—2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0
 0.0  0.0
 0.0  0.0

I like this, personally. ๐Ÿ‘ Previously, to get the ["s", "p"] result, we'd have needed to encode and then decode, which is super clunky.

Second, in thinking about this, I've realized that as currently implemented, OneHotOneCold is not actually functionally all that different from @thazhemadam's SimpleCodec, since the actual onehot/onecold logic is all in EFD-specific utility functions. This commit is the first step towards separating that, but I'd like to finish moving that logic over there in a subsequent commit if you guys think it makes sense. Basically, the only things that need to be FD-specific are the computation of the value (now separated by the get_value function) and also figuring out what the bins are (we already have the EFD-specific get_bins function). Once you have those two things, the actual onehot encode/onecold decode are totally generic, and we should be able to move them all over there. I'll do that it another commit so hopefully if we want to revert one/both of these it won't be too hard to disentangle.

THINGS STILL TO DO!

DhairyaLGandhi commented 2 years ago

Does this mean fd(ag) == get_value(ag)?

rkurchin commented 2 years ago

Does this mean fd(ag) == get_value(ag)?

Yup!

rkurchin commented 2 years ago

Okay, 61a3abe now has the additional changes I was alluding to...

The basic concept is to continue the separation so that calculation of a featureโ€™s value happens entirely within the EFD, and the codec only handles encoding that to onehot or decoding it back. Iโ€™ve made some syntax changes in line with thisโ€ฆ

We previously had a callable syntax on codecs. Iโ€™ve removed this because I donโ€™t think itโ€™s necessary and I do think itโ€™s confusing. We now explicitly call encode(codec, val) or decode(codec, val). The loose philosophy here is โ€œFDโ€™s act on Atoms, Codecs act on values.โ€ One exception is that I did keep encode(fd, atoms) and decode(fd, val) as basically aliases to the codec analogues, as I think these are useful and not confusing.

For now, the logic for how to combine the encoded vectors for individual AFDโ€™s when encoding an entire atoms object is fixed to just be hcat. We may want to think at some point about generalizing that, but it will become trickier to define the decoding in that case.

@thazhemadam, eager for your thoughts on any/all of this!


Separate minor but annoying issue: I've somehow screwed up the scope of the get_value function so that it's accessible to internal functions in the package without issue, but when I do using ChemistryFeaturization in the REPL, I have to explicitly invoke it with FeatureDescriptor.get_value AND I CAN'T FIGURE OUT WHY (the main module ChemistryFeaturization.jl has export get_value in it, what else do I need?!?). @DhairyaLGandhi, are you a scope wizard in addition to an AD wizard? ๐Ÿง™

rkurchin commented 2 years ago

Also: so proud that as of now, this PR is a net decrease in lines of code. ๐Ÿ’ƒ

thazhemadam commented 2 years ago

Second, in thinking about this, I've realized that as currently implemented, OneHotOneCold is not actually functionally all that different from @thazhemadam's SimpleCodec, since the actual onehot/onecold logic is all in EFD-specific utility functions.

I'm not sure what exactly this means. ๐Ÿ˜… SimpleCodec is basically a skeleton, with just the encoding and decoding functions coupled together (which as of now we expect every Codec to have at the very least). The only difference is that OneHotOneCold actually has additional attributes that are also required for actually encoding/decoding using the encode_f/decode_f field.

Basically, the only things that need to be FD-specific are the computation of the value (now separated by the get_value function) and also figuring out what the bins are (we already have the EFD-specific get_bins function).

I assume you mean this with respect to onehot-onecold only? I assume one might need codec schemes that may be based on some other attributes of the data being represented by get_value itself. (however, I can't think of any, so it could possibly be me being overly cautious too lol).


We previously had a callable syntax on codecs. Iโ€™ve removed this because I donโ€™t think itโ€™s necessary and I do think itโ€™s confusing. We now explicitly call encode(codec, val) or decode(codec, val). The loose philosophy here is โ€œFDโ€™s act on Atoms, Codecs act on values.โ€ One exception is that I did keep encode(fd, atoms) and decode(fd, val) as basically aliases to the codec analogues, as I think these are useful and not confusing.

This sounds like a good idea to me. One reason why I don't like the current callable syntax is that Codecs facilitate two things at once - encoding and decoding, and having a single callable syntax that determines if it has to encode or decode depending on parameters always has been kind of clunky imo. The benefit of having it is that we can say "take this codec, and apply its codec-y logic on this atoms object for this feature descriptor". It's more restrictive, and users would be less prone to making silly mixups with the types, which I reckon we would still be able to enforce using the analogous encode and decode as necessary.

I'm on board with these changes. Assuming I've understood this correctly, it's worth taking a moment here to also see that this shifts the responsibility of actually encoding and decoding to the encode and decode functions. It would also fundamentally change the purpose of a Codec from "what does the encoding and decoding" to "what helps you do the encoding and decoding". (Note to self to update the docs to reflect that clearly before this PR is merged).

thazhemadam commented 2 years ago

Separate minor but annoying issue: I've somehow screwed up the scope of the get_value function so that it's accessible to internal functions in the package without issue, but when I do using ChemistryFeaturization in the REPL, I have to explicitly invoke it with FeatureDescriptor.get_value AND I CAN'T FIGURE OUT WHY (the main module ChemistryFeaturization.jl has export get_value in it, what else do I need?!?)

I'd be curious to know if there's a (more?) idiomatic way of going about this. IIRC, I've done this only in a "weave-in, weave-out" fashion, by importing it from the main module and exporting it from the submodule. (for instance, see 7e6ad96859f401972408488f201621b05db020d9)

DhairyaLGandhi commented 2 years ago

Re scoping - yeah. There's two things needed to fix this. First is that you should export get_value from the FeatureDescriptor module. Second, you need to actually call using .FeatureDescriptor in src/ChemistryFeaturization.jl without which, get_value was never included in the scope of ChemistryFeaturization the main module. Currently, only the ElementFeatureDescriptor and FeatureDescriptor (the symbol) are.

Re - idiomatic way: Yeah we usually don't have these many modules that are all intertwined and interdependent, there really isn't much benefit. You want them to be as self contained as possible. It causes the all the using statements in the files and confusion about what gets exported and to where. Also, I'd generally try to be cautious about what I export. exporting get_value , encode, decode etc seem like they could cause namespace collisions with other packages and their purpose can't directly be tied to CF because the names are too generic. Maybe we can overload Base.values for get_value?

thazhemadam commented 2 years ago

Maybe we can overload Base.values for get_value?

As nice as this sounds, I don't think it's a good fit. Base.values takes in an iterator and returns the value represented at that position in the data structure, which isn't logically the same thing as get_values does.

rkurchin commented 2 years ago

Finally getting back to this stuff after being tied up with some other things the past ~week and a half...first, to @thazhemadam's comment:

Second, in thinking about this, I've realized that as currently implemented, OneHotOneCold is not actually functionally all that different from @thazhemadam's SimpleCodec, since the actual onehot/onecold logic is all in EFD-specific utility functions.

I'm not sure what exactly this means. ๐Ÿ˜… SimpleCodec is basically a skeleton, with just the encoding and decoding functions coupled together (which as of now we expect every Codec to have at the very least). The only difference is that OneHotOneCold actually has additional attributes that are also required for actually encoding/decoding using the encode_f/decode_f field.

Based on the rest of your comments, I think you ended up figuring out what I meant even if I didn't explain it all that clearly, but I'll try again here just in case: Basically, we had this thing called OneHotOneCold and we called it a Codec, which I would intuitively interpret to mean that it handles encoding/decoding of the one-hot/one-cold variety. However, what it actually had was just generic encode_f and decode_f functions, and the fact that they happen to do one-hot/one-cold encoding/decoding wasn't enforced within the codec itself, but rather by EFD-specific functions that were fed into it. With the changes I'm making here, the encode/decode logic is moved into the codec itself, which both makes more intuitive sense to me and feels like a cleaner separation of concerns all-around. Is that clearer? ๐Ÿ˜†

Basically, the only things that need to be FD-specific are the computation of the value (now separated by the get_value function) and also figuring out what the bins are (we already have the EFD-specific get_bins function).

I assume you mean this with respect to onehot-onecold only? I assume one might need codec schemes that may be based on some other attributes of the data being represented by get_value itself. (however, I can't think of any, so it could possibly be me being overly cautious too lol).

In this case, I did mean OHOC only, but to me the fact that we've separated out the codec at all sort of implies a thesis that that separation is in general robust, and given I also can't think of any, I would argue we don't need to worry about it right now. #YAGNI


and re @DhairyaLGandhi's comment...

Re scoping - yeah. There's two things needed to fix this. First is that you should export get_value from the FeatureDescriptor module. Second, you need to actually call using .FeatureDescriptor in src/ChemistryFeaturization.jl without which, get_value was never included in the scope of ChemistryFeaturization the main module. Currently, only the ElementFeatureDescriptor and FeatureDescriptor (the symbol) are.

Re - idiomatic way: Yeah we usually don't have these many modules that are all intertwined and interdependent, there really isn't much benefit. You want them to be as self contained as possible. It causes the all the using statements in the files and confusion about what gets exported and to where. Also, I'd generally try to be cautious about what I export. exporting get_value , encode, decode etc seem like they could cause namespace collisions with other packages and their purpose can't directly be tied to CF because the names are too generic. Maybe we can overload Base.values for get_value?

Thanks for the scoping fix...I will someday get all that through my head and be able to it right the first (or, like, fifth) time.

Regarding the namespace conflicts, that's a reasonable point. My first thought was to add feature into the names e.g. encode_feature, but that's a bit confusing since it works for individual FD's but also featurizations, so I don't think that would work, though it would probably be okay for get_feature_value. We could obviously just not export them so you either explicitly import the function yourself as a user or always call ChemistryFeaturization.encode. I personally don't love that as it seems inconvenient, but it is probably the safest option. I agree with @thazhemadam on the Base.values not really being a good fit for that case...


Since it seems that there's general consensus on the changes being made here, my next plan is to update tests so that they pass with this stuff, and then start actually populating the SFD utils with some MolecularGraph functions so we can (finally!) start playing with those more "idiomatically."

rkurchin commented 2 years ago

Phew, that was fun. Note to @thazhemadam to do a review specifically on the orbitalfeature-related changes I made.

DhairyaLGandhi commented 2 years ago

but that's a bit confusing since it works for individual FD's but also featurizations

Seems reasonable to me, what would the source of confusion be?

Re values there's usually a bit of give in the interpretation of what what the APIs do in different contexts. Think of names(DataFrame) vs names(Module), but avoiding it for conceptual clarity is fair game.

rkurchin commented 2 years ago

Okay, starting to actually populate the SpeciesFeature stuff. See last commit message for some notes, but also here's the test code to see what's broken:

using MolecularGraph, ChemistryFeaturization
caffeine = smilestomol("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
ag = AtomGraph(caffeine)
sfd = SpeciesFeatureDescriptor("isaromatic")
get_value(sfd, ag) # this doesn't work, because the type parameters in the next line don't match...
typeof.([ag, sfd])

Specifically, here's the result:

julia> typeof.([ag, sfd])
2-element Vector{DataType}:
 AtomGraph{GraphMol{SmilesAtom, SmilesBond}}
 SpeciesFeatureDescriptor{GraphMol}

I assume there's gotta be some way to get those to register as the same thing (i.e. ignore the type parameter of GraphMol), but I'm not sure of the syntax to do it...

rkurchin commented 2 years ago

This is getting close to ready I think. What do you guys think about the way I've set up the SFD convenience constructors? If it looks good, I'll add all the other ones that are in WeaveModel's featurize.jl so that @DhairyaLGandhi and I can get the Weave stuff translated over to this new CF API.

(still to do: changelog/docs updates, SFD tests)

rkurchin commented 2 years ago

Some things are not totally done, but I think it makes sense to merge this and do a version bump at this point, especially since @DhairyaLGandhi and I are hoping to start benchmarking some models that need this functionality...

rkurchin commented 2 years ago

Yikes, so many tests suddenly failing when they pass no problem on my local machine...weird. @thazhemadam, do they pass locally for you? A bunch of the failures have to do with normalized_laplacian so maybe this is a LightGraphs vs. Graphs thing...

thazhemadam commented 2 years ago

I'll track this down. I think it might have something to do with version incompatibilities introduced in https://github.com/Chemellia/ChemistryFeaturization.jl/pull/118/commits/b3d1c905ab71b006ab35ce0dcf0e4d1ed57d7260 (I remember running into something similar a while ago when I tried to up this locally), but I'll know for sure when I bisect this.

so maybe this is a LightGraphs vs. Graphs thing...

I don't think that's likely the issue here as LightGraphs should still work properly.

DhairyaLGandhi commented 2 years ago

Were we able to track down the failures

thazhemadam commented 2 years ago

Were we able to track down the failures

I'd say, in a sense, we're closer. @rkurchin can now locally reproduce the errors and I managed to run into an entirely different error while precompiling, which I'm trying to fix (without nuking my ~/.julia)

rkurchin commented 2 years ago

Yeah, I can locally reproduce, as Anant said, and after this meeting I will dig into it a bit more...

rkurchin commented 2 years ago

There are definitely things about what exactly the local environment depends on that don't totally make sense to me, though. It's rather irritating that the tests could pass locally for me and then only fail once I nuke a bunch of things and regenerate them...