ClapeyronThermo / GCIdentifier.jl

tools to perform group contribution (GC) identification, given the SMILES of a compound
MIT License
14 stars 2 forks source link

Group assignment failing for 1,3,5-trioxane and incorrect (when compared to literature) for 1,3-dioxane; also different for 2-MeTHF #14

Closed thomvet closed 1 month ago

thomvet commented 1 month ago

Consider the following code:

e = get_groups_from_smiles("C1COCOC1", UNIFACGroups) #1,3-dioxane
f = get_groups_from_smiles("C1OCOCO1", UNIFACGroups) #1,3,5-trioxane

SMILES strings for the two molecules are from wikipedia (https://en.wikipedia.org/wiki/1,3-Dioxane), https://en.wikipedia.org/wiki/1,3,5-Trioxane) and I arrive at the same when I do it "by hand". I have also checked them with an online structure editor -> smiles generator tool as well (https://cactus.nci.nih.gov/translate/). So, they are presumably correct.

If you are running the code above, you will see that a result is obtained for 1,3-dioxane, but 1,3,5-dioxane fails with an error message (that is actually also a bug; I'll open a PR for this shortly).

1,3-dioxane results: ("C1COCOC1", ["CY-CH2" => 1, "THF" => 1, "CY-CH2O" => 1]) This looks reasonable, because the whole molecule is covered (THF group is actually CH2OCH2 in UNIFAC, then a cyclic-CH2 and another cyclic CH2O group). However, I would argue that this fragmentation is most likely wrong.

This is based on the fragmentation given in one of the original UNIFAC-Dortmund papers (https://pubs.acs.org/doi/pdf/10.1021/ie00013a024). Specifically, the following fragmentation is given for the two above-mentioned molecules: image

The CY-CH2O group is actually given as "c-CH2O[CH2]_{1/2}" (curly braces introduced by me to indicate subscript). Presumably, this must be understood to mean that the group to be matched here is actually a CH2OCH2 group, but that only half of the second CH2 group belongs to this UNIFAC group, while the other half belongs to another UNIFAC group. Hence, the result of the fragmentation given in the snapshot.

Similarly, the 1,3,5 trioxane case should be fragmented into 3 TRIOXAN groups.

Also, I am surprised that these two different groups have the same SMARTS strings attached to them:

GCPair(raw"[CX4H2;R][OX2;R]","CY-CH2O")
GCPair(raw"[CX4H2;R][OX2;R]","TRIOXAN")

I have a hard time accepting that that could be correct?

There's another weird thing I saw. The fragmentation of 2-MethylTHF (a common "green" solvent) is seemingly correct, but it differs from what I got out of the DDBST online group assignment tool when it was still online and for free (apparently they removed it some time ago :) or at least I am unable to find it...)

What I get from: a = get_groups_from_smiles(smiles[1], UNIFACGroups) is ("O1C(C)CCC1", ["CH3" => 1, "CY-CH2" => 2, "CY-CH" => 1, "CY-CH2O" => 1]) That's also inconsistent with how the paper mentioned above implemented the CY-CH2O group (CH2O[CH2]_{1/2}).

What I got from DDBST group assignment tool was: 1 x CH3 group 1 x THF group 2 x cy-CH2 group (which is the fragmentation of tetrahydrofuran, but with a methyl group added...).

Now, obviously, this is also "incorrect" in the sense that the THF group in Unifac-Do should actually represent a cyclic CH2OCH2 and 2-MethylTHF has a CH2OCH group (therefore, the above fragmentation actually just adds a hydrogen atom!) - maybe it's worth comparing both fragmentations against experimental data and see which one is better?

pw0908 commented 1 month ago

Hi Thomas,

Wow, there's a lot to work through here. Clearly one thing we need to fix is our SMARTS for the TRIOXAN, THF and CY-CH2O groups (we borrowed our definitions from the thermo package) I decided to add some conditional strings to each of them to hopefully improve the representation. For both THF and Trioxane, the results are as expected:

julia> get_groups_from_smiles("C1OCCC1", UNIFACGroups)
("C1OCCC1", ["CY-CH2" => 2, "THF" => 1])

julia> get_groups_from_smiles("C1OCOCO1", UNIFACGroups)
("C1OCOCO1", ["TRIOXAN" => 3])

As for 1,2-dioxane, Im not quite sure what the best strategy would be. Our approach treats whole atoms; this 1/2 a functional group is not compatible with this. The best I can think of right now is to define a larger group (CY-CH2OCH2OCH2) and have that translate to 2 CY-CH2O groups. It would mean that this group can't be used with the TRIOXAN group; although, looking at it, I can't imagine these two groups being used in conjunction. (Thoughts @longemen3000)

In your case, I dont think 2-methyl THF can be represented using UNIFAC. No combination of the UNIFAC groups can cover all atoms appropriately.

I hope this helps!

Best regards,

Pierre

longemen3000 commented 1 month ago

Hmm, good idea, we can match larger groups and then substitute those by the actual group types. If you look it closely, the TRIOXAN group can only be surrounded by other TRIOXAN groups or a CY-cH2O group, whereas a CY-cH2O groups must have one TRIOXAN or CY-CH2O.

We have the infrastructure to do the substitution, we just have to see how to hook up the postprocessing stage

thomvet commented 1 month ago

Sorry for the massive initial post, but I thought it's better to be detailed.

Wow, there's a lot to work through here. Clearly one thing we need to fix is our SMARTS for the TRIOXAN, THF and CY-CH2O groups (we borrowed our definitions from the thermo package) I decided to add some conditional strings to each of them to hopefully improve the representation. For both THF and Trioxane, the results are as expected:

julia> get_groups_from_smiles("C1OCCC1", UNIFACGroups)
("C1OCCC1", ["CY-CH2" => 2, "THF" => 1])

julia> get_groups_from_smiles("C1OCOCO1", UNIFACGroups)
("C1OCOCO1", ["TRIOXAN" => 3])

This looks like appropriately fixed; great!

As for 1,3-dioxane, Im not quite sure what the best strategy would be. Our approach treats whole atoms; this 1/2 a functional group is not compatible with this. The best I can think of right now is to define a larger group (CY-CH2OCH2OCH2) and have that translate to 2 CY-CH2O groups. It would mean that this group can't be used with the TRIOXAN group; although, looking at it, I can't imagine these two groups being used in conjunction. (Thoughts @longemen3000)

I agree that this seems a feasible solution. Are the molar masses of the groups somewhere defined in your package ecosystem? (Clapeyron.jl?); if they are somewhere defined/used, this could lead to some confusing discrepancies if one is not careful?

In your case, I dont think 2-methyl THF can be represented using UNIFAC. No combination of the UNIFAC groups can cover all atoms appropriately.

I agree that it can't be properly represented with the UNIFAC groups, but the Dortmund Databank keeps insisting in its current version (checked on the Explorer version) that the groups should be assigned as: 1 x CH3 group 1 x THF group 2 x cy-CH2 group I'll pass an email to them to get their opinion why they are claiming that this is fine. Maybe they have their reasons (other than "hey, it's the closest we can get").

edit: just checked. It's doing the analogous thing for 2,5-DimethylTHF and 2,2,5,5-TetramethylTHF. Hm.

pw0908 commented 1 month ago

I'll try to do the proposed fix at some point tomorrow. As for the group molar masses, we do have them in Clapeyron.jl here. What confusions are you worried of here? As long as the CY-CH2O group has the correct Mw, I don't forsee any issues?

As for DDB, unless they updated the group definition in a later paper (which they may have), I don't see how it could be able to represent any of these molecules. If they haven't fitted them and you still need some sort of GC approach, it would be relatively easy to fit the parameters, assuming you have the data.

pw0908 commented 1 month ago

Alright, this latest push addresses the issue with 1,2-dioxane. I've added some tests to confirm this:

julia> get_groups_from_smiles("C1OCOCO1", UNIFACGroups)
("C1OCOCO1", ["TRIOXAN" => 3])

julia> get_groups_from_smiles("C1OCOCC1", UNIFACGroups)
("C1OCOCC1", ["CY-CH2" => 1, "CY-CH2O" => 2])

julia> get_groups_from_smiles("C1OCCC1", UNIFACGroups)
("C1OCCC1", ["CY-CH2" => 2, "THF" => 1])

These group assignments are all correct according to the original table you sent @thomvet. Unfortunately it won't fix the issue with MeTHF but that's more of a UNIFAC problem.

thomvet commented 1 month ago

This fix looks great. Regarding the discrepancies: it's not an issue; the molar weight is correctly reflected in Clapeyron.jl.

By the way, I got an answer from DDB; apparently they looked into splitting the THF subgroup further up to cover things like 2-MeTHF, but there wasn't really a benefit for it. The practice now is that the THF group gets used even if the involved carbon atoms are (mono- or di-) substituted.

One way of implementing this would be to have two new groups (cy-CHO, cy-CO) with the correct molar weights and same interaction parameters as the THF group, so that: 2-MethylTHF gets fragmented into cy-CHO, CH3, 2 x cy-CH2 2,2-DimethylTHF gets fragmented into cy-CO, 2 x CH3, 2 x cy-CH2

I realize that this would be a change that isn't really backed up by a publication (or at least I haven't found it yet), so maybe it's something I have to do in a local version.

pw0908 commented 1 month ago

Regarding this unpublished approach. We wouldn't mind including it in the package as long as they'd be willing to give some sort of indication that this is the recommended approach on their website. This is an open-source project so we can implement these things as we wish, but we wouldn't want future users to see this weird(?) assignment and wonder where it comes from without a source to back it up.

Otherwise, assignments like this are completely up to the user and you can make your own modifications as needed.

Anyways, for the most part, I think we've dealt with the main issue with respect to the published UNIFAC groups?

thomvet commented 1 month ago

Yes, issue is resolved. I am closing it. Regarding source: I agree with your viewpoint. I'll see whether it's possible to get something "on the record", but I am not that hopeful.