bjodah / chempy

⚗ A package useful for chemistry written in Python
BSD 2-Clause "Simplified" License

Question about comparing formulae #235

Open hepcat72 opened 2 months ago

hepcat72 commented 2 months ago

I was wondering if you might be able to provide any insights to my following problem.

We receive chemical compound formulae from some Mass Spec software (Maven or El Maven). There's always a compound name that accompanies the formula, but that isn't always consistent (since compounds can have many synonyms). When we don't have a matching compound name/synonym, we have to determine if the name provided is a synonym of an existing compound in our database or if we need to add a new compound to our database.

There are of course multiple ways to accomplish this, but one helper method I added recently was to present the researcher with a list of possible compound matches. I naïvely did this by listing all existing compounds in our database that have the same formula. We immediately found that this can miss existing entries, because the formula from the mass spec data can represent the ionized version of the compound (missing an H or with an extra H).

My subsequent (naïve) thought was to expand the search to include matches that differ by some threshold of hydrogens. You might be able to provide better suggestions for this strategy, but if that DOES sound reasonable, is there an existing method in your package that can compare formulas or take the difference of 2 formulas, e.g. C19H37NO5 - C19H35NO5 = H2?

jeremyagray commented 2 months ago

You could use the parsing functions, such as formula_to_composition, to get the composition dict for each formula. Its keys are atomic numbers, so you could use the hydrogen key (1) to compare the number of hydrogens.
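As a rough sketch of that idea (not an existing chempy method), the comparison could be done on the composition dicts themselves. The helper names composition_difference and differs_only_in_hydrogen and the max_h threshold are hypothetical, made up for illustration. The dicts are written by hand here in the atomic number -> count shape so the snippet runs without chempy installed; in practice they would come from formula_to_composition. The sketch also assumes that a key of 0, if present, holds the net charge and should be ignored.

```python
# In practice the dicts would come from chempy, e.g.:
#   from chempy.util.parsing import formula_to_composition
#   comp = formula_to_composition("C19H37NO5")
# Here they are written by hand (atomic number -> count).

def composition_difference(comp_a, comp_b):
    """Return comp_a - comp_b as {atomic_number: count}, dropping zeros."""
    keys = set(comp_a) | set(comp_b)
    diff = {k: comp_a.get(k, 0) - comp_b.get(k, 0) for k in keys}
    return {k: v for k, v in sorted(diff.items()) if v != 0}


def differs_only_in_hydrogen(comp_a, comp_b, max_h=2):
    """True if the compositions match apart from at most max_h hydrogens.

    Assumption: key 0 (net charge), if present, is ignored.
    """
    diff = composition_difference(comp_a, comp_b)
    diff.pop(0, None)  # ignore a charge difference
    only_hydrogen = all(k == 1 for k in diff)
    return only_hydrogen and abs(diff.get(1, 0)) <= max_h


c19h37no5 = {1: 37, 6: 19, 7: 1, 8: 5}
c19h35no5 = {1: 35, 6: 19, 7: 1, 8: 5}
c19h37no4 = {1: 37, 6: 19, 7: 1, 8: 4}

print(composition_difference(c19h37no5, c19h35no5))    # {1: 2}, i.e. H2
print(differs_only_in_hydrogen(c19h37no5, c19h35no5))  # True
print(differs_only_in_hydrogen(c19h37no5, c19h37no4))  # False
```

Identical formulas also count as a match here, so an exact-formula search is a special case of the hydrogen-tolerant one.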


-- Jeremy A. Gray