Implement free-text parsing for chemical formulas, like in the website

eladnoor commented 6 years ago

Not sure about this, since users might be tempted to use it instead of much more robust solutions based on chemical ID mappings. Please add your comments below explaining why you think we should do it.

NoahMesfin commented 6 years ago

Background: In the ScrumPy metabolic modelling software, metabolite names are not constrained to IDs associated with a database. Genome-scale metabolic models are built using BioCyc databases and manual curation, so metabolite names follow BioCyc naming schemes by convention only, e.g. ATP is "ATP" and D-lactate is "D-LACTATE", while some metabolites have less readable IDs, e.g. (R)-acetoin is "CPD-10353". Larger compounds not found in databases, such as a teichoic acid specific to the bacterium Acetobacterium woodii, can be user-defined, along with their relationships to enzymes, genes, etc. Possible stoichiometric errors introduced this way are checked with other ScrumPy tools.

Another benefit of allowing user-defined names is seen with other types of modelling. For example, ScrumPy is currently being used to model the kinetics of neuronal glycine receptors in different conformational states. These states are not defined in any database.

Actual request: I understand the cases mentioned above may not be relevant to eQuilibrator at this time, and of course it's clear that without the rigorous naming schemes provided by databases, changes in Gibbs energy may not be calculated accurately. However, the web interface has a very attractive implementation that allows readable, specific compound names to be used. The eQuilibrator API was also very easy to install, which I agree would be a shame to lose by requiring more dependencies.

Could one solution for mapping text to chemicals be to provide an extension package for non-KEGG users? Together with the check_full_reaction_balancing() function, this may be enough to keep the rigour you need.

Thanks for all your efforts!

flamholz commented 6 years ago

The extension package seems like the right idea to me.
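
To make the idea concrete, here is a rough sketch of the kind of name matching such an extension package might do. It is only an illustration under stated assumptions, not code from eQuilibrator: the synonym table, the function name `match_compound`, and the `cutoff` value are all hypothetical, and a real package would need a far larger, curated table. It uses only the Python standard library (`difflib`) so that no extra dependencies are implied.

```python
import difflib

# Hypothetical synonym table: lower-case free-text names -> KEGG compound IDs.
# A real extension package would ship a much larger, curated mapping.
SYNONYMS = {
    "atp": "C00002",
    "adp": "C00008",
    "water": "C00001",
    "h2o": "C00001",
    "phosphate": "C00009",
    "orthophosphate": "C00009",
    "pyruvate": "C00022",
}


def match_compound(name, cutoff=0.8):
    """Return (kegg_id, matched_name) for a free-text name, or None."""
    key = name.strip().lower()
    if key in SYNONYMS:
        return SYNONYMS[key], key
    # Fall back to approximate matching so small typos still resolve.
    close = difflib.get_close_matches(key, list(SYNONYMS), n=1, cutoff=cutoff)
    if close:
        return SYNONYMS[close[0]], close[0]
    # Unknown names (e.g. user-defined receptor states) are reported, not guessed.
    return None
```

Returning `None` for unmatched names, rather than the nearest guess, is a deliberate choice in this sketch: a silently wrong compound would undermine the rigour that ID-based mappings are meant to provide.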

eladnoor commented 6 years ago

I implemented the parser, reusing the code from eQuilibrator online. This means that the chosen KEGG IDs for a reaction will always match the ones in eQuilibrator. However, thorough testing is required to verify this. The new dependencies are: pandas, pyparsing, and nltk.
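
For readers unfamiliar with what free-text reaction parsing involves, the following is a minimal standard-library sketch of the first step: splitting a reaction string into signed stoichiometric coefficients keyed by compound name. It is not the parser described above (which reuses the eQuilibrator online code and depends on pyparsing); the function names `parse_side` and `parse_reaction` and the accepted arrow spellings are assumptions made for illustration. Mapping each name to a KEGG ID would be a separate step.

```python
import re

# A term is an optional numeric coefficient followed by a compound name.
# The coefficient must be separated from the name by whitespace, so names
# such as "3-phosphoglycerate" are not mistaken for "3" + "-phosphoglycerate".
TERM = re.compile(r"^\s*(?:(\d+(?:\.\d+)?)\s+)?(.+?)\s*$")


def parse_side(text):
    """Parse 'a A + b B' into {name: coefficient}."""
    side = {}
    # Split only on '+' surrounded by whitespace, so 'NAD+' stays intact.
    for chunk in re.split(r"\s\+\s", text):
        m = TERM.match(chunk)
        if not m:
            raise ValueError(f"cannot parse term: {chunk!r}")
        coeff = float(m.group(1)) if m.group(1) else 1.0
        name = m.group(2)
        side[name] = side.get(name, 0.0) + coeff
    return side


def parse_reaction(text):
    """Parse 'substrates = products' into {name: signed coefficient}.

    Substrates get negative coefficients, products positive ones.
    """
    parts = re.split(r"\s*(?:<=>|=>|=)\s*", text)
    if len(parts) != 2:
        raise ValueError("expected exactly one reaction arrow")
    left, right = parts
    stoich = {}
    for name, c in parse_side(left).items():
        stoich[name] = stoich.get(name, 0.0) - c
    for name, c in parse_side(right).items():
        stoich[name] = stoich.get(name, 0.0) + c
    return stoich
```

For example, `parse_reaction("ATP + H2O = ADP + phosphate")` yields a dictionary with -1.0 for ATP and H2O and 1.0 for ADP and phosphate. The names would then still have to be resolved to compound IDs and the reaction checked for balance.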

eladnoor / equilibrator-api

Implement free-text parsing for chemical formulas, like in the website #9