ReactionMechanismGenerator / RMG-website

A Django-powered website for Reaction Mechanism Generator (RMG)
Other
21 stars 29 forks source link

PopulateReactions not working properly #118

Closed mliu49 closed 8 years ago

mliu49 commented 8 years ago

The website was not updated following ReactionMechanismGenerator/RMG-Py@bfa7516fc8b80691d414c76e18099bece4d1cde1, so the PopulateReactions tool does not work at all, as can be seen on the dev site.

However, even after making the appropriate change (5ca3188632e51125e2713d9829b5e7e152dfda61), the tool still does not seem to work properly when tested locally. Specifically, it will not find a single reaction. Testing the RMG-Py script directly from the command line gives the expected result, so the problem appears to be on the website side. The live website is currently working, but when I checked out the same commits locally, it also gave zero reactions.

One difference I noticed is that the website will give the following warning

Should only make one instance of RMGDatabase because it's stored as a module-level variable!
Unexpected behaviour may result!

while running from the command line will not, which I assume is because the website preloads the database? However, I don't know why that would result in not finding any reactions, and I'm also extremely confused as to how the live website is working.

I'm also working on writing unit tests for this.

connie commented 8 years ago

Found the reason why this doesn't work:

rmgpy.rmg.react.reactMolecules uses getDB('kinetics').families. The retrieved database is a global database through rmgpy.data.rmg.getDB. The returned families is an empty dictionary, which is why it will react over no families.

Basically the error msg comes from the global database not being set by the RMGDatabase instance that was created when we initialize the RMG object in generate_reactions: see line 67 in rmgpy.data.rmg. Therefore it cannot retrieve the proper database.

mliu49 commented 8 years ago

I'm not sure I completely understand. Why does it work if executed from the command line, but not through the website?

mliu49 commented 8 years ago

Ok, I understand what you mean now.

  1. The website pre-initializes the global database (Database A), but does not load any data
  2. generate_reactions initializes a second database (Database B) and loads the necessary data
  3. Database B is not stored as the global variable because one already exists (Database A), hence the warning messages
  4. reactMolecules uses getDB, which queries the global database (Database A) for data
  5. Because Database A has no data, no reactions are generated

When running from the command line, Database A does not exist, so the global database and the database created by generate_reactions are the same object.

This doesn't seem to be desirable behavior, but I'm not sure what the best way to fix it would be.

As a side note, the live website is still on a commit prior to the implementation of getDB, which is why it works.

mliu49 commented 8 years ago

It seems that this problem needs to be solved via changes in RMG-Py, since the website has to initialize the database on startup. Possible options that I can think of include the following:

  1. In rmgpy.rmg.main.RMG.loadDatabase(), somehow check to see if the global database exists, and if so, don't create a new instance
    • I'm not sure if this is actually doable
  2. In rmgpy.data.rmg.RMGDatabase(), assign the global database to self if the global database already exists
    • I'm also not sure if this is possible or if it will break other things
  3. Allow for some argument to be supplied when initializing an RMGDatabase object which will overwrite the global database with the newly created object
    • This seems like a good option, but there's probably a downside I'm not seeing

Any thoughts or better suggestions would be appreciated!

nickvandewiele commented 8 years ago

Regarding option 1: is this not already happening in the constructor of rmgpy.data.rmg.RMGDatabase()? Could we add a similar check to the rmgpy.data.rmg.RMGDatabase.loadDatabase() method?


# Module-level variable to store the (only) instance of RMGDatabase in use.
database = None

class RMGDatabase:
    """
    The primary class for working with the RMG database.
    """

    def __init__(self):
        ...
        # Store the newly created database in the module.
        global database
#        assert database is None, "Should only make one instance of RMGDatabase because it's stored as a module-level variable."
        if database is None:
            database = self
        else:
            logging.warning("Should only make one instance of RMGDatabase because it's stored as a module-level variable!")
            logging.warning("Unexpected behaviour may result!")
mliu49 commented 8 years ago

It is being checked in rmgpy.data.rmg.RMGDatabase(). but at that point, a new RMGDatabase() object has already been created, which leads to self and global database becoming two different objects. The intention of option 1 would be to perform the check earlier, before rmgpy.rmg.main.RMG.loadDatabase() does self.database = RMGDatabase(). However, the more I think about it, the less viable that idea seems.

rwest commented 8 years ago

I've not been following the context very closely, but how about something like

In [1]: import rmgpy.data.rmg

In [2]: the_one_and_only = rmgpy.data.rmg.database or rmgpy.data.rmg.RMGDatabase()

In [3]: id(the_one_and_only)
Out[3]: 4418936200

In [4]: the_one_and_only = rmgpy.data.rmg.database or rmgpy.data.rmg.RMGDatabase()

In [5]: id(the_one_and_only)
Out[5]: 4418936200

?

connie commented 8 years ago

Option 1: The problem with this is that the two databases loaded will actually be different. The populate reactions input file will have the user specified thermo and kinetics libraries, whereas the website loaded one generally has ALL of the libraries I think? Then again I'm not sure what's going on because none of the families are loaded in that one. But there will definitely be a problem with the different database contents.

For the reason of different loading affecting the generated outcome.. none of the options seem super viable to me. Unless there is some way to not use all the contents (families/libraries) loaded within an RMGDatabase object but instead selectively obtain them.

mliu49 commented 8 years ago

The website preloads the database in order to populates lists for thermo libraries, kinetics libraries, and solvents, for the input file generator and solvation search. Therefore, it doesn't load the kinetics families.

Nick also suggested the option of loading everything at the beginning, but then the generate_reactions script would need to be reworked to account for user selection of families and libraries. Currently, it relies heavily on methods used for typical RMG runs, so it expects the database to be loaded according to the input file.

With respect to the databases being different, after the rmgpy.data.rmg.RMGDatabase() object is created, rmgpy.data.rmg.RMGDatabase.load() is called to actually load the databases, which I think basically resets what's been loaded. The issue is that the load method loads everything to the self object, while the getDB methods gets data from the global database.

Perhaps changing the load method to load to the global database might be an option? However, if that were done, whenever an RMGDatabase() object is instantiated, a new object would be created but never used because loading and getting data would all occur with the global object.

Stepping back though, the idea that the website reloads the global database object every time a user submits a populate reactions task seems to be undesirable. Perhaps it would actually be best to redesign the website to load everything once, then have populate reactions only use the user selected libraries/families. However, then it would diverge from the generate reactions script in RMG-Py.

Sorry for so much text. It doesn't seem like there's going to be a simple solution.

mliu49 commented 8 years ago

After discussing with Connie, we think the easiest solution is to actually revert to using a subprocess call. It doesn't seem worth it to modify multiple parts of the core RMG code to fix one website tool.