identifiers.py - Githubissues

Thanks for working on this and asking for my comments. Here are my comments at first glace:

The optional autoload and cache arguments seem good to me. Just for good measure, we might want to clear the "chemical_search_cache" if the length is really long (although probably not a problem in long run). I would think it would be better to have the "chemical_search" function wrap a "_chemical_search" function that does not save the cache. This would decrease all the extra layers of if-statements for every return scenario.
Loading new files will replace items in "names_index". I've had the problem that Anion and Cation common names overlap with the ones from the main_db (which are the ones you would expect to get). As of right now, "autoload_next" will replace common names from the main_db with the ones in the user_db. In thermosteam.properties.chemicals, I autoload the main_db then the rest and make sure items are not replaced in "names_index". I understand that the main_db is pretty big, so maybe there is a better way to solve this problem such that main_db file is autoloaded last.
The next in the "autoload_next" method name gives the false impression that files are loaded one by one (when they are actually all loaded at the same time).
The elements are loaded into the ChemicalMetadataDB instance, but the search_chemical function searches through the periodic table first... lines 361-379 could be deleted and the performance would be better.
A main_db, user_dbs, can_autoload , and loaded_main_db property/attributes can probably be simplified. I would suggest having two lists, a loaded_files and an unloaded_files. We could pop unloaded_files, load them, and append them to loaded_files and keep autoloading until we find the chemical or run out of unloaded files. This would also make it more clear which files have been loaded (in contrast to a can_autoload property), and the possibility of staging files to autoload (which is not possible right now in chemicals; but is possible in thermosteam).

Here is one additional small change:

# Original
def search_pubchem(self, pubchem, autoload=True):
    if type(pubchem) != int:
        pubchem = int(pubchem)
    return self._search_autoload(pubchem, self.pubchem_index, autoload=autoload)

# This is slightly faster and concise
def search_pubchem(self, pubchem, autoload=True):
    return self._search_autoload(int(pubchem), self.pubchem_index, autoload=autoload)

I hope these comments are helpful and I didn't misinterpret any code, Thanks!

Hi Yoel,

Thank you for the comprehensive feedback. I have incorporated it into the file, hopefully making it quite a bit better. The changes are:

1) The design of the database is now - load all the small databases; load the elements. If a search fails, load the large database; load the small databases; load the elements. Search again. This makes sense to me because the small databases are, well, small. 2) We seem to have different experiences with name conflicts. For me, my intention has always been for the small databases to supersede the values in the big database I can't really edit. If you have a conflict, please let me know. 3) I renamed autoload_next to finished_loading. 4) The element search is necessary to handle searches like "nitrogen" and 1 correctly. I cleaned it up and added some comments. 5) I implemented a fast cache size limit and did the wrapper function like you suggested - a clear improvement.

Thank you for the feedback and I hope you can find the issue you had with ions so we can fix it.

Sincerely, Caleb

CalebBell / chemicals

identifiers.py #18