MolSSI / QCFractal

A distributed compute and database platform for quantum chemistry.

https://molssi.github.io/QCFractal/

BSD 3-Clause "New" or "Revised" License

144 stars 47 forks source link

modifed line to eliminate duplicate entries in the return dataframe of a ReactionDataset #704

Closed stvogt closed 2 years ago

stvogt commented 2 years ago

Description

When doing query results and in the database there are two molecules that are identical except for the keywords, then both molecules are pulled even thoughkeyword=None is explicitly used. This has as a consequence that the result record of both are stored in the dataframe of the ReactionDataset, which breaks the stoichiometry and gives wrong values when using ds.get_values(method='XXXX'). For example if the same molecule was computed with the keyword uks and None, for an open shell molecule the result is the same (since uks is the default), such that both result records appear in the dataframe as en entry because both are pulled from the database when building the stoichiometry in databse.py.

Changelog description

modifed line 1188 to remove duplicate entries in the dataframe

Status

[ ] Code base linted
[ ] Ready to go

codecov[bot] commented 2 years ago

Codecov Report

Merging #704 (0547934) into master (8b96203) will decrease coverage by 1.02%. The diff coverage is 100.00%.

bennybp commented 2 years ago

I don't see exactly how the reaction dataset can pull the same results (with different keywords). I do know that there can be duplicated results (with keywords = None or basis = None). Either way I think this temporarily fixes the issue.

The new version/branch I am working on fixes this issue entirely, but will be another month or 2 away. So this should work for now.

One question: Can you explain what is going on with the ~ operator? I know in general what it does, but not how it interacts with pandas & indices.

dotsdl commented 2 years ago

@bennybp the ~ is bitwise-not, and is used as a negation on booleans. In this case ret[0].index.duplicated(keep='first') gives back a dataframe of bools based on whether a given row is a duplicate of an earlier one, and the ~ flips the Trues and Falses so that only the "non-duplicates" (first instances) are kept from ret[0].

And agreed, this is an ugly hack but necessary at the moment until we have the refactor released. I'm cool with merging it.

dotsdl commented 2 years ago

@stvogt please run black against the changed file and push. That will address the lint failure. Thanks!

bennybp commented 2 years ago

I will merge this and cleanup in the release commits