reaction_selection.py empty dataframe

JeremyMolineau commented 6 months ago

I had trouble with the part of reaction_selection.py that removes reactions with more than 2 siblings (line 244). When I ran the script with validated_reactions.csv as input, it stopped because the generated dataframe has no unique() attribute.

It seems that the pandas series named "info" extracts nothing for the "P" suffix, because no reaction ID has this identifier. So later, unique() cannot be applied when building unique_doc, because it is called on an empty dataframe. I commented out these lines and the script then generated the validated_reactions_all.csv file. Is this due to a change in the USPTO IDs?
Also, for "sib_sel = data["document_id"].isin(unique_doc)" (line 253), there is no "document_id" column in the input data. Was it intended for data from a source other than the USPTO?
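
To illustrate why nothing is extracted, here is a minimal sketch that applies the same regular expression with the standard library instead of pandas. The helper name product_number and all sample IDs are made up for illustration; real USPTO or internal IDs may look different.

```python
import re

# Pattern copied from the snippet discussed in this thread.
PRODUCT_SUFFIX = re.compile(r"_P(?P<product_no>\d)$")


def product_number(reaction_id):
    """Return the trailing product number, or None without a _P<digit> suffix."""
    match = PRODUCT_SUFFIX.search(reaction_id)
    return int(match.group("product_no")) if match else None


# Hypothetical IDs, for illustration only.
ids_with_suffix = ["docA_src_0001_P1", "docA_src_0001_P3"]
ids_without_suffix = ["US0000001;0001", "US0000002;0002"]

with_suffix = [product_number(i) for i in ids_with_suffix]
without_suffix = [product_number(i) for i in ids_without_suffix]

print(with_suffix)     # [1, 3]
print(without_suffix)  # [None, None]
```

When every ID falls into the second group, the pandas equivalent yields only missing values, so every later step works on an empty selection.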

mujeebarshad commented 6 months ago

I am facing a similar issue, where it says "attribute unique does not exists", for the same reason you mentioned in the issue. Were you able to find a workaround for this one? Were you able to successfully generate uspto_ringbreaker_template_library.csv?

SGenheden commented 6 months ago

I pushed a branch with a bugfix for this. That part of the script was intended for AstraZeneca-specific data structures.

Check out the new branch

git pull
git checkout 17-bugfix-no-siblings

and please confirm whether this works. Then I will create a PR and push this to master.

mujeebarshad commented 6 months ago

Worked for me. Thank you for the quick response.

JeremyMolineau commented 6 months ago

We have tried your branch named "17-bugfix-no-siblings". There is still a small problem at line 273 of reaction_selection.py: in your modification, sib_sel is only assigned inside an if block, so it is not defined at line 273 when the condition is false.
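
A minimal sketch of this failure mode, independent of the actual script: a name assigned only inside an if branch is undefined when the branch is skipped. The function select and its data are hypothetical.

```python
def select(has_sibling_info):
    """Hypothetical stand-in for the filtering step in the script."""
    if has_sibling_info:
        sib_sel = [True, False, False]
    # When the branch above is skipped, sib_sel was never assigned.
    return [not flag for flag in sib_sel]


try:
    select(False)
    error_name = None
except NameError as err:  # UnboundLocalError is a subclass of NameError
    error_name = type(err).__name__

print(error_name)
```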

SGenheden commented 6 months ago

Ok. That's what happens if you try quick solutions without testing ;-) Can you suggest a code snippet that solves the problem?

JeremyMolineau commented 6 months ago

Of course. Together with @SBC-ICOA, we suggest adding a flag (tag_sib):

From line 244:

info = data["id"].str.extract(r"_P(?P<product_no>\d)$", expand=False)
prod_val = info[~info.isna()].astype(int)
# Only run the sibling check when product numbers and document IDs exist
tag_sib = False
if len(prod_val) > 0 and "document_id" in data.columns:
    tag_sib = True

    prod_sel = prod_val > 2
    data_sel = data.index.isin(prod_val[prod_sel].index)
    unique_doc = (
        data[data_sel]
        .apply(lambda row: row["id"].split(f"_{row['source']}_")[0], axis=1)
        .unique()
    )
    sib_sel = data["document_id"].isin(unique_doc)
    print_(f"Removing {sib_sel.sum()} reactions with more than 2 siblings")

And from line 266:

# Apply the sibling filter only when sib_sel was computed above
if tag_sib:
    data = data[
        (~sel_likelihood)
        & (~sel_small)
        & (~sel_big)
        & (~sel_wildcard_atom)
        & (~sel_unchanged)
        & (~sel_radical)
        & (~unwanted_classes)
        & (~sel_cgr_creation)
        & (~sel_dynamic_bond)
        & (~sib_sel)
        & (~nrings_sel)
    ]
else:
    data = data[
        (~sel_likelihood)
        & (~sel_small)
        & (~sel_big)
        & (~sel_wildcard_atom)
        & (~sel_unchanged)
        & (~sel_radical)
        & (~unwanted_classes)
        & (~sel_cgr_creation)
        & (~sel_dynamic_bond)
        & (~nrings_sel)
    ]

We tested this on a local branch and it worked.

SGenheden commented 5 months ago

Thanks for the suggestion. I made a push with a slightly shorter solution.
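
The pushed change itself is not shown in this thread. As a sketch only, one way to shorten the suggestion above is to start from an all-False mask, so that the final filter can always include (~sib_sel) without a second branch. The helper name sibling_mask, the max_siblings parameter, and the sample data are hypothetical, and this is not necessarily what was merged.

```python
import pandas as pd


def sibling_mask(data, max_siblings=2):
    """Mask reactions from documents with a product number above max_siblings.

    Hypothetical helper: returns an all-False mask when the data has no
    "document_id" column or no IDs ending in "_P<digit>".
    """
    mask = pd.Series(False, index=data.index)
    if "document_id" not in data.columns:
        return mask
    info = data["id"].str.extract(r"_P(?P<product_no>\d)$", expand=False)
    prod_val = info.dropna().astype(int)
    if prod_val.empty:
        return mask
    flagged = data.loc[prod_val[prod_val > max_siblings].index]
    # Document prefix is everything before "_<source>_" in the reaction ID.
    unique_doc = [
        rid.split(f"_{src}_")[0]
        for rid, src in zip(flagged["id"], flagged["source"])
    ]
    return data["document_id"].isin(unique_doc)


# Made-up data: one frame without document IDs, one with them.
uspto_like = pd.DataFrame(
    {"id": ["US0000001;0001", "US0000002;0002"], "source": ["uspto", "uspto"]}
)
internal_like = pd.DataFrame(
    {
        "id": ["docA_eln_1_P1", "docA_eln_1_P3", "docB_eln_2_P1"],
        "source": ["eln", "eln", "eln"],
        "document_id": ["docA", "docA", "docB"],
    }
)

print(sibling_mask(uspto_like).tolist())     # [False, False]
print(sibling_mask(internal_like).tolist())  # [True, True, False]
```

With such a mask, the selection could keep a single expression, for example data[~sibling_mask(data)] combined with the other filters.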

SGenheden commented 5 months ago

Closing this. #19 merged.

MolecularAI / aizynthtrain

reaction_selection.py empty dataframe #17