Closed JeremyMolineau closed 5 months ago
I had trouble with the part of the reaction_selection.py file that removes reactions with more than 2 siblings (line 244). Running the script with validated_reactions.csv as input, it stopped because the generated dataframe has no unique() attribute.
* It seems that the pandas Series named "info" extracts nothing for the "P" pattern, because no reaction ID carries this identifier. As a result, unique() later fails for unique_doc because it is applied to an empty dataframe. I commented out these lines and the script then generated the validated_reactions_all.csv file. Is this due to a change in the USPTO IDs?
* Also, for "sib_sel = data["document_id"].isin(unique_doc)" (line 253), there is no "document_id" column in the input data. Was it intended for data from a source other than the USPTO?
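For reference, the failure can be reproduced with a small sketch. The IDs and column values below are made up for illustration and are not taken from validated_reactions.csv. It suggests that the regex matches nothing, the row selection is then empty, and `apply(..., axis=1)` on an empty selection hands back an empty DataFrame rather than a Series, which is why `unique()` is missing.

```python
import pandas as pd

# Hypothetical reaction IDs without a "_P<n>" suffix (assumed, for illustration only).
data = pd.DataFrame(
    {
        "id": ["US0001_rxn_1", "US0002_rxn_2"],
        "source": ["rxn", "rxn"],
    }
)

# Same extraction as in the script: nothing matches, so every entry is NaN.
info = data["id"].str.extract(r"_P(?P<product_no>\d)$", expand=False)
prod_val = info[~info.isna()].astype(int)  # empty Series

prod_sel = prod_val > 2
data_sel = data.index.isin(prod_val[prod_sel].index)  # all False

# On an empty selection, apply(axis=1) returns an empty DataFrame, not a Series.
result = data[data_sel].apply(
    lambda row: row["id"].split(f"_{row['source']}_")[0], axis=1
)

print(type(result).__name__)
print(hasattr(result, "unique"))
```

Calling `.unique()` on `result` here would raise an AttributeError, which matches the error reported above.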
I am facing a similar issue: it says "attribute unique does not exist", for the same reason you mentioned in the issue. Were you able to find a workaround for this? Were you able to successfully generate uspto_ringbreaker_template_library.csv?
I pushed a branch with a bugfix for this. This was intended for AstraZeneca specific data structures.
Check out the new branch
git pull
git checkout 17-bugfix-no-siblings
and please confirm whether this works. Then I will create a PR and push this to master.
Worked for me. Thank you for the quick response.
We tried your branch named "17-bugfix-no-siblings". There is still a small problem at line 273 of the reaction_selection.py file: in your modification, sib_sel is only assigned inside the if block, so it is undefined at line 273 when the condition is false.
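A minimal sketch of this kind of failure, independent of the actual script (the function names here are hypothetical): a name assigned only inside an `if` branch is unbound when the branch is skipped, and using it afterwards raises an error.

```python
def build_mask(has_sibling_info: bool) -> list:
    """Toy stand-in for the filtering step; not the real script."""
    if has_sibling_info:
        sib_sel = [False, True, False]
    # When the branch above is skipped, sib_sel was never assigned.
    return [not flag for flag in sib_sel]


def try_build(has_sibling_info: bool) -> str:
    """Report whether build_mask succeeds or which error it raises."""
    try:
        build_mask(has_sibling_info)
    except NameError as err:  # UnboundLocalError is a subclass of NameError
        return type(err).__name__
    return "ok"


print(try_build(True))
print(try_build(False))
```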
Ok. That's what happens if you try quick solutions without testing ;-) Can you suggest a code snippet that solves the problem?
Of course. Together with @SBC-ICOA, we suggest adding a flag (tag_sib):
From line 244:
info = data["id"].str.extract(r"_P(?P<product_no>\d)$", expand=False)
prod_val = info[~info.isna()].astype(int)
tag_sib = False
if len(prod_val) > 0 and "document_id" in data.columns:
    tag_sib = True
    prod_sel = prod_val > 2
    data_sel = data.index.isin(prod_val[prod_sel].index)
    unique_doc = (
        data[data_sel]
        .apply(lambda row: row["id"].split(f"_{row['source']}_")[0], axis=1)
        .unique()
    )
    sib_sel = data["document_id"].isin(unique_doc)
    print_(f"Removing {sib_sel.sum()} reactions with more than 2 siblings")
And from line 266:
if tag_sib:
    data = data[
        (~sel_likelihood)
        & (~sel_small)
        & (~sel_big)
        & (~sel_wildcard_atom)
        & (~sel_unchanged)
        & (~sel_radical)
        & (~unwanted_classes)
        & (~sel_cgr_creation)
        & (~sel_dynamic_bond)
        & (~sib_sel)
        & (~nrings_sel)
    ]
else:
    data = data[
        (~sel_likelihood)
        & (~sel_small)
        & (~sel_big)
        & (~sel_wildcard_atom)
        & (~sel_unchanged)
        & (~sel_radical)
        & (~unwanted_classes)
        & (~sel_cgr_creation)
        & (~sel_dynamic_bond)
        & (~nrings_sel)
    ]
We tested this on a local branch and it worked.
Thanks for the suggestion. I made a push with a slightly shorter solution.
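The pushed change itself is not reproduced in this thread. As an assumption only, one way such a shortening could look is to default sib_sel to an all-False mask, so that a single filter expression containing `& (~sib_sel)` works whether or not sibling information is present. The helper name `sibling_mask` and the example IDs below are hypothetical.

```python
import pandas as pd


def sibling_mask(data: pd.DataFrame) -> pd.Series:
    """Return True for rows to drop; all False when sibling info is absent."""
    # Default: nothing is removed (assumed design, not the merged code).
    sib_sel = pd.Series(False, index=data.index)

    info = data["id"].str.extract(r"_P(?P<product_no>\d)$", expand=False)
    prod_val = info.dropna().astype(int)

    if len(prod_val) > 0 and "document_id" in data.columns:
        many = prod_val[prod_val > 2]
        rows = data.loc[many.index]
        # Plain iteration avoids apply(axis=1) on a possibly empty selection.
        unique_doc = {
            rid.split(f"_{src}_")[0] for rid, src in zip(rows["id"], rows["source"])
        }
        sib_sel = data["document_id"].isin(unique_doc)
    return sib_sel


# USPTO-like input: no "_P<n>" suffix and no document_id column.
plain = pd.DataFrame({"id": ["US1_a", "US2_b"], "source": ["x", "x"]})

# Input with made-up IDs that do carry a product number and a document_id.
with_docs = pd.DataFrame(
    {
        "id": ["DOC1_az_7_P3", "DOC1_az_8_P1", "DOC2_az_9_P1"],
        "source": ["az", "az", "az"],
        "document_id": ["DOC1", "DOC1", "DOC2"],
    }
)

print(sibling_mask(plain).tolist())
print(sibling_mask(with_docs).tolist())
```

With a mask like this, the else branch with the duplicated filter would no longer be needed.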
Closing this. #19 merged.