MaterialEyes / exsclaim

A toolkit for the automatic construction of self-labeled materials imaging datasets from scientific literature
GNU General Public License v3.0

Non-deterministic behavior in caption assignment #5

Closed: trevorspreadbury closed this 3 years ago

trevorspreadbury commented 4 years ago

With spaCy 2.1, test_e2e.py succeeds and fails seemingly at random (all tests can be run with python -m unittest discover). Printing the DeepDiff of the expected and resulting JSON shows that the difference is in a single caption.

{'values_changed': {"root['s41467-018-06211-3_fig5.jpg']['master_images'][0]['caption'][0]": {'new_value': 'Precious metal dissolution tests in aluminum–air flow batteries (AAFBs) using the SMNp and Pt/C with 6 \u2009 M KOH electrolyte after 6\u2009h of discharging at 50 \u2009 mA \u2009 cm−2', 'old_value': 'c, d'}}}
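For anyone reproducing this without the test suite, here is a minimal stdlib-only sketch of the kind of comparison that yields a path like the one above. It is not DeepDiff itself and not the code in test_e2e.py; `changed_values`, the `"<missing>"` marker, and the sample data are hypothetical and only mimic the shape of the output.

```python
MISSING = "<missing>"  # hypothetical marker for a key present on one side only


def changed_values(expected, actual, path="root"):
    """Return {path: {'old_value': ..., 'new_value': ...}} for differing leaves."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        changes = {}
        for key in sorted(set(expected) | set(actual), key=str):
            changes.update(
                changed_values(
                    expected.get(key, MISSING),
                    actual.get(key, MISSING),
                    f"{path}[{key!r}]",
                )
            )
        return changes
    if (
        isinstance(expected, list)
        and isinstance(actual, list)
        and len(expected) == len(actual)
    ):
        changes = {}
        for index, (old, new) in enumerate(zip(expected, actual)):
            changes.update(changed_values(old, new, f"{path}[{index}]"))
        return changes
    if expected == actual:
        return {}
    # Leaves (or containers of different shape) that differ are reported whole.
    return {path: {"old_value": expected, "new_value": actual}}


# Illustrative data only; the caption strings are placeholders.
old = {"fig5.jpg": {"master_images": [{"caption": ["c, d"]}]}}
new = {"fig5.jpg": {"master_images": [{"caption": ["a longer caption"]}]}}
print(changed_values(old, new))
```

A single changed leaf produces a single entry keyed by its full path, which is why one flipped caption is enough to fail the whole end-to-end comparison.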

This refers to the following image, so neither suggested caption is especially good: (image not shown)

Running 'python -m unittest discover' repeatedly, I saw failure, failure, success, success, ... This could be caused by the issue described here. I tried upgrading to spaCy 2.3, where the issue may be resolved, but its en_core_web_sm model is different. The recommended course of action is to upgrade spaCy and see whether the results improve. Even if they are slightly worse, it may be better to stay up to date and eliminate the nondeterminism.
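To make the failure/success pattern easier to record, the repeated runs can be scripted. This is a rough sketch, assuming each run should happen in a fresh interpreter process (as when invoking the command by hand); `run_repeatedly` is a hypothetical helper, not part of exsclaim.

```python
import subprocess
import sys


def run_repeatedly(cmd, runs=5):
    """Run cmd in a fresh process `runs` times and return the exit codes."""
    return [
        subprocess.run(cmd, capture_output=True).returncode
        for _ in range(runs)
    ]


# Example, run from the repository root (exit code 0 means the tests passed):
# codes = run_repeatedly([sys.executable, "-m", "unittest", "discover"])
# print(["success" if code == 0 else "failure" for code in codes])
```

A mix of zero and nonzero exit codes across otherwise identical runs is the symptom described here.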

trevorspreadbury commented 3 years ago

This seems to be caused by https://github.com/explosion/spaCy/issues/3182. If the nondeterminism is a problem, you can use the solution listed there to set the random seed before using spaCy.
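A minimal sketch of what setting the seed could look like, assuming the fix amounts to seeding the random number generators before any spaCy call. `seed_everything` is a hypothetical helper; `spacy.util.fix_random_seed` is spaCy's own utility, and the imports are guarded so the snippet still runs where NumPy or spaCy is not installed.

```python
import random


def seed_everything(seed=0):
    """Seed the RNGs spaCy may draw from (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy

        numpy.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        from spacy.util import fix_random_seed

        fix_random_seed(seed)  # seeds the generators spaCy itself uses
    except ImportError:
        pass  # spaCy missing, or too old to provide fix_random_seed


# Call once, before loading the model or processing any text:
seed_everything(0)
```

In a test suite, the call would likely belong in a setUp method or at module import so that every run starts from the same state.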