biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

Improved Reproducibility #737

Open tokebe opened 12 months ago

tokebe commented 12 months ago

There has been plenty of investigation into BTE's result reproducibility due to the benchmarking push, but it should be organized under a single issue so that known information can be found easily and further discussion can be carried forward.

tokebe commented 12 months ago

Nondeterministic result sorting

When results are scored, they are sorted by score; however, two results with the same score are not guaranteed to be sorted in the same order across runs.

Similarly, when results are not scored, no sorting is done.

This has a simple(-ish) fix: prior to sorting by score, results should be sorted by a hash of the result, so that they always start from the same order. As long as the score sort is stable, results with the same score will then retain this hash order, so both the scored and unscored cases are covered.

This hash should combine all of the result's edge hashes, as well as all of its subclass edge hashes. That should be sufficient to ensure deterministic sorting.
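A minimal sketch of that fix, assuming a simplified result shape (the `edgeHashes`/`subclassEdgeHashes` fields are placeholders, not BTE's actual result structure):

```typescript
import { createHash } from "crypto";

// Hypothetical, simplified result shape; the real BTE result object differs.
interface TrapiResult {
  score?: number;
  edgeHashes: string[];          // hashes of all edges bound to the result
  subclassEdgeHashes: string[];  // hashes of any subclass edges used
}

// Combine all edge hashes and subclass edge hashes into one stable hash.
function resultHash(result: TrapiResult): string {
  const combined = [...result.edgeHashes, ...result.subclassEdgeHashes]
    .sort() // same set of edges -> same hash, regardless of insertion order
    .join("|");
  return createHash("sha256").update(combined).digest("hex");
}

function sortResultsDeterministically(results: TrapiResult[]): TrapiResult[] {
  const keyed = results.map((result) => ({ result, hash: resultHash(result) }));
  // 1. Fix a deterministic base order via the hash.
  keyed.sort((a, b) => a.hash.localeCompare(b.hash));
  // 2. Stable sort by score (descending); ties keep their hash order, and an
  //    entirely unscored result set simply stays in hash order.
  keyed.sort((a, b) => (b.result.score ?? 0) - (a.result.score ?? 0));
  return keyed.map(({ result }) => result);
}
```

This relies on `Array.prototype.sort` being stable, which it is in modern engines per the ES2019 spec.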

colleenXu commented 11 months ago

From Slack convo: it's not clear to me whether operations (QXEdges?) -> subqueries are being run in the same order (scrolling too?) across multiple runs.

Pasted Slack convo

Jackson Callaghan [5 days ago](https://suwulab.slack.com/archives/CC218TEKC/p1696441773553919?thread_ts=1696028537.101489&cid=CC218TEKC): Actually, wait. I don't think we need to worry about that one at all - if operations are run deterministically, which they should be, then even if further subqueries are skipped, the order of what's being skipped doesn't change. Whoops

colleenxu [5 days ago](https://suwulab.slack.com/archives/CC218TEKC/p1696466710625269?thread_ts=1696028537.101489&cid=CC218TEKC): Hmmmm, I'm wondering if we have confirmed/tested whether operations are run deterministically. Is there an easy way to review this, other than manually comparing logs?

Jackson Callaghan [4 days ago](https://suwulab.slack.com/archives/CC218TEKC/p1696510999407359?thread_ts=1696028537.101489&cid=CC218TEKC): There's not a very easy way to review that, no

Jackson Callaghan [4 days ago](https://suwulab.slack.com/archives/CC218TEKC/p1696511172100499?thread_ts=1696028537.101489&cid=CC218TEKC): But, if smartapi returns its list deterministically, it's a very good chance BTE will operate deterministically from there. We could hypothetically sort our operations somehow to ensure this.
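Following up on the last point in that thread, a hedged sketch of what sorting operations before execution could look like; the `PlannedOperation` shape and its field names are illustrative, not BTE's actual metadata model:

```typescript
// Illustrative shape for a planned subquery; not BTE's actual metadata model.
interface PlannedOperation {
  qEdgeID: string;     // the QEdge this operation serves
  apiName: string;     // SmartAPI-registered API name
  smartapiID: string;  // SmartAPI registry ID
  predicate: string;
}

// Sort planned operations by a stable composite key so that execution order
// does not depend on the order in which SmartAPI returns registrations.
function sortOperations(ops: PlannedOperation[]): PlannedOperation[] {
  const key = (op: PlannedOperation) =>
    [op.qEdgeID, op.apiName, op.smartapiID, op.predicate].join("|");
  return [...ops].sort((a, b) => key(a).localeCompare(key(b)));
}
```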

cmaceves commented 11 months ago

Hi, here is some evidence for issues related to the scoring of results: two diff-able files from runs of the DrugCentral_creative benchmark, without API overrides, both run on 10/09/23. The "SCORES" section is sorted so that node pairs common to both runs appear first, followed by node pairs unique to a single run; LOGS are at the end of each file. There are clearly API failures, and I'm looking into how those correspond to the varying scores, because it's not clear to me whether the scoring itself or the API failures are at fault.

diff_10_09_run_1.txt
diff_10_09_run_2.txt
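For reference, a hypothetical sketch of how a diff-able SCORES section like this can be generated, so the two files line up on their common node pairs:

```typescript
type ScoreMap = Record<string, number>; // "subjectID--objectID" -> score

// Emit one run's SCORES lines: node pairs shared with the other run first
// (in a fixed order), then pairs unique to this run. Generating this for
// each run and diffing the outputs highlights only score changes and
// unique pairs.
function scoresSectionFor(run: ScoreMap, other: ScoreMap): string[] {
  const common = Object.keys(run).filter((k) => k in other).sort();
  const unique = Object.keys(run).filter((k) => !(k in other)).sort();
  return [...common, ...unique].map((k) => `${k}\t${run[k]}`);
}
```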

andrewsu commented 11 months ago

Copying from @ngk5's Slack comment:

Hi all, I cleaned up my notebook analyzing BTE’s variability in scoring and I’ve uploaded it to this repository along with the relevant raw results files (in case anyone wants to look further into my particular example). After discussion with @andrewsu, these are my conclusions/thoughts:

  • Edges and auxiliary graphs that are unique to a single run are only referenced in the KGs of that run's unique results; thus, variability in scoring is not due to additional edges found in one run but not the other.
    • Nondeterministic result sorting of ties doesn't seem responsible for the score variability, but may explain the unique results (especially at the 500-result truncation cutoff).
  • There are many examples where results between the two runs are identical (same edge bindings, auxiliary graphs, and edges within that auxiliary graph) except for the score (see the sketch after this list). I am leaning towards this being caused by changes in the scoring method used (e.g. NGD scoring), since these changes are not reliant on the components of the results KGs, which seem stable.
  • My next steps would be to look further into possible silent API call failures, NGD scoring, and any cases of (non-unique) edges being referenced in different results KGs across the runs.
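A hedged sketch of the kind of check described in the second bullet: keying results by their structural content and reporting pairs that are identical apart from the score. The `key` field here is assumed to be a hash of edge bindings plus auxiliary-graph edges; the shape is illustrative only.

```typescript
// Illustrative: a result reduced to a structural key plus its score.
interface KeyedResult {
  key: string;   // e.g. hash of edge bindings + auxiliary-graph edges
  score: number;
}

// Return results present in both runs with identical structure but
// different scores, along with both scores for comparison.
function scoreOnlyDifferences(
  runA: KeyedResult[],
  runB: KeyedResult[]
): { key: string; scoreA: number; scoreB: number }[] {
  const scoresB = new Map(runB.map((r) => [r.key, r.score]));
  return runA
    .filter((r) => scoresB.has(r.key) && scoresB.get(r.key) !== r.score)
    .map((r) => ({ key: r.key, scoreA: r.score, scoreB: scoresB.get(r.key)! }));
}
```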