NCATSTranslator / Tests

Noticed that all tools are failing some tests #32

Open colleenXu opened 6 months ago

colleenXu commented 6 months ago

In the latest run (3/10), I noticed that 24 tests were not passed by any tool. These tests may warrant a closer look to figure out what's going on.

| Case Number | Asset Number | Name |
| --- | --- | --- |
| 8 | 8 | TopAnswer: fingolimod_treats_Multiple_Sclerosis |
| 8 | 9 | TopAnswer: natalizumab_treats_Multiple_Sclerosis |
| 21 | 14 | TopAnswer: interferon-beta_1a_treats_Multiple_Sclerosis |
| 19 | 16 | TopAnswer: desferrioxamine_treats_Aceruloplasminemia |
| 32 | 32 | TopAnswer: Lactase_treats_Lactose_intolerance |
| 12 | 69 | Acceptable: Soot_treats_Obstructive_Sleep_Apnea |
| 23 | 73 | TopAnswer: Mipomersen_treats_Homozygous_Familial_Hypercholesterolemia |
| 0 | 79 | TopAnswer: progestin_treats_Premature_Menopause |
| 8 | 116 | Acceptable: Anticholinergic_agents_treats_Multiple_Sclerosis |
| 9 | 180 | Acceptable: Pegloticase_treats_Gout |
| 20 | 183 | NeverShow: Riley-day_Syndrome_treats_Hereditary_Sensory_And_Autonomic_Neuropathy |
| 5 | 186 | TopAnswer: Insulin_human_treats_Diabetes_Mellitus |
| 34 | 262 | TopAnswer: AZD_3355_treats_Gastroesophageal_Reflux_Disease |
| 34 | 264 | TopAnswer: Talcid_treats_Gastroesophageal_Reflux_Disease |
| 19 | 303 | TopAnswer: deferipone_treats_Aceruloplasminemia |
| 19 | 304 | TopAnswer: deferasirox__treats_Aceruloplasminemia |
| 21 | 306 | TopAnswer: interferon-beta_1b_treats_Multiple_Sclerosis |
| 37 | 315 | TopAnswer: Fluticasone_treats_Asthma |
| 37 | 316 | Acceptable: Albuterol_(salbutamol)_treats_Asthma |
| 39 | 322 | Acceptable: Erythromycin_treats_Idiopathic_bronchiectasis |
| 40 | 325 | Acceptable: Fostamatinib_treats_Idiopathic_pulmonary_fibrosis |
| 40 | 326 | Acceptable: Acceptable: GLPG1690_(Ziritaxestat)_treats_Idiopathic_pulmonary_fibrosis |
| 41 | 328 | Acceptable: Ridaforolimus_treats_Lymphangioleiomyomatosis |
| 42 | 330 | Acceptable: Ensifentrine_treats_Primary_ciliary_dyskinesia |

colleenXu commented 6 months ago

Addition! For TestCase_26, it looks like the input curie has an extra trailing space: "input_curie": "MONDO:0018958 ". And the corresponding assets (174-177) all have this extra space.

This probably explains why the tools are skipping the query, returning no results, or throwing errors for all of these assets.
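
A quick check along these lines could flag such assets before a run. This is only a sketch: it assumes each asset is available as a dict with `input_curie` and `output_curie` fields (the field names seen in the test runner logs), and the `id` key, the asset ids, the `EXAMPLE:0000001` placeholder, and the function name `find_untrimmed_curies` are all made up for illustration.

```python
def find_untrimmed_curies(assets, fields=("input_curie", "output_curie")):
    """Return (asset_id, field, value) for every CURIE with stray whitespace."""
    problems = []
    for asset in assets:
        for field in fields:
            value = asset.get(field)
            # A CURIE that changes when stripped has leading/trailing whitespace.
            if isinstance(value, str) and value != value.strip():
                problems.append((asset.get("id"), field, value))
    return problems


# Hypothetical asset records; only the MONDO/MESH CURIEs come from this thread.
assets = [
    {"id": "Asset_174", "input_curie": "MONDO:0018958 ", "output_curie": "EXAMPLE:0000001"},
    {"id": "Asset_69", "input_curie": "MONDO:0007147", "output_curie": "MESH:D053260"},
]

for asset_id, field, value in find_untrimmed_curies(assets):
    print(f"{asset_id}: {field} = {value!r}")
```

Printing with `!r` keeps the trailing space visible in the output, which is easy to miss otherwise.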

maximusunc commented 6 months ago

@sierra-moxon could this issue potentially be moved to the feedback repo for TAQA?

jaredroach commented 6 months ago

See some of the other issues for specific explanations of some of these universal failures, e.g. https://github.com/NCATSTranslator/Tests/issues/93 for Asset NCATSTranslator/Feedback#79.

jaredroach commented 6 months ago

Re: Asset NCATSTranslator/Feedback#69 "Acceptable: Soot_treats_Obstructive_Sleep_Apnea"

  1. Why is this acceptable? Shouldn't it be "Never Show"? Where is the research that shows soot treats OSA?? Or other inference? If anything, soot causes OSA. Currently the answer from Translator for "What drugs treat OSA" does not include soot, so I cannot query the provenance via Translator. https://ui.ci.transltr.io/main/results?l=Obstructive%20Sleep%20Apnea%20Syndrome&i=MONDO:0007147&t=0&r=0&q=670443ba-f001-450b-9709-c2d9d54c7d65

  2. Asset NCATSTranslator/Feedback#69 is a duplicate of Asset NCATSTranslator/Feedback#68 (also "Acceptable: Soot_treats_Obstructive_Sleep_Apnea"), which a number of Tools are passing.

  3. This is the Information Radiator description for Asset NCATSTranslator/Feedback#69: https://informationradiator.renci.org/test-runs/37/tests/5975#log-0

     Calling ARS Test Runner with: { "environment": "ci", "predicate": "treats", "runner_settings": [ "inferred" ], "expected_output": "Acceptable", "input_curie": "MONDO:0007147", "output_curie": "MESH:D053260" }

  4. This is the (blank!) Information Radiator description for Asset NCATSTranslator/Feedback#68: https://informationradiator.renci.org/test-runs/37/tests/5974#log-0

     "No logs"

Maybe some tools are passing NCATSTranslator/Feedback#68 b/c it is blank??

maximusunc commented 6 months ago

@jaredroach The difference between Asset NCATSTranslator/Feedback#68 and NCATSTranslator/Feedback#69 is that the former is NeverShow and the latter is Acceptable. Both are listed because either could be perceived as "correct" by different user personas. An argument could be made that the NeverShow test should never happen, but that is more a question for TAQA.

The Information Radiator is not without its bugs, and that is what is happening with the "No logs". Unless something catastrophic has happened with the tests, there should always be some logs, but you just have to refresh the page if it ends up showing you that message.

jaredroach commented 6 months ago

@maximusunc If I understand it correctly, a Tool will fail an "Acceptable" test if it does not report the result in the top 50%. That means that Tools that do not report "soot" as a top-50% treatment for OSA are going to get dinged, which is definitely not the intent of these Tests. We should delete Asset NCATSTranslator/Feedback#69.

- 1_TopAnswer: The expected output is in the top 10% of results
- 2_Acceptable: The expected output is in the top 50% of results
- 3_BadButForgivable: The expected output is either not present or in the bottom 50% of results
- 4_NeverShow: The expected output is not in the results
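
To make the concern concrete, here is a rough sketch of how I read those four rules. It is not the actual test runner code: the function name `passes`, the use of a 1-based rank divided by the total result count, the handling of the exact 10% and 50% boundaries, and the labels without their numeric prefixes are all my assumptions.

```python
from typing import Optional

LABELS = ("TopAnswer", "Acceptable", "BadButForgivable", "NeverShow")


def passes(expected_output: str, rank: Optional[int], total: int) -> bool:
    """Apply the four rules to one result.

    rank is the 1-based position of the expected output in a tool's
    results, or None if the tool did not return it at all.
    """
    if expected_output not in LABELS:
        raise ValueError(f"unknown expected_output: {expected_output}")
    if rank is None:
        # Absent results only satisfy the two "should not be prominent" labels.
        return expected_output in ("BadButForgivable", "NeverShow")
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    fraction = rank / total  # assumed definition of "top N%"
    if expected_output == "TopAnswer":
        return fraction <= 0.10
    if expected_output == "Acceptable":
        return fraction <= 0.50
    if expected_output == "BadButForgivable":
        return fraction > 0.50
    return False  # NeverShow fails whenever the output is present


# A tool that never returns soot for OSA:
print(passes("Acceptable", None, 200))  # False
print(passes("NeverShow", None, 200))   # True
```

Under this reading, a tool that leaves soot out entirely fails the Acceptable asset and passes the NeverShow one, so the two assets cannot both be passed by the same result list.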

sandrine-muller-research commented 5 months ago

Hi @jaredroach. The reason I put soot treats sleep apnea as a mechanistic acceptable answer is that soot is likely one of the causes, so by doing operations between different results (including the cause) I am likely to get better results for the graph that I would be looking for.

jaredroach commented 5 months ago

@sandrine-muller-research OK. But you are not disagreeing with me that Asset NCATSTranslator/Feedback#69 should be deleted, are you? We don't want to penalize tools that don't report soot.

sandrine-muller commented 5 months ago

When this sheet was made, the rule "2_Acceptable: The expected output is in the top 50% of results" was not defined like that. If my memory serves me well, we had another definition of Acceptable, which depended on the persona. I will remove those lines and build another suite from them that will be focused on user preference and may have inconsistent results depending on the persona. Those tests will be different from pass/fail tests.