ExposuresProvider / icees-api

MIT License
2 stars 8 forks source link

Duplication/Configuration Issues Related to ICEES+ Asthma & ICEES DILI, Development & Production Instances #207

Closed karafecho closed 2 years ago

karafecho commented 2 years ago

This issue is focused primarily on issues related to the ICEES+ Asthma and ICEES+ DILI development and production instances, with the December 2021 Translator Demo in mind.

The results of various tests can be found here.

A summary of testing efforts follows below (from a Slack exchange between Kara, Kenny, and Max).


Okay, I just ran what I think is the final definitive test.

I looked up identifiers for DILIDx and AsthmaDx and also SexDILI and Sex2, using all four ICEES instances. DILIDx and SexDILI should only return responses from ICEES DILI; AsthmaDx and Sex2 should only return responses from ICEES Asthma.

Instead, all four ICEES instances returned responses to all four queries. The ICEES DILI and ICEES Asthma production instances returned incorrect identifiers, presumably because they are pointing to the outdated config files. The ICEES DILI and ICEES Asthma development instances returned correct identifiers, presumably because they are pointing to the ground truth config files.

Kenny: I bet if you ran the same asthma and DILI tests but pointed the asthma test at the DILI prod instance and the DILI test at the asthma prod instance, you'd receive the same results.

The actual data that are being returned to each query of dev and prod are correct. As noted above, but perhaps more firmly here, I'm proposing that we merge the asthma & DILI, dev & prod instances and expose one ICEES dev instance and one ICEES prod instance, using the ground truth config files.

karafecho commented 2 years ago

Regarding my proposal, and looking ahead to new use cases, a merger of all use cases into one ICEES dev instance and one ICEES prod instance will only work if we are very careful about distinguishing feature variable names. For example, "Race" can be reused for all UNC Health data, but not for DILI Network data or EPR data. But yet, some variables are intended for reuse across use cases. For example, the exposures variables.

Related, and as we've discussed, we also will need to rethink the use of SQLite as the ICEES db, given the limitations on column headers (2000). Hao's Dhall proposal might be the right solution, in terms of scalability, but I'm not qualified to make that decision.

karafecho commented 2 years ago

Related is issue #182.

karafecho commented 2 years ago

Note that my proposal would simplify #175 and #203.

karafecho commented 2 years ago

Update:

16339 - ICEES asthma dev, 16340 - copy ICEES asthma dev, 16341 - ICEES DILI dev, 16342 - ICEES asthma prod, 16343 - ICEES DILI prod

  1. Ground truth feature variables (names and identifiers) are being returned for both asthma and DILI. This should not be the case, but reflects the fact that a single set of YAML config files supports both ICEES instances. [Ground-truth tests focused on Sex2, AsthmaDx, PrednisoneRx (asthma variables) and SexDILI, DILIDx, PrednisoneOrCorticosteroids_Rx (DILI variables).]
  2. The non-Translator, non-KG functionalities in ICEES asthma dev return data for asthma only and those in ICEES DILI dev return data for DILI only.
  3. Some of the non-Translator, non-KG functionalities in ICEES asthma dev and ICEES DILI dev appear to be broken, specifically, "features" and "association to all features" (see example below for "association to all features" with Sex2), but this is likely a byproduct of the shared YAML config files (i.e., the functionality requires responses to variables not present in the underlying data).
  4. The one-hop KG queries are timing out after spinning for a long time when run through the Swagger interface, but not by way of direct CURLS. So, while Swagger had been reliable, it no longer is.
  5. ICEES asthma dev is responding to the B.1 one-hop query for MONDO:0005359 (DILI) and the C.2 one-hop queries for MONDO:0013433 (primary sclerosing cholangitis) and MONDO:0005388 (primary biliary cirrhosis). This should not be the case, as only ICEES DILI dev should be responding to these queries. This suggests that either the KG endpoint is not configured properly or the shared config files are breaking the KG endpoint.
  6. Related to (5), and according to the Dec demo workflow tracker, ICEES asthma and ICEES DILI are returning identical results to B.1. Both instances have (in the recent past) returned identical results to C.1 also, but timed out in today's test. Not too long ago, ICEES DILI was the only ICEES instance to respond to B.1, as it should be, but this appears to have broken when we deployed new instances using the ground-truth config files.

https://files.slack.com/files-pri/T0450N0TQ-F02NM1XBF52/image.png

https://files.slack.com/files-pri/T0450N0TQ-F02NM1XBF52/image.png

karafecho commented 2 years ago

Next steps:

  1. Kara will split the current ground-truth all_features and identifiers YAML files into four sets of files, one per use case, beginning with DILI, then asthma, then COVID, then PCD.
  2. Kara will add biolink:SmallMolecule to relevant variables in each set of YAML files, hopefully, programmatically.
  3. Hong will redeploy the ICEES asthma dev instance and the ICEES DILI dev instance at their respective ports after (1) and (2) are complete.
  4. Kenny will review the code used to support the KG one-hop endpoint.
  5. Kenny will update the SmartAPI registrations after (1-4) are complete and also take down any ICEES instances that are either broken or not needed.
karafecho commented 2 years ago

@hyi @kennethmorton @maximusunc : I ran a few KG one-hop tests using the three CURIES that only ICEES DILI should respond to. I also viewed Max's test results, using both QGraph and the ARAX viewer. I'm pretty sure that the KG one-hop endpoint is running pairwise correlations based on biolink:XXX categories/types, not the underlying data or input CURIE(s).

I think that Hao's 'master plan' was to create a set of three YAML config files and then run code to break them into Dhall files, with me selecting variables for each ICEES instance, like this one. What he may have overlooked in the interim is a need for a fourth YAML config file to map between biolink:XXX and csv column headers.

Regardless of the above, I think that fixes (1-5) above should resolve the issue in the near term, with an implementation of our planned redesign needed for the longer term.

karafecho commented 2 years ago

Issue resolved by splitting master ground truth all_features and identifiers YAML files into separate sets of files for each use case: asthma, dili, pcd, and covid. Completed for asthma, dili, and pcd.

karafecho commented 2 years ago

ICEES asthma prod and ICEES DILI prod deployed and tested.

karafecho commented 2 years ago

Resolved.