CDCgov / phdi

https://cdcgov.github.io/dibbs-site/
Creative Commons Zero v1.0 Universal
35 stars 14 forks source link

SPIKE: More robust way to retrieve data from the TCR #2608

Closed m-goggins closed 1 month ago

m-goggins commented 2 months ago

Because all migrations in flyway run one after another, we are currently importing all data from the TCR in migration 2 using a SQL script that is 175k lines long. We should consider looking into alternative methods to ensuring access to data in the TCR. This could be other forms of porting over the codes from the TCR or reconsidering whether we should spin up the TCR as a separate service and make calls to that service.

robertandremitchell commented 1 month ago

if we were to port over the code, with the new repo in mind, would we use the dibbs.cloud/ endpoint potentially for parsing the data we get back? Right now, the eRSD:

  1. Load eRSD data json from APHL API.
  2. Post eRSD data json to message-parser endpoint to get parsed data.
  3. Use parsed data to create list of lists for each table.
    • One addition we made last time to refactor / get all the data we need was to read in the GEMs files here to go from ICD-9 to ICD-10. Will we need/want this as well?
    • Another recent addition was adding the VSAC API to get the version of each valueset and conditions, which are not natively present in the eRSD table. At least as currently written, this greatly bloats the time to build the tables from a job that took seconds to a job that takes 5-10 minutes, last time I ran it, because it is making like ~1500 of calls to the API to get that data. That is probably fine for an initial seeding of the database, maybe? But I wanted to raise as a potential issue with rebuilding tables exclusively from API calls.
  4. Delete existing tables in sqlite database.
    • How might we handle this? We will want, in theory, a way to load new APHL data and not start from scratch, but rather have a new version of their data, right? Perhaps relatedly, but how might we handle a scenario where a user uploads a OID, the new version of the eRSD then adds the OID, creating a conflict? Probably thornier than we need to think about rn.
  5. Create tables and add indexes.
  6. Insert data into tables.
robertandremitchell commented 1 month ago

I'm pretty on the fence about whether to port the codes over vs. spinning up TCR. I think the benefit of spinning up TCR is that if other teams are using TCR, maintaining it is potentially easier. However, we already have a handful of custom codes that get away from the eRSD (newborn screening, added medication codes, etc.) that at least in the short run there will still be the need within TCR for some hardcoding.

soooo with that in mind, I come back around toward doing something specific to the query connector, whether that's writing something specific in typescript or porting over the python.

We will need some kind of persistent logging of eRSD versions, I think? Is the use case potentially different where we would only want the latest? in which case we could just use the API.