Nelly-Barret / BETTER-fairificator

The fairification tools for BETTER project.
https://www.better-health-project.eu/

Insert/retrieve Resource instances in/from the database #8

Open Nelly-Barret opened 1 month ago

Nelly-Barret commented 1 month ago

Following #3, I am able to upsert one or many Resource instances. Now, the ETL needs to know what to do with its own (Python) Resource instances vs. what is in the database.

The idea is to upsert the Resource instance, i.e., insert it if it does not exist and do nothing otherwise. However, this may cause a discrepancy between the objects living in memory and the database. To avoid this, I think it is more reasonable to:

  1. call the upsert_one_tuple() method with the corresponding Resource instances
  2. if the Resource instance is inserted, this means that there was no Resource instance like this in the database and we inserted exactly what we have in memory, thus we are good (nothing more to do)
  3. if the Resource instance is NOT inserted, this means that this Resource instance already exists in the database. Thus, to be sure that what we have in the DB is exactly what we have in memory, we retrieve the matched Resource instance and load it in memory.
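
The three steps above can be sketched as follows. This is a minimal sketch, assuming `collection` behaves like a PyMongo `Collection` (only `update_one` and `find_one` are used); the helper name `upsert_or_load` and the `unique_filter` / `resource_doc` arguments are hypothetical and not taken from the project's code.

```python
def upsert_or_load(collection, unique_filter, resource_doc):
    """Insert resource_doc if nothing matches unique_filter, else load the match.

    Returns a pair (document, inserted). `collection` is assumed to behave
    like a pymongo Collection; the function name is hypothetical.
    """
    # Step 1: upsert. $setOnInsert only writes when no document matches,
    # so an existing document is left untouched.
    result = collection.update_one(
        unique_filter, {"$setOnInsert": resource_doc}, upsert=True
    )
    # Step 2: upserted_id is set only when a new document was created,
    # in which case the database holds what we have in memory.
    if result.upserted_id is not None:
        return resource_doc, True
    # Step 3: a document already existed, so load the stored version.
    return collection.find_one(unique_filter), False
```

Note that this version needs two round trips when the instance already exists, and the two calls are not atomic.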

Nelly-Barret commented 1 month ago

For 3., this cannot be done with a simple upsert because the upsert does not return the matched tuple(s). Instead, one can rely on findAndModify (https://stackoverflow.com/questions/16358857/mongodb-atomic-findorcreate-findone-insert-if-nonexistent-but-do-not-update) with upsert=True. In PyMongo, this seems to be find_one_and_update(...).
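
A minimal sketch of this find-or-create idea, assuming `collection` behaves like a PyMongo `Collection`. It relies on the default behaviour of `find_one_and_update`, which returns the document as it was before the update, hence `None` when the upsert had to create it. The helper name `find_or_create` and its arguments are hypothetical.

```python
def find_or_create(collection, unique_filter, resource_doc):
    """Atomically return the stored document, creating it if it is missing.

    Returns a pair (document, inserted). `collection` is assumed to behave
    like a pymongo Collection; the function name is hypothetical.
    """
    # With the default return_document setting, the pre-update document is
    # returned: None means nothing matched and the upsert inserted a new one.
    existing = collection.find_one_and_update(
        unique_filter,
        {"$setOnInsert": resource_doc},  # written only on insert
        upsert=True,
    )
    if existing is None:
        # Inserted: the in-memory data is what was written
        # (the database-assigned _id is not included here).
        return resource_doc, True
    # Found: the stored document is returned unchanged.
    return existing, False
```

The returned flag also gives the status (found vs. inserted) without a second query.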

Nelly-Barret commented 1 month ago

findAndModify(...) can:

Maybe we can use the chain of Bulk to get the status (found but not updated vs. not found thus inserted) of the returned element?

Nelly-Barret commented 1 month ago

After some thinking, I realized that I need to distinguish two things:

For what makes a resource unique, we need to specify in the filter of the update/upsert operations the set of fields that make an instance unique based on its type.
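
As an illustration, the per-type uniqueness filter could be built like this. This is only a sketch: the `UNIQUE_FIELDS` mapping, its field names, and `build_unique_filter` are hypothetical and do not reflect the project's actual schema.

```python
# Hypothetical mapping from resource type to the fields that make an
# instance unique; the field names are illustrative only.
UNIQUE_FIELDS = {
    "Hospital": ("name",),
    "Patient": ("identifier",),
    "Examination": ("code", "display"),
}


def build_unique_filter(resource_type, resource_doc):
    """Build the filter for update/upsert operations from the unique fields."""
    fields = UNIQUE_FIELDS[resource_type]
    missing = [f for f in fields if f not in resource_doc]
    if missing:
        raise ValueError(f"{resource_type} is missing unique fields: {missing}")
    # Only the unique fields go in the filter; other fields are ignored.
    return {f: resource_doc[f] for f in fields}
```

For example, `build_unique_filter("Hospital", {"name": "H1", "city": "X"})` keeps only the `name` field in the filter.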

For what is used in the references, this is (so far) the array of identifiers, each composed of a value and a use. As there is always one identifier for a single instance, the array seems useless; the use also seems useless, as we know that only Patient and Sample instances have IDs assigned by hospitals, while the others have an ID assigned by the code. So, the identifier could be simplified to a single string. Does FHIR accept this? By default no (it expects from 0 to N Identifiers), but we may be able to extend this and set it as a string. Using a single string to identify resource instances is much more lightweight and will be easier with the revised way to insert/retrieve data from the database (see below).

Revised pipeline to insert/retrieve data from the database:

  1. Create the Hospital instance and upsert it
  2. Retrieve the identifier of the hospital instance
  3. Create Examination instances, and every 1K upsert them (and delete them from the in-memory arrays)
  4. Retrieve the identifiers of all the Examination instances.
  5. Create the Sample instances, and every 1K upsert them (...)
  6. Retrieve ...
  7. Create the Patient instances, and every 1K upsert them (...)
  8. Retrieve ...
  9. Create the ExaminationRecord instances, and every 1K upsert them (...). Do NOT retrieve them as they are never used as references later in the pipeline.
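
The "every 1K upsert them" part of the steps above could look like the sketch below. It assumes a caller-supplied `upsert_many` callback that writes one batch to the database; that callback, `BATCH_SIZE`, and `upsert_in_batches` are hypothetical names.

```python
BATCH_SIZE = 1000  # "every 1K" in the pipeline above


def upsert_in_batches(instances, upsert_many, batch_size=BATCH_SIZE):
    """Buffer instances and hand them to upsert_many one batch at a time.

    `upsert_many` is a hypothetical callback that upserts a list of
    instances. Returns the total number of instances handed over.
    """
    buffer = []
    total = 0
    for instance in instances:
        buffer.append(instance)
        if len(buffer) >= batch_size:
            upsert_many(buffer)
            total += len(buffer)
            buffer = []  # drop the flushed instances from memory
    if buffer:
        # Flush the last, possibly incomplete, batch.
        upsert_many(buffer)
        total += len(buffer)
    return total
```

Because `instances` can be any iterable, it can be a generator that creates the instances lazily from the CSV rows, so at most one batch is held in memory.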

Side note: Neither the code nor the database will raise an error if the referred instance does not exist, because (i) the code does not check for it, so in principle, one could insert the data in any order; and (ii) MongoDB does NOT enforce foreign key/primary key relationships (it encourages putting related data into a single collection rather than spreading it over multiple collections linked by foreign keys).

Nelly-Barret commented 1 month ago

The above pipeline requires us to maintain in-memory maps from the CSV elements to the actual IDs they have been assigned.
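
One possible shape for such maps, as a sketch only: the class name `IdentifierRegistry`, its methods, and the example keys are hypothetical, and the real code in the pull request may be organized differently.

```python
class IdentifierRegistry:
    """In-memory maps from CSV elements to the identifiers assigned to them."""

    def __init__(self):
        # {resource_type: {csv_key: identifier}}
        self._ids = {}

    def register(self, resource_type, csv_key, identifier):
        """Record the identifier retrieved from the database for a CSV element."""
        self._ids.setdefault(resource_type, {})[csv_key] = identifier

    def resolve(self, resource_type, csv_key):
        """Return the identifier to use in a reference, or raise KeyError."""
        try:
            return self._ids[resource_type][csv_key]
        except KeyError:
            raise KeyError(
                f"no {resource_type} identifier recorded for {csv_key!r}"
            ) from None
```

Raising on an unknown key gives the code a reference check that the database itself does not perform.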

Nelly-Barret commented 4 weeks ago

This is coded at https://github.com/Nelly-Barret/BETTER-fairificator/pull/14

I will have to do the same for Disease instances when I integrate them into the pipeline.