Insert/retrieve Resource instances in/from the database

Nelly-Barret commented 1 month ago

Following #3, I am able to upsert one or many Resource instances. Now, the ETL needs to know what to do with its own (Python) Resource instances vs. what is in the database.

The idea is to upsert the Resource instance, i.e., insert it if it does not exist and do nothing otherwise. However, this may cause a discrepancy between the objects living in memory and the database. To avoid this, I think it is more reasonable to:

1. Call the upsert_one_tuple() method with the corresponding Resource instance.
2. If the Resource instance is inserted, there was no such Resource instance in the database and we inserted exactly what we have in memory, so we are good (nothing more to do).
3. If the Resource instance is NOT inserted, it already exists in the database. To be sure that what we have in memory is exactly what is in the DB, we retrieve the matched Resource instance and load it in memory.
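
The three steps above can be sketched as follows. This is only a minimal sketch, not the project's upsert_one_tuple(): the helper name `upsert_or_load`, the `unique_fields` parameter, and the plain-dict representation of a Resource are assumptions, and `collection` is assumed to behave like a PyMongo `Collection`.

```python
from typing import Any, Iterable


def upsert_or_load(collection: Any, resource: dict, unique_fields: Iterable[str]) -> dict:
    """Insert `resource` if no match exists; otherwise return the stored document.

    Hypothetical helper: `collection` is assumed to be a pymongo Collection and
    `unique_fields` the field names that identify the resource.
    """
    flt = {field: resource[field] for field in unique_fields}
    # $setOnInsert writes only when the upsert creates a document,
    # so an existing match is left untouched ("do nothing otherwise").
    result = collection.update_one(flt, {"$setOnInsert": resource}, upsert=True)
    if result.upserted_id is not None:
        # Step 2: nothing matched, so the database now holds what is in memory.
        return resource
    # Step 3: a match already existed; load it so memory mirrors the database.
    return collection.find_one(flt)
```

Note that the "already exists" branch costs a second round trip (`find_one`) after the upsert.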

Nelly-Barret commented 1 month ago

For 3., this cannot be done with a simple upsert because the upsert does not return the matched tuple(s). Instead, one can rely on findAndModify (https://stackoverflow.com/questions/16358857/mongodb-atomic-findorcreate-findone-insert-if-nonexistent-but-do-not-update) with upsert=True. In PyMongo, this seems to be find_one_and_update(...).

Nelly-Barret commented 1 month ago

findAndModify(...) has two limitations:

- It can be used for single updates only, as it does not exist as a class in PyMongo. An alternative is to chain the operations as shown in https://www.mongodb.com/docs/manual/reference/method/Bulk.find.upsert/#Bulk.find.upsert: Bulk.find(<query>).upsert().updateOne(<update>);
- It does not say whether the returned instance has been updated or is the newly created one.
  - It is suggested to keep track of the resource version to identify whether this is a newly inserted instance or an updated one (in order to avoid a second call to look for the resource to know whether it exists or not): https://stackoverflow.com/questions/57902393/pymongo-identify-if-document-returned-by-find-one-and-update-is-updated-or-inse
  - Using a timestamp is also suggested here: https://www.codemzy.com/blog/mongodb-findoneandupdate-with-upsert
  - In any case, I cannot use a simple return_document=BEFORE because it will not return the inserted document; nor can I use a simple return_document=AFTER because it does not say whether the returned instance is an update or an insert.

Maybe we can use the chain of Bulk to get the status (found but not updated vs. not found thus inserted) of the returned element?
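
One way to apply the version/timestamp idea from the links above without modifying existing documents is to write a per-call marker only on insert. The sketch below is a variant of that idea, not something taken from the linked answers: the `insert_token` field and both helper names are hypothetical, and the PyMongo call itself is shown only in comments.

```python
import uuid
from typing import Iterable, Tuple


def build_find_or_create(resource: dict, unique_fields: Iterable[str]) -> Tuple[dict, dict, str]:
    """Build (filter, update, token) for a find-or-create call.

    `insert_token` is a hypothetical marker field, written only on insert.
    """
    token = uuid.uuid4().hex
    flt = {field: resource[field] for field in unique_fields}
    # $setOnInsert leaves an already-matched document unchanged.
    update = {"$setOnInsert": {**resource, "insert_token": token}}
    return flt, update, token


def was_inserted(returned_doc: dict, token: str) -> bool:
    """True if the returned document was created by the call that owns `token`."""
    return returned_doc.get("insert_token") == token


# Intended use with PyMongo (not executed here):
#   from pymongo import ReturnDocument
#   flt, update, token = build_find_or_create(resource, ["name"])
#   doc = collection.find_one_and_update(
#       flt, update, upsert=True, return_document=ReturnDocument.AFTER)
#   inserted = was_inserted(doc, token)
```

The cost is one extra field stored in every document; whether that is acceptable for the Resource instances is left open here.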

Nelly-Barret commented 1 month ago

After some thinking, I realized that I need to distinguish two things:

- what is used to make a resource unique (see #3)
- what is used in the references when building new instances referring to existing instances

For what makes a resource unique, we need to specify in the filter of the update/upsert operations the set of fields that make an instance unique based on its type.
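
As an illustration, such a type-dependent filter could be built as below. The field names in `UNIQUE_FIELDS` are placeholders chosen for the example (the actual sets are those decided in #3), and `build_unique_filter` is a hypothetical helper.

```python
from typing import Dict, List

# Hypothetical mapping: the real unique fields per type are the ones decided in #3.
UNIQUE_FIELDS: Dict[str, List[str]] = {
    "Hospital": ["name"],
    "Examination": ["code"],
    "Patient": ["identifier"],
}


def build_unique_filter(resource_type: str, resource: dict) -> dict:
    """Return the filter selecting the stored instance equal to `resource`."""
    try:
        fields = UNIQUE_FIELDS[resource_type]
    except KeyError:
        raise ValueError(f"No unique fields declared for {resource_type!r}") from None
    # Only the identifying fields go into the filter, not the whole document.
    return {field: resource[field] for field in fields}
```

The resulting dict is what would be passed as the filter argument of the update/upsert call.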

For what is used in the references, this is (so far) the array of identifiers, each composed of a value and a use. As there is always one identifier for a single instance, the array seems useless; the use also seems useless, as we know that only Patient and Sample instances have IDs assigned by hospitals, while the others have an ID assigned by the code. So, the identifier could be simplified to a single string. Does FHIR accept this? By default no (it expects from 0 to N Identifiers), but we may be able to extend this and set it as a string. Using a single string to identify resource instances is much more lightweight and will be easier with the revised way to insert/retrieve data from the database (see below).
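
To make the simplification concrete, here is a small sketch that collapses a one-element identifier array into its value. The helper name and the example ID value are made up for illustration; only the value/use shape comes from the description above.

```python
from typing import List


def simplify_identifier(identifiers: List[dict]) -> str:
    """Collapse a one-element identifier array into its plain string value."""
    if len(identifiers) != 1:
        # The simplification relies on there being exactly one identifier.
        raise ValueError(f"expected exactly one identifier, got {len(identifiers)}")
    return identifiers[0]["value"]


# Current shape: an array of {value, use} objects (example value is made up).
current = [{"value": "Hospital1", "use": "official"}]
# Simplified shape: a single string.
simplified = simplify_identifier(current)
```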

Revised pipeline to insert/retrieve data from the database:

1. Create the Hospital instance and upsert it
2. Retrieve the identifier of the hospital instance
3. Create Examination instances, and every 1K upsert them (and delete them from the in-memory arrays)
4. Retrieve the identifiers of all the Examination instances.
5. Create the Sample instances, and every 1K upsert them (...)
6. Retrieve ...
7. Create the Patient instances, and every 1K upsert them (...)
8. Retrieve ...
9. Create the ExaminationRecord instances, and every 1K upsert them (...). Do NOT retrieve them as they are never used as references later in the pipeline.
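
The "every 1K upsert them and delete them from the in-memory arrays" part of the pipeline can be sketched as below. `upsert_in_batches` and the `flush` callback are hypothetical names; in practice `flush` might wrap a PyMongo `collection.bulk_write(...)` call with one `UpdateOne(filter, update, upsert=True)` per instance, but that wiring is an assumption and is not shown.

```python
from typing import Callable, Iterable, List


def upsert_in_batches(
    resources: Iterable[dict],
    flush: Callable[[List[dict]], None],
    batch_size: int = 1000,
) -> int:
    """Send resources to `flush` in chunks of `batch_size`; return how many were sent."""
    buffer: List[dict] = []
    total = 0
    for resource in resources:
        buffer.append(resource)
        if len(buffer) >= batch_size:
            flush(buffer)
            total += len(buffer)
            # Start a fresh list so the flushed instances can be garbage-collected.
            buffer = []
    if buffer:
        # Flush the last, possibly smaller, batch.
        flush(buffer)
        total += len(buffer)
    return total
```

Passing a generator as `resources` keeps at most one batch of instances in memory at a time.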

Side note: Neither the code nor the database will raise an error if the referred instance does not exist, because (i) the code does not check for it, so in principle one could insert the data in any order; and (ii) MongoDB does NOT support foreign key/primary key relationships (it encourages putting related data into a single collection rather than spreading it over multiple collections linked by foreign keys).

Nelly-Barret commented 1 month ago

The above pipeline requires us to maintain in-memory maps from the CSV elements to the actual IDs they have been assigned.

- For Patient and Sample instances, we don't need any map because they already take their ID from the CSV data
- For Hospital instances, we need a map hospital name <-> Hospital ID
- For Examination instances, we need a map CSV column name <-> Examination ID
- For Disease instances, we need a map CSV disease name <-> Disease ID
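
Such a map could be built from the documents retrieved after each upsert step, roughly as follows. The field names `name` and `identifier`, the example values, and the helper `build_id_map` are assumptions for illustration; the sketch only covers the CSV-name-to-ID direction, which is the one needed when building references.

```python
from typing import Dict, Iterable


def build_id_map(
    documents: Iterable[dict], key_field: str, id_field: str = "identifier"
) -> Dict[str, str]:
    """Map the CSV-side key of each retrieved document to its assigned ID."""
    id_map: Dict[str, str] = {}
    for doc in documents:
        id_map[doc[key_field]] = doc[id_field]
    return id_map


# Example with made-up documents, shaped like what a retrieval step might return.
retrieved = [
    {"name": "Hospital A", "identifier": "Hospital1"},
    {"name": "Hospital B", "identifier": "Hospital2"},
]
hospital_ids = build_id_map(retrieved, key_field="name")
```

With PyMongo, `documents` could be the cursor returned by `collection.find(...)` with a projection on the two fields.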

Nelly-Barret commented 4 weeks ago

This is coded at https://github.com/Nelly-Barret/BETTER-fairificator/pull/14

I will have to do it for Disease when I integrate them in the pipeline.

Insert/retrieve Resource instances in/from the database #8