Open Nelly-Barret opened 1 month ago
For 3., this cannot be done with a simple upsert, because an upsert does not return the matched tuple(s). Instead, one can rely on findAndModify (https://stackoverflow.com/questions/16358857/mongodb-atomic-findorcreate-findone-insert-if-nonexistent-but-do-not-update) with upsert=True. In PyMongo, this seems to be find_one_and_update(...).
findAndModify(...)
can:
Bulk.find(<query>).upsert().updateOne(<update>);
I cannot use return_document=BEFORE because it will not return the inserted document; nor can I use a plain return_document=AFTER, because it does not say whether the returned instance was updated or inserted. Maybe we can use the Bulk chain to get the status (found but not updated vs. not found and thus inserted) of the returned element?
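One possible way to get the "found vs. inserted" status, sketched here as an alternative to find_one_and_update(...): call update_one(...) with upsert=True and a $setOnInsert update, then inspect the upserted_id of the result, which PyMongo sets only when the upsert created a document. The function name upsert_if_absent and its return convention are assumptions made for illustration, not code from this repository.

```python
def upsert_if_absent(collection, unique_filter, document):
    """Insert `document` unless a document matching `unique_filter` exists.

    `collection` is expected to behave like a PyMongo Collection.
    Returns ("inserted", _id) or ("found", _id).
    Hypothetical helper, for illustration only.
    """
    # $setOnInsert is applied only when the upsert creates a new document,
    # so an already existing document is left untouched.
    result = collection.update_one(
        unique_filter, {"$setOnInsert": document}, upsert=True
    )
    if result.upserted_id is not None:
        # Nothing matched the filter: the document was inserted.
        return "inserted", result.upserted_id
    # Something matched: fetch the _id of the existing document.
    # This second call is not atomic with the first one.
    existing = collection.find_one(unique_filter, {"_id": 1})
    return "found", existing["_id"]
```

With a real PyMongo collection, the call would look like upsert_if_absent(db["Hospital"], {"name": "H1"}, {"name": "H1"}), where the collection and field names are placeholders.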
After some thinking, I realized that I need to distinguish two things:
What makes a resource unique: in the filter of the update/upsert operations, we need to specify the set of fields that make an instance unique, based on its type.
What is used in references: so far, this is the array of identifiers, each composed of a value and a use. As there is always exactly one identifier per instance, the array seems useless; the use also seems useless, since we know that only Patient and Sample instances have IDs assigned by hospitals, while the others have an ID assigned by the code. So the identifier could be simplified to a single string. Does FHIR accept this? By default no (it expects 0 to N Identifiers), but we may be able to extend it and store the identifier as a string. Using a single string to identify resource instances is much more lightweight and will be easier with the revised way to insert/retrieve data from the database (see below).
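To make the first point concrete, here is a minimal sketch of a per-type uniqueness filter. The field names in UNIQUE_FIELDS are invented placeholders (the actual fields for each resource type are not specified above), and the example documents use a single-string identifier as proposed.

```python
# Hypothetical mapping: resource type -> fields that make an instance unique.
# The field names are placeholders, not the project's actual schema.
UNIQUE_FIELDS = {
    "Hospital": ("name",),
    "Examination": ("code",),
    "Patient": ("identifier",),
}


def unique_filter(resource_type, document):
    """Build the filter of an update/upsert from the type's unique fields."""
    fields = UNIQUE_FIELDS[resource_type]
    return {field: document[field] for field in fields}


hospital = {"identifier": "Hospital/1", "name": "H1", "city": "X"}
print(unique_filter("Hospital", hospital))  # {'name': 'H1'}
```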
Revised pipeline to insert/retrieve data from the database:
1. Create the Hospital instance and upsert it.
2. Create Examination instances, and every 1K upsert them (and delete them from the in-memory arrays).
3. Create Sample instances, and every 1K upsert them (...).
4. Create Patient instances, and every 1K upsert them (...).
5. Create ExaminationRecord instances, and every 1K upsert them (...). Do NOT retrieve them, as they are never used as references later in the pipeline.

Side note: neither the code nor the database will raise an error if a referenced instance does not exist, because (i) the code does not check for it, so in principle one could insert the data in any order; and (ii) MongoDB does NOT support foreign-key/primary-key relationships (it encourages putting related data into a single collection rather than spreading it over multiple collections and linking them through foreign keys).
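The "every 1K upsert them" steps can be sketched as a small buffering helper. This is only an illustration: upsert_in_batches and its upsert_many callback are hypothetical names, and the callback stands in for whatever function actually writes a batch to the database.

```python
def upsert_in_batches(instances, upsert_many, batch_size=1000):
    """Pass `instances` to `upsert_many` in chunks of at most `batch_size`.

    Returns the total number of instances handed to `upsert_many`.
    """
    buffer = []
    total = 0
    for instance in instances:
        buffer.append(instance)
        if len(buffer) >= batch_size:
            upsert_many(buffer)
            total += len(buffer)
            buffer = []  # start a fresh list so flushed instances can be freed
    if buffer:
        # Flush the last, partial batch.
        upsert_many(buffer)
        total += len(buffer)
    return total


# Usage: record the size of each batch instead of writing to a database.
sizes = []
count = upsert_in_batches(range(2500), lambda batch: sizes.append(len(batch)))
print(count, sizes)  # 2500 [1000, 1000, 500]
```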
The above pipeline requires us to maintain in-memory maps from the CSV elements to the IDs they have been assigned:
hospital name <-> Hospital ID
CSV column name <-> Examination ID
CSV disease name <-> Disease ID
This is coded at https://github.com/Nelly-Barret/BETTER-fairificator/pull/14
I will have to do the same for Disease when I integrate it into the pipeline.
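For illustration only (this is not the code from the pull request), such a map can be a thin wrapper around a dict that fails loudly when a reference is missing, which is a check that neither the pipeline nor MongoDB performs on its own. The class name ReferenceMap, its methods, and the example IDs are assumptions.

```python
class ReferenceMap:
    """In-memory map from a CSV element (e.g. a column name) to a resource ID."""

    def __init__(self, resource_type):
        self.resource_type = resource_type
        self._ids = {}

    def remember(self, key, resource_id):
        """Record the ID assigned to the instance built from `key`."""
        self._ids[key] = resource_id

    def resolve(self, key):
        """Return the ID recorded for `key`, or raise a descriptive error."""
        try:
            return self._ids[key]
        except KeyError:
            raise KeyError(
                f"No {self.resource_type} ID recorded for {key!r}"
            ) from None


# Usage: CSV column name -> Examination ID (the ID format is a placeholder).
examinations = ReferenceMap("Examination")
examinations.remember("weight", "Examination/1")
print(examinations.resolve("weight"))  # Examination/1
```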
Following #3, I am able to upsert one or many Resource instances. Now, the ETL needs to know what to do with its own (Python) Resource instances vs. what is in the database.
The idea is to upsert the Resource instance, i.e., insert it if it does not exist and do nothing otherwise. However, this may cause a discrepancy between the objects living in memory and the database. To avoid this, I think it is more reasonable to:
upsert_one_tuple()
method with the corresponding Resource instances