Closed Nelly-Barret closed 2 weeks ago
An efficient approach is to use upsert, i.e., update-or-insert:
https://www.mongodb.com/docs/manual/reference/method/db.collection.updateMany/
I could upsert. This is done with the following line:

```python
self.db[table_name].update_one(filter=filter_dict, update={"$setOnInsert": one_tuple}, upsert=True)
```
where:

- `filter` is a dict with the fields that make a `Resource` instance unique, e.g., the name of a `Hospital` instance, etc.
- `one_tuple` is the "real" `Resource` instance to insert if it does not exist.

Concerning the update policy:

- `$set` replaces the matching tuple with the given resource, or inserts it if no match is found.
- `$setOnInsert` does not update the matching tuple; it only inserts it if no match is found.

SO post about `$setOnInsert`: https://stackoverflow.com/questions/30745474/mongodb-upsert-with-empty-update-document
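To make the `$setOnInsert` semantics concrete, here is a minimal pure-Python simulation (no real database involved; `collection` is just a list of dicts standing in for a MongoDB collection):

```python
def upsert_set_on_insert(collection, filter_dict, doc):
    """Mimic update_one(filter_dict, {"$setOnInsert": doc}, upsert=True):
    if a document matches the filter, leave it untouched; otherwise insert."""
    for existing in collection:
        if all(existing.get(k) == v for k, v in filter_dict.items()):
            return False  # match found: $setOnInsert does NOT modify it
    # no match: MongoDB inserts a document combining the filter's equality
    # fields with the $setOnInsert fields
    collection.append({**filter_dict, **doc})
    return True

hospitals = [{"name": "General", "beds": 100}]
upsert_set_on_insert(hospitals, {"name": "General"}, {"beds": 999})  # no-op
upsert_set_on_insert(hospitals, {"name": "Central"}, {"beds": 50})   # inserts
```

After these two calls, the existing "General" document still has `beds == 100`, and only "Central" was added.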
Interesting post discussing bulk operations for upsert: https://stackoverflow.com/questions/5292370/fast-or-bulk-upsert-in-pymongo
Official MongoDB doc about Bulk operations and upsert: https://www.mongodb.com/docs/manual/reference/method/Bulk.find.upsert/
Okay, so a bulk write with X upsert operations seems to be the way to go: https://stackoverflow.com/questions/63831785/how-can-i-make-a-bulk-upsert-query-with-pymongo
However, this will apparently still execute each statement one at a time, which does not look efficient, even with indexes 🤔
A bulk operation allows sending many operations in a single call to the database, instead of making as many calls to the db as there are operations.
https://www.mongodb.com/docs/languages/python/pymongo-driver/current/write/bulk-write/
I think that, in the end, one cannot avoid looping over all the operations to perform... So at least we make a single call to the database. I think we can send 1000 operations at a time, to limit database calls without overloading the system with hundreds of operations in a single call, especially because a bulk operation is limited to 16 MB by MongoDB.
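A sketch of that batching idea: the `batched` helper below is plain, runnable Python; the pymongo usage is shown in comments since it needs a live database, and `make_filter` is a hypothetical helper that builds the unique-identification filter for a resource.

```python
from itertools import islice

def batched(operations, size=1000):
    """Split a sequence of operations into chunks of at most `size`,
    so each bulk_write call stays well under MongoDB's 16 MB limit."""
    it = iter(operations)
    while chunk := list(islice(it, size)):
        yield chunk

# Intended pymongo usage (requires a running MongoDB; make_filter is assumed):
# from pymongo import UpdateOne
# ops = [UpdateOne(make_filter(r), {"$setOnInsert": r}, upsert=True)
#        for r in resources]
# for chunk in batched(ops, 1000):
#     self.db[table_name].bulk_write(chunk, ordered=False)

sizes = [len(chunk) for chunk in batched(range(2500), 1000)]
# sizes == [1000, 1000, 500]
```

`ordered=False` lets MongoDB keep processing the remaining operations in a chunk even if one of them fails, which fits this insert-if-missing use case.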
Regarding `updateOne()` vs `updateMany()`:

- `UpdateOne` updates the first document that matches your query filter.
- `UpdateMany` updates all documents that match your query filter.

So, do I need `UpdateOne` or `UpdateMany`?
In principle, the filter will select only one document, since we are supposed to be able to identify each Resource uniquely. In reality, whether we match one or several documents does not matter much, because we do not update the data (see #4); we simply insert if no document matches the filter.
See #8 for more details about how I implemented the resource identification
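A possible sketch of that identification step, building the upsert filter from the fields that make each resource type unique (the field names below are assumptions for illustration, not the actual schema; see #8 for the real implementation):

```python
# Hypothetical mapping from resource type to its identifying fields,
# e.g. a Hospital's name or a Sample's unique barcode.
UNIQUE_FIELDS = {
    "Hospital": ["name"],
    "Sample": ["sample_barcode"],
}

def make_filter(resource_type, doc):
    """Build the filter dict that uniquely identifies this resource."""
    return {field: doc[field] for field in UNIQUE_FIELDS[resource_type]}

make_filter("Hospital", {"name": "General", "city": "Paris"})
# {'name': 'General'}
```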
In order not to delete and rebuild the whole database whenever the ETL script is run, it is important to be able to detect when we are trying to insert instances that already exist in the database.
Hospital instances
Patient instances
Sample instances
The same applies for Sample instances because they have a unique SampleBarcode.
Examination instances
ExaminationRecord instances
`(id, status, value, recorded_by = Reference(hospital), based_on = Reference(sample), instantiate = Reference(examination), subject = Reference(subject))`
registered
Disease and DiseaseRecord instances
The same applies as has been described for Examination and ExaminationRecord instances.