Nelly-Barret / BETTER-fairificator

The fairification tools for BETTER project.
https://www.better-health-project.eu/
0 stars 0 forks source link

What does make an instance unique? #3

Closed Nelly-Barret closed 2 weeks ago

Nelly-Barret commented 1 month ago

In order to not delete and rebuild the whole database whenever the ETL script is run, it is important to be able to detect that/when we are trying to insert instances already existing in the database.

Hospital instances

Patient instances

Sample instances

The same applies or Sample instances because they have a unique SampleBarcode.

Examination instances

ExaminationRecord instances

Disease and DiseaseRecord instances

Same applies for what as been described for Exmination and ExaminationRecord instances.

Nelly-Barret commented 1 month ago

A way to be efficient is to use upsert, i.e., update or insert:

https://www.mongodb.com/docs/manual/reference/method/db.collection.updateMany/

Nelly-Barret commented 1 month ago

I could upsert:

This is done with the following line: self.db[table_name].update_one(filter=filter_dict, update={"$setOnInsert": one_tuple}, upsert=True) where:

Concerning the update policy:

SO post about $setOnInsert: https://stackoverflow.com/questions/30745474/mongodb-upsert-with-empty-update-document

Nelly-Barret commented 1 month ago

Interesting post discussing bulk operations for upsert: https://stackoverflow.com/questions/5292370/fast-or-bulk-upsert-in-pymongo

Official MongoDB doc about Bulk operations and upsert: https://www.mongodb.com/docs/manual/reference/method/Bulk.find.upsert/

Okayy so the Bulk write with X upsert operation seems to be the way to go: https://stackoverflow.com/questions/63831785/how-can-i-make-a-bulk-upsert-query-with-pymongo

However, this will apparently still execute each statement one at a time; which does not look efficient, even with indexes 🤔

Nelly-Barret commented 1 month ago

A Bulk operation allows to send many operations within a single call to the database instead of having to do as many calls to the db as they are operations.

https://www.mongodb.com/docs/languages/python/pymongo-driver/current/write/bulk-write/

I think that, in the end, one cannot afford to not do a loop over all the operations to do... So at least, we have a single call to the database. I think we can send 1000 operations at a time, in order to limit database calls but not overload the system with hunderds of operations in a single database call. Especially because a bulk operation is limited to 16Mb by MongoDB.

Nelly-Barret commented 1 month ago

Regarding updateOne() vs updateMany():

So, do I need UpdateOne or UpdateMany?
In principle, the filter will select only one document as it is supposed that we are able to identify uniquely each Resource. In reality, the fact that we match one or several documents does not matter much because we do not update the data (see #4), we simply insert if no document matches the filter.

Nelly-Barret commented 2 weeks ago

See #8 for more details about how I implemented the resource identification