Closed Nelly-Barret closed 2 weeks ago
An efficient approach is to use upsert, i.e., update-or-insert:
https://www.mongodb.com/docs/manual/reference/method/db.collection.updateMany/
I could upsert. This is done with the following line:

```python
self.db[table_name].update_one(filter=filter_dict, update={"$setOnInsert": one_tuple}, upsert=True)
```
where:

- `filter` is a dict with the fields that make a `Resource` instance unique, e.g., the name of a `Hospital` instance, etc.
- `one_tuple` is the "real" `Resource` instance to insert if it does not exist.

Concerning the update policy:

- `$set` replaces the matching tuple with the given resource, or inserts it if no match is found.
- `$setOnInsert` does not update the matching tuple; it only inserts it if no match is found.

SO post about `$setOnInsert`: https://stackoverflow.com/questions/30745474/mongodb-upsert-with-empty-update-document
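To make the `$setOnInsert` semantics concrete, here is a minimal pure-Python simulation (no real database involved; `collection` is just a list of dicts standing in for a MongoDB collection):

```python
def upsert_set_on_insert(collection, filter_dict, doc):
    """Mimic update_one(filter_dict, {"$setOnInsert": doc}, upsert=True):
    if a document matches the filter, leave it untouched; otherwise insert."""
    for existing in collection:
        if all(existing.get(k) == v for k, v in filter_dict.items()):
            return False  # match found: $setOnInsert does NOT modify it
    # no match: MongoDB inserts a document combining the filter's equality
    # fields with the $setOnInsert fields
    collection.append({**filter_dict, **doc})
    return True

hospitals = [{"name": "General", "beds": 100}]
upsert_set_on_insert(hospitals, {"name": "General"}, {"beds": 999})  # no-op
upsert_set_on_insert(hospitals, {"name": "Central"}, {"beds": 50})   # inserts
```

After these two calls, the existing "General" document still has `beds == 100`, and only "Central" was added.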
Interesting post discussing bulk operations for upsert: https://stackoverflow.com/questions/5292370/fast-or-bulk-upsert-in-pymongo
Official MongoDB doc about Bulk operations and upsert: https://www.mongodb.com/docs/manual/reference/method/Bulk.find.upsert/
Okay, so a bulk write with X upsert operations seems to be the way to go: https://stackoverflow.com/questions/63831785/how-can-i-make-a-bulk-upsert-query-with-pymongo
However, this will apparently still execute each statement one at a time, which does not look efficient, even with indexes 🤔
A bulk operation allows sending many operations in a single call to the database, instead of making as many calls to the db as there are operations.
https://www.mongodb.com/docs/languages/python/pymongo-driver/current/write/bulk-write/
I think that, in the end, one cannot avoid looping over all the operations to perform... So at least we make a single call to the database. I think we can send 1000 operations at a time, to limit database calls without overloading the system with hundreds of operations in a single call, especially because a bulk operation is limited to 16 MB by MongoDB.
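A sketch of that batching idea: the `batched` helper below is plain, runnable Python; the pymongo usage is shown in comments since it needs a live database, and `make_filter` is a hypothetical helper that builds the unique-identification filter for a resource.

```python
from itertools import islice

def batched(operations, size=1000):
    """Split a sequence of operations into chunks of at most `size`,
    so each bulk_write call stays well under MongoDB's 16 MB limit."""
    it = iter(operations)
    while chunk := list(islice(it, size)):
        yield chunk

# Intended pymongo usage (requires a running MongoDB; make_filter is assumed):
# from pymongo import UpdateOne
# ops = [UpdateOne(make_filter(r), {"$setOnInsert": r}, upsert=True)
#        for r in resources]
# for chunk in batched(ops, 1000):
#     self.db[table_name].bulk_write(chunk, ordered=False)

sizes = [len(chunk) for chunk in batched(range(2500), 1000)]
# sizes == [1000, 1000, 500]
```

`ordered=False` lets MongoDB keep processing the remaining operations in a chunk even if one of them fails, which fits this insert-if-missing use case.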
Regarding `updateOne()` vs `updateMany()`:

- `UpdateOne` updates the first document that matches your query filter.
- `UpdateMany` updates all documents that match your query filter.

So, do I need `UpdateOne` or `UpdateMany`?
In principle, the filter will select only one document, since we are supposed to be able to identify each Resource uniquely. In reality, whether we match one or several documents does not matter much, because we do not update the data (see #4); we simply insert if no document matches the filter.
See #8 for more details about how I implemented the resource identification
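A possible sketch of that identification step, building the upsert filter from the fields that make each resource type unique (the field names below are assumptions for illustration, not the actual schema; see #8 for the real implementation):

```python
# Hypothetical mapping from resource type to its identifying fields,
# e.g. a Hospital's name or a Sample's unique barcode.
UNIQUE_FIELDS = {
    "Hospital": ["name"],
    "Sample": ["sample_barcode"],
}

def make_filter(resource_type, doc):
    """Build the filter dict that uniquely identifies this resource."""
    return {field: doc[field] for field in UNIQUE_FIELDS[resource_type]}

make_filter("Hospital", {"name": "General", "city": "Paris"})
# {'name': 'General'}
```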
In order not to delete and rebuild the whole database whenever the ETL script is run, it is important to be able to detect when we are trying to insert instances that already exist in the database.
Hospital instances
Patient instances
Sample instances
The same applies for Sample instances because they have a unique SampleBarcode.
Examination instances
ExaminationRecord instances
`(id, status, value, recorded_by = Reference(hospital), based_on = Reference(sample), instantiate = Reference(examination), subject = Reference(subject))`
registered
Disease and DiseaseRecord instances
The same applies as has been described for Examination and ExaminationRecord instances.