Manage identifiers - Githubissues

Nelly-Barret commented 2 weeks ago

[x] I need to create a class for Identifiers to respect my data model (and be more FHIR-compliant)
[x] I also need to get the latest ID in the database (over all the tables except Patient and Sample which have their IDs assigned by the hospitals) and set the Resource COUNTER_ID with that max value. Otherwise, everything the ETL is run, it will start number resources with id 1.

Nelly-Barret commented 2 weeks ago

How to get the max value of a field in a table in MongoDB?

db["tablename"].find().sort({"fieldname": -1}).limit(1)

Update (see below): we cannot simply do it like this because we need to convert IDs to long values (int is too small) to have a numeric comparison, not a string one.

Nelly-Barret commented 2 weeks ago

I created the class Identifier at https://github.com/Nelly-Barret/BETTER-fairificator/pull/46

Nelly-Barret commented 2 weeks ago

It is not possible to sort simply sort the resources by their identifier because their identifiers are strings for now, thus we may have this:

better_default> db["ExaminationRecord"].aggregate([{'$project': {'identifier.value': 1}}, {'$sort': {'identifier.value': -1}}, {'$limit': 4}])
[
  {
    _id: ObjectId('666ac699db796b91f974f07e'),
    identifier: { value: 'ExaminationRecord/99' }
  },
  {
    _id: ObjectId('666ac699db796b91f974f0be'),
    identifier: { value: 'ExaminationRecord/412' }
  },
  {
    _id: ObjectId('666ac699db796b91f974f0bd'),
    identifier: { value: 'ExaminationRecord/411' }
  },
  {
    _id: ObjectId('666ac699db796b91f974f0bc'),
    identifier: { value: 'ExaminationRecord/410' }
  }
]

It seems that there are two options:

build int identifiers, such as 1, 2, 3, etc, and use the resource type to build a unique URI out of it when needed
parse the strings and locally convert them to int within the MongoDB query which computes the min/max.

It seems that option 2. is better because (i) we do not have to compute the URI of the resource whenever we need to refer to it, and (ii) the min/max in the ETL will be used only once when the ETL script is called (and the catalogue will implement its own min/max and other aggregation functions)

Nelly-Barret commented 2 weeks ago

With a Long-converted value, I can get the correct max from the db:

better_default> db["ExaminationRecord"].aggregate([
  {"$project": { "identifier.value": { "$split": ["$identifier.value", "/"]}}}, 
  {"$unwind": "$identifier.value"}, 
  {"$match": {"identifier.value": /[0-9]+/}}, 
  {"$group": {"_id": "identifier.value", "Max": {"$max": { "$toLong": "$identifier.value" }}}}
])

returns

[ { _id: 'identifier.value', Max: Long('412') } ]

(and not 99) as in the previous post

Nelly-Barret commented 2 weeks ago

I implemented the usage of the current max id in the database. I can still see a bug: several ExaminationRecord instances have the same identifier. This may be because I have to increment the counter by 1 after set it to the maximum, otherwise we start again at the max value (thus creating duplicate identifiers).

better_batch> db["ExaminationRecord"].find({"identifier.value": "ExaminationRecord/99"})
[
  {
    _id: ObjectId('666b105bdb796b91f9750db1'),
    basedOn: { reference: 'Sample/20LD042587', type: 'Sample' },
    instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
    recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
    subject: { reference: 'Patient/4.2165648176126e+18', type: 'Patient' },
    identifier: { value: 'ExaminationRecord/99' },
    insertedAt: '06/16/2024, 17:29:31',
    resourceType: 'ExaminationRecord',
    value: 374
  },
  {
    _id: ObjectId('666b107edb796b91f9750ec5'),
    basedOn: { reference: 'Sample/20LD811192', type: 'Sample' },
    instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
    subject: { reference: 'Patient/6.73808832627831e+18', type: 'Patient' },
    recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
    identifier: { value: 'ExaminationRecord/99' },
    insertedAt: '06/18/2024, 17:30:06',
    resourceType: 'ExaminationRecord',
    value: '0,32'
  },
  {
    _id: ObjectId('666b1094db796b91f9750fd9'),
    basedOn: { reference: 'Sample/20LD811196', type: 'Sample' },
    instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
    subject: { reference: 'Patient/-5.74936744641016e+18', type: 'Patient' },
    recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
    identifier: { value: 'ExaminationRecord/99' },
    insertedAt: '06/19/2024, 17:30:28',
    resourceType: 'ExaminationRecord',
    value: 242
  }
]

See also the Sample identifier which is not properly set (it should be a dict with a value). See issue https://github.com/Nelly-Barret/BETTER-fairificator/issues/48

Update: The bug was not here, the problem was that I was computing the max ID but I forgot to set the counter to this value 👹

Nelly-Barret commented 2 weeks ago

To easily check wether there are duplicate identifiers, we can check for duplicates like this: (https://stackoverflow.com/questions/14770170/how-to-find-mongo-documents-with-a-same-field):

db.mycollection.aggregate(
    { $group: { 
        // Group by fields to match on (a,b)
        _id: { a: "$a", b: "$b" },

        // Count number of matching docs for the group
        count: { $sum:  1 },

        // Save the _id for matching docs
        docs: { $push: "$_id" }
    }},

    // Limit results to duplicates (more than 1 match) 
    { $match: {
        count: { $gt : 1 }
    }}
)

The query to check duplicates returns nothing:

db.mycollection.aggregate({ "$group": { _id: { "identifier": "$identifier.value" }, count: {$sum: 1}, docs: {$push: "$_id"}}}, {$match: {count: { $gt: 1}}})

However, there ARE duplicates:

better_batch> db["ExaminationRecord"].find({"identifier.value": "ExaminationRecord/99"})
[
  {
    _id: ObjectId('666bebd1cf94003d0a36a619'),
    basedOn: { reference: 'Sample/20LD042587', type: 'Sample' },
    instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
    subject: { reference: 'Patient/4.2165648176126e+18', type: 'Patient' },
    recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
    identifier: { value: 'ExaminationRecord/99' },
    insertedAt: '06/16/2024, 09:05:53',
    resourceType: 'ExaminationRecord',
    value: 374
  },
  {
    _id: ObjectId('666beccccf94003d0a36a762'),
    subject: { reference: 'Patient/6.73808832627831e+18', type: 'Patient' },
    instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
    recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
    basedOn: { reference: 'Sample/20LD811192', type: 'Sample' },
    identifier: { value: 'ExaminationRecord/99' },
    insertedAt: '06/23/2024, 09:10:04',
    resourceType: 'ExaminationRecord',
    value: '0,32'
  },
  {
    _id: ObjectId('666becdbcf94003d0a36a874'),
    subject: { reference: 'Patient/-5.74936744641016e+18', type: 'Patient' },
    recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
    instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
    basedOn: { reference: 'Sample/20LD811196', type: 'Sample' },
    identifier: { value: 'ExaminationRecord/99' },
    insertedAt: '06/16/2024, 09:10:19',
    resourceType: 'ExaminationRecord',
    value: 242
  }
]

Nelly-Barret commented 2 weeks ago

I found the suitable query (https://stackoverflow.com/a/26985011):

db["ExaminationRecord"].aggregate([
    { "$group": { "_id": "$identifier", "count": { "$sum": 1 } } }, 
    { "$match": { "_id": { "$ne": null }, "count": { "$gt": 1 } } }, 
    { "$project": { "name": "$_id", "_id": 0 } }
])

It does identify the duplicates when the identiier is simple (a str) because it returns several lines, including the ExaminationRecord/99. However, it returns nothing with the complex identifier ({ "identifier": { "value" : "A/1" }}); so I don't know this is whether I fixed the resource counter or the query does not find duplicates for complex variables...

Check for dupliated IDs

One easy check to do is to do a test which: (i) get all the identfiers in the database; (ii) load them in a list in python; (iii) convert that list in a set; (iv) if they are of the same size, this means there are no duplicates!
Another way to go is to set the counter to 1 and to never increment it so that all resources have the ID 1. In that case, the query indeed reports that ExaminationRecord/1 is a duplicate. So, we should be god here, i.e., we should be able to identify duplicates if they exist.

Nelly-Barret commented 2 weeks ago

The solution with the Counter (https://stackoverflow.com/questions/1045344/how-do-you-create-an-incremental-id-in-a-python-class) works well when running the code, but I have troubles in tests (probably because this is a static integer, thus is not reset appropriately).

Instead of clearing the Counter before each test (which would be a pain), I prefer to make it non-static (and this will also simply our lives if we develop further scripts/web pages, etc).

I may follow the model used in ConnectionLens:

There is a class RelationalGraph, which has a nodeIdcounter variable (non static)
The graph is given to a datasource
The datasource uses that graph nodeIdCounter when parsing the file and generating the nodes.

Similarly, I could:

have a NON-static class Counter, with a NON-static counter
create a counter at the beginning of the ETL
pass it through the steps of the ETL
for the tests: one counter will be create for each test (this makes sense as tests should test only one feature at a time).

Nelly-Barret commented 2 weeks ago

Done and that woks well!!

Merged at https://github.com/Nelly-Barret/BETTER-fairificator/commit/ac52259ec23e9642970b86047bd1b570d91d1847

Nelly-Barret / BETTER-fairificator

Manage identifiers #41

Check for dupliated IDs