Nelly-Barret / BETTER-fairificator

The fairification tools for BETTER project.
https://www.better-health-project.eu/
0 stars 0 forks source link

Manage identifiers #41

Closed Nelly-Barret closed 2 weeks ago

Nelly-Barret commented 2 weeks ago
Nelly-Barret commented 2 weeks ago

How to get the max value of a field in a table in MongoDB?

db["tablename"].find().sort({"fieldname": -1}).limit(1)

Update (see below): we cannot simply do it like this because we need to convert IDs to long values (int is too small) to have a numeric comparison, not a string one.

Nelly-Barret commented 2 weeks ago

I created the class Identifier at https://github.com/Nelly-Barret/BETTER-fairificator/pull/46

Nelly-Barret commented 2 weeks ago

It is not possible to sort simply sort the resources by their identifier because their identifiers are strings for now, thus we may have this:

better_default> db["ExaminationRecord"].aggregate([{'$project': {'identifier.value': 1}}, {'$sort': {'identifier.value': -1}}, {'$limit': 4}])
[
  {
    _id: ObjectId('666ac699db796b91f974f07e'),
    identifier: { value: 'ExaminationRecord/99' }
  },
  {
    _id: ObjectId('666ac699db796b91f974f0be'),
    identifier: { value: 'ExaminationRecord/412' }
  },
  {
    _id: ObjectId('666ac699db796b91f974f0bd'),
    identifier: { value: 'ExaminationRecord/411' }
  },
  {
    _id: ObjectId('666ac699db796b91f974f0bc'),
    identifier: { value: 'ExaminationRecord/410' }
  }
]

It seems that there are two options:

  1. build int identifiers, such as 1, 2, 3, etc, and use the resource type to build a unique URI out of it when needed
  2. parse the strings and locally convert them to int within the MongoDB query which computes the min/max.

It seems that option 2. is better because (i) we do not have to compute the URI of the resource whenever we need to refer to it, and (ii) the min/max in the ETL will be used only once when the ETL script is called (and the catalogue will implement its own min/max and other aggregation functions)

Nelly-Barret commented 2 weeks ago

With a Long-converted value, I can get the correct max from the db:

better_default> db["ExaminationRecord"].aggregate([
  {"$project": { "identifier.value": { "$split": ["$identifier.value", "/"]}}}, 
  {"$unwind": "$identifier.value"}, 
  {"$match": {"identifier.value": /[0-9]+/}}, 
  {"$group": {"_id": "identifier.value", "Max": {"$max": { "$toLong": "$identifier.value" }}}}
])

returns

[ { _id: 'identifier.value', Max: Long('412') } ]

(and not 99) as in the previous post

Nelly-Barret commented 2 weeks ago

I implemented the usage of the current max id in the database. I can still see a bug: several ExaminationRecord instances have the same identifier. This may be because I have to increment the counter by 1 after set it to the maximum, otherwise we start again at the max value (thus creating duplicate identifiers).

better_batch> db["ExaminationRecord"].find({"identifier.value": "ExaminationRecord/99"})
[
  {
    _id: ObjectId('666b105bdb796b91f9750db1'),
    basedOn: { reference: 'Sample/20LD042587', type: 'Sample' },
    instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
    recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
    subject: { reference: 'Patient/4.2165648176126e+18', type: 'Patient' },
    identifier: { value: 'ExaminationRecord/99' },
    insertedAt: '06/16/2024, 17:29:31',
    resourceType: 'ExaminationRecord',
    value: 374
  },
  {
    _id: ObjectId('666b107edb796b91f9750ec5'),
    basedOn: { reference: 'Sample/20LD811192', type: 'Sample' },
    instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
    subject: { reference: 'Patient/6.73808832627831e+18', type: 'Patient' },
    recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
    identifier: { value: 'ExaminationRecord/99' },
    insertedAt: '06/18/2024, 17:30:06',
    resourceType: 'ExaminationRecord',
    value: '0,32'
  },
  {
    _id: ObjectId('666b1094db796b91f9750fd9'),
    basedOn: { reference: 'Sample/20LD811196', type: 'Sample' },
    instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
    subject: { reference: 'Patient/-5.74936744641016e+18', type: 'Patient' },
    recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
    identifier: { value: 'ExaminationRecord/99' },
    insertedAt: '06/19/2024, 17:30:28',
    resourceType: 'ExaminationRecord',
    value: 242
  }
]

See also the Sample identifier which is not properly set (it should be a dict with a value). See issue https://github.com/Nelly-Barret/BETTER-fairificator/issues/48

Update: The bug was not here, the problem was that I was computing the max ID but I forgot to set the counter to this value 👹

Nelly-Barret commented 2 weeks ago

To easily check wether there are duplicate identifiers, we can check for duplicates like this: (https://stackoverflow.com/questions/14770170/how-to-find-mongo-documents-with-a-same-field):

db.mycollection.aggregate(
    { $group: { 
        // Group by fields to match on (a,b)
        _id: { a: "$a", b: "$b" },

        // Count number of matching docs for the group
        count: { $sum:  1 },

        // Save the _id for matching docs
        docs: { $push: "$_id" }
    }},

    // Limit results to duplicates (more than 1 match) 
    { $match: {
        count: { $gt : 1 }
    }}
)

The query to check duplicates returns nothing:

db.mycollection.aggregate({ "$group": { _id: { "identifier": "$identifier.value" }, count: {$sum: 1}, docs: {$push: "$_id"}}}, {$match: {count: { $gt: 1}}})

However, there ARE duplicates:

better_batch> db["ExaminationRecord"].find({"identifier.value": "ExaminationRecord/99"})
[
  {
    _id: ObjectId('666bebd1cf94003d0a36a619'),
    basedOn: { reference: 'Sample/20LD042587', type: 'Sample' },
    instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
    subject: { reference: 'Patient/4.2165648176126e+18', type: 'Patient' },
    recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
    identifier: { value: 'ExaminationRecord/99' },
    insertedAt: '06/16/2024, 09:05:53',
    resourceType: 'ExaminationRecord',
    value: 374
  },
  {
    _id: ObjectId('666beccccf94003d0a36a762'),
    subject: { reference: 'Patient/6.73808832627831e+18', type: 'Patient' },
    instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
    recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
    basedOn: { reference: 'Sample/20LD811192', type: 'Sample' },
    identifier: { value: 'ExaminationRecord/99' },
    insertedAt: '06/23/2024, 09:10:04',
    resourceType: 'ExaminationRecord',
    value: '0,32'
  },
  {
    _id: ObjectId('666becdbcf94003d0a36a874'),
    subject: { reference: 'Patient/-5.74936744641016e+18', type: 'Patient' },
    recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
    instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
    basedOn: { reference: 'Sample/20LD811196', type: 'Sample' },
    identifier: { value: 'ExaminationRecord/99' },
    insertedAt: '06/16/2024, 09:10:19',
    resourceType: 'ExaminationRecord',
    value: 242
  }
]
Nelly-Barret commented 2 weeks ago

I found the suitable query (https://stackoverflow.com/a/26985011):

db["ExaminationRecord"].aggregate([
    { "$group": { "_id": "$identifier", "count": { "$sum": 1 } } }, 
    { "$match": { "_id": { "$ne": null }, "count": { "$gt": 1 } } }, 
    { "$project": { "name": "$_id", "_id": 0 } }
])

It does identify the duplicates when the identiier is simple (a str) because it returns several lines, including the ExaminationRecord/99. However, it returns nothing with the complex identifier ({ "identifier": { "value" : "A/1" }}); so I don't know this is whether I fixed the resource counter or the query does not find duplicates for complex variables...

Check for dupliated IDs

Nelly-Barret commented 2 weeks ago

The solution with the Counter (https://stackoverflow.com/questions/1045344/how-do-you-create-an-incremental-id-in-a-python-class) works well when running the code, but I have troubles in tests (probably because this is a static integer, thus is not reset appropriately).

Instead of clearing the Counter before each test (which would be a pain), I prefer to make it non-static (and this will also simply our lives if we develop further scripts/web pages, etc).

I may follow the model used in ConnectionLens:

Similarly, I could:

Nelly-Barret commented 2 weeks ago

Done and that woks well!!

Merged at https://github.com/Nelly-Barret/BETTER-fairificator/commit/ac52259ec23e9642970b86047bd1b570d91d1847