Closed Nelly-Barret closed 2 weeks ago
How to get the max value of a field in a table in MongoDB?
db["tablename"].find().sort({"fieldname": -1}).limit(1)
Update (see below): we cannot simply do it like this because we need to convert IDs to long values (int is too small) to have a numeric comparison, not a string one.
I created the class Identifier
at https://github.com/Nelly-Barret/BETTER-fairificator/pull/46
It is not possible to sort simply sort the resources by their identifier because their identifiers are strings for now, thus we may have this:
better_default> db["ExaminationRecord"].aggregate([{'$project': {'identifier.value': 1}}, {'$sort': {'identifier.value': -1}}, {'$limit': 4}])
[
{
_id: ObjectId('666ac699db796b91f974f07e'),
identifier: { value: 'ExaminationRecord/99' }
},
{
_id: ObjectId('666ac699db796b91f974f0be'),
identifier: { value: 'ExaminationRecord/412' }
},
{
_id: ObjectId('666ac699db796b91f974f0bd'),
identifier: { value: 'ExaminationRecord/411' }
},
{
_id: ObjectId('666ac699db796b91f974f0bc'),
identifier: { value: 'ExaminationRecord/410' }
}
]
It seems that there are two options:
1
, 2
, 3
, etc, and use the resource type to build a unique URI out of it when neededIt seems that option 2. is better because (i) we do not have to compute the URI of the resource whenever we need to refer to it, and (ii) the min/max in the ETL will be used only once when the ETL script is called (and the catalogue will implement its own min/max and other aggregation functions)
With a Long-converted value, I can get the correct max from the db:
better_default> db["ExaminationRecord"].aggregate([
{"$project": { "identifier.value": { "$split": ["$identifier.value", "/"]}}},
{"$unwind": "$identifier.value"},
{"$match": {"identifier.value": /[0-9]+/}},
{"$group": {"_id": "identifier.value", "Max": {"$max": { "$toLong": "$identifier.value" }}}}
])
returns
[ { _id: 'identifier.value', Max: Long('412') } ]
(and not 99) as in the previous post
I implemented the usage of the current max id in the database. I can still see a bug: several ExaminationRecord instances have the same identifier. This may be because I have to increment the counter by 1 after set it to the maximum, otherwise we start again at the max value (thus creating duplicate identifiers).
better_batch> db["ExaminationRecord"].find({"identifier.value": "ExaminationRecord/99"})
[
{
_id: ObjectId('666b105bdb796b91f9750db1'),
basedOn: { reference: 'Sample/20LD042587', type: 'Sample' },
instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
subject: { reference: 'Patient/4.2165648176126e+18', type: 'Patient' },
identifier: { value: 'ExaminationRecord/99' },
insertedAt: '06/16/2024, 17:29:31',
resourceType: 'ExaminationRecord',
value: 374
},
{
_id: ObjectId('666b107edb796b91f9750ec5'),
basedOn: { reference: 'Sample/20LD811192', type: 'Sample' },
instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
subject: { reference: 'Patient/6.73808832627831e+18', type: 'Patient' },
recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
identifier: { value: 'ExaminationRecord/99' },
insertedAt: '06/18/2024, 17:30:06',
resourceType: 'ExaminationRecord',
value: '0,32'
},
{
_id: ObjectId('666b1094db796b91f9750fd9'),
basedOn: { reference: 'Sample/20LD811196', type: 'Sample' },
instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
subject: { reference: 'Patient/-5.74936744641016e+18', type: 'Patient' },
recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
identifier: { value: 'ExaminationRecord/99' },
insertedAt: '06/19/2024, 17:30:28',
resourceType: 'ExaminationRecord',
value: 242
}
]
See also the Sample identifier which is not properly set (it should be a dict with a value). See issue https://github.com/Nelly-Barret/BETTER-fairificator/issues/48
Update: The bug was not here, the problem was that I was computing the max ID but I forgot to set the counter to this value 👹
To easily check wether there are duplicate identifiers, we can check for duplicates like this: (https://stackoverflow.com/questions/14770170/how-to-find-mongo-documents-with-a-same-field):
db.mycollection.aggregate(
{ $group: {
// Group by fields to match on (a,b)
_id: { a: "$a", b: "$b" },
// Count number of matching docs for the group
count: { $sum: 1 },
// Save the _id for matching docs
docs: { $push: "$_id" }
}},
// Limit results to duplicates (more than 1 match)
{ $match: {
count: { $gt : 1 }
}}
)
The query to check duplicates returns nothing:
db.mycollection.aggregate({ "$group": { _id: { "identifier": "$identifier.value" }, count: {$sum: 1}, docs: {$push: "$_id"}}}, {$match: {count: { $gt: 1}}})
However, there ARE duplicates:
better_batch> db["ExaminationRecord"].find({"identifier.value": "ExaminationRecord/99"})
[
{
_id: ObjectId('666bebd1cf94003d0a36a619'),
basedOn: { reference: 'Sample/20LD042587', type: 'Sample' },
instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
subject: { reference: 'Patient/4.2165648176126e+18', type: 'Patient' },
recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
identifier: { value: 'ExaminationRecord/99' },
insertedAt: '06/16/2024, 09:05:53',
resourceType: 'ExaminationRecord',
value: 374
},
{
_id: ObjectId('666beccccf94003d0a36a762'),
subject: { reference: 'Patient/6.73808832627831e+18', type: 'Patient' },
instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
basedOn: { reference: 'Sample/20LD811192', type: 'Sample' },
identifier: { value: 'ExaminationRecord/99' },
insertedAt: '06/23/2024, 09:10:04',
resourceType: 'ExaminationRecord',
value: '0,32'
},
{
_id: ObjectId('666becdbcf94003d0a36a874'),
subject: { reference: 'Patient/-5.74936744641016e+18', type: 'Patient' },
recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
basedOn: { reference: 'Sample/20LD811196', type: 'Sample' },
identifier: { value: 'ExaminationRecord/99' },
insertedAt: '06/16/2024, 09:10:19',
resourceType: 'ExaminationRecord',
value: 242
}
]
I found the suitable query (https://stackoverflow.com/a/26985011):
db["ExaminationRecord"].aggregate([
{ "$group": { "_id": "$identifier", "count": { "$sum": 1 } } },
{ "$match": { "_id": { "$ne": null }, "count": { "$gt": 1 } } },
{ "$project": { "name": "$_id", "_id": 0 } }
])
It does identify the duplicates when the identiier is simple (a str) because it returns several lines, including the ExaminationRecord/99
. However, it returns nothing with the complex identifier ({ "identifier": { "value" : "A/1" }}
); so I don't know this is whether I fixed the resource counter or the query does not find duplicates for complex variables...
1
. In that case, the query indeed reports that ExaminationRecord/1
is a duplicate. So, we should be god here, i.e., we should be able to identify duplicates if they exist.The solution with the Counter (https://stackoverflow.com/questions/1045344/how-do-you-create-an-incremental-id-in-a-python-class) works well when running the code, but I have troubles in tests (probably because this is a static integer, thus is not reset appropriately).
Instead of clearing the Counter before each test (which would be a pain), I prefer to make it non-static (and this will also simply our lives if we develop further scripts/web pages, etc).
I may follow the model used in ConnectionLens:
RelationalGraph
, which has a nodeIdcounter
variable (non static)nodeIdCounter
when parsing the file and generating the nodes. Similarly, I could:
Done and that woks well!!
Merged at https://github.com/Nelly-Barret/BETTER-fairificator/commit/ac52259ec23e9642970b86047bd1b570d91d1847