Nelly-Barret / BETTER-fairificator

The fairification tools for BETTER project.
https://www.better-health-project.eu/
0 stars 0 forks source link

Date conversion #58

Closed Nelly-Barret closed 1 week ago

Nelly-Barret commented 1 week ago

Following https://github.com/Nelly-Barret/BETTER-fairificator/issues/49, I am thinking of converting dates to "objects", so that Mongo can sort it correctly (instead of by alphabetical order).

To be investigated.

Nelly-Barret commented 1 week ago

I create a sample table with four dates. We can see that the sort operation (ascending and descending) does it by reflecting the (ascending or descneding) order of the string, not of the date:

test_dates> db["mytable"].find({}).sort({"mydate": 1})
[
  { _id: ObjectId('6670336ccbf3607fe2264bd3'), mydate: '01/10/2022' },
  { _id: ObjectId('66703333cbf3607fe2264bd2'), mydate: '11/10/2025' },
  { _id: ObjectId('6670331acbf3607fe2264bd0'), mydate: '21/09/2024' },
  { _id: ObjectId('6670331fcbf3607fe2264bd1'), mydate: '21/09/2025' }
]
test_dates> db["mytable"].find({}).sort({"mydate": -1})
[
  { _id: ObjectId('6670331fcbf3607fe2264bd1'), mydate: '21/09/2025' },
  { _id: ObjectId('6670331acbf3607fe2264bd0'), mydate: '21/09/2024' },
  { _id: ObjectId('66703333cbf3607fe2264bd2'), mydate: '11/10/2025' },
  { _id: ObjectId('6670336ccbf3607fe2264bd3'), mydate: '01/10/2022' }
]
Nelly-Barret commented 1 week ago

I found that I need to use Date objects to be able to sort by date (see https://www.geeksforgeeks.org/how-to-sort-a-collection-by-date-in-mongodb/)

Dates are in US format! To be set within MongoDB with the ETL locale

db["ttable"].insertOne({"mydate": new Date("09/21/2024")})
db["ttable"].insertOne({"mydate": new Date("09/21/2025")})
db["ttable"].insertOne({"mydate": new Date("10/11/2025")})
db["ttable"].insertOne({"mydate": new Date("10/01/2022")})

And an descending sort leads to:

test_dates> db["ttable"].find({}).sort({"mydate": 1})
[
  {
    _id: ObjectId('66703601cbf3607fe2264be0'),
    mydate: ISODate('2022-09-30T22:00:00.000Z')
  },
  {
    _id: ObjectId('667035e0cbf3607fe2264bdd'),
    mydate: ISODate('2024-09-20T22:00:00.000Z')
  },
  {
    _id: ObjectId('667035f1cbf3607fe2264bde'),
    mydate: ISODate('2025-09-20T22:00:00.000Z')
  },
  {
    _id: ObjectId('667035f9cbf3607fe2264bdf'),
    mydate: ISODate('2025-10-10T22:00:00.000Z')
  }
]

And a descending sort leads to:

test_dates> db["ttable"].find({}).sort({"mydate": -1})
[
  {
    _id: ObjectId('667035f9cbf3607fe2264bdf'),
    mydate: ISODate('2025-10-10T22:00:00.000Z')
  },
  {
    _id: ObjectId('667035f1cbf3607fe2264bde'),
    mydate: ISODate('2025-09-20T22:00:00.000Z')
  },
  {
    _id: ObjectId('667035e0cbf3607fe2264bdd'),
    mydate: ISODate('2024-09-20T22:00:00.000Z')
  },
  {
    _id: ObjectId('66703601cbf3607fe2264be0'),
    mydate: ISODate('2022-09-30T22:00:00.000Z')
  }
]

Note that a sort with no specified column will sort on _id.

Nelly-Barret commented 1 week ago

Apaprently, we can simply use an instance of datetime with PyMongo to be sure that the date is a date object and not a simple string.

Source: https://stackoverflow.com/questions/7651064/create-an-isodate-with-pymongo

Nelly-Barret commented 1 week ago

The problem of using strptime (from datetime module) is that it needs to initial date format (such as %Y-%M-%D) to know how to convert it 😞

https://www.freecodecamp.org/news/python-string-to-datetime-how-to-convert-an-str-to-a-date-time-with-strptime/

Let's see how pymongo manages datetime objects created from the ETL (without speciying a certain format, which we don't know in advance).

Nelly-Barret commented 1 week ago

From Mongosh, one can easily create a Date field with:

db["ttable"].insertOne({"mydate": new Date("10/11/2025")})

But, in our ETL, it is more complex because we use an intermediary JSON file, which does not recognize the new Date() constructor when it is not in a string (and treats it as a string and not as a date if it is in a string).

https://stackoverflow.com/a/13338551 says that we can write the date "the same way MongoDB does internally", so that it can be recognized as a Date:

{
   "S": "someString",
   "N": 123,
   "F": 12.3,
   "D": {"$date": 1352540684243}
}

where D contains the number of seconds elapsed since the initial UTC time.

For an easier reading, we can add some sugar around this and obtain a MongoDB-like date within a string:

import json, dateutil.parser, bson.json_util

a = """{
    "mydate": {"$isodate": "2012-11-01T20:19:55.782Z"}
}"""

def my_hook(dct):
    if '$isodate' in dct:
        return dateutil.parser.parse(dct['$isodate'])
    return bson.json_util.object_hook(dct)

obj = json.loads(a, object_hook=my_hook)

To be checked: does the sort still recognize does strings as dates?

From https://stackoverflow.com/a/53244694:

It seems that if we use:

{
  "createdAt": { "$date": "2018-11-10T22:26:12.111Z" }
}

This is internally recognized by MongoDB. Let's see!

Nelly-Barret commented 1 week ago

Okay, good! 🥳

Starting from this JSON sample:

[{
    "name": "Alice",
    "born": {"$date": "2018-11-10T22:26:12.111Z"}
}, {
    "name": "Bob", 
    "born": {"$date": "2019-12-04T21:10:14.111Z"}
}, {
    "name": "Carole", 
    "born": {"$date": "2017-03-05T22:26:12.111Z"}
}, {
    "name": "David", 
    "born": {"$date": "2019-12-04T21:07:07.111Z"}
}]

I could insrt it within MongoDB with:

❯ mongoimport -d mydb -c mycollection --jsonArray --type json BETTER-fairificator/datasets/test_dates/data.json
2024-06-17T17:21:01.651+0200    connected to: mongodb://localhost/
2024-06-17T17:21:01.759+0200    4 document(s) imported successfully. 0 document(s) failed to import.

And then the imported data contains the ISO dates as expected:

mydb> db["mycollection"].find()
[
  {
    _id: ObjectId('6670545dba913a2fda4b4009'),
    name: 'Carole',
    born: ISODate('2017-03-05T22:26:12.111Z')
  },
  {
    _id: ObjectId('6670545dba913a2fda4b400a'),
    name: 'David',
    born: ISODate('2019-12-04T21:07:07.111Z')
  },
  {
    _id: ObjectId('6670545dba913a2fda4b400b'),
    name: 'Alice',
    born: ISODate('2018-11-10T22:26:12.111Z')
  },
  {
    _id: ObjectId('6670545dba913a2fda4b400c'),
    name: 'Bob',
    born: ISODate('2019-12-04T21:10:14.111Z')
  }
]

So I can do the same in the ETL!

Nelly-Barret commented 1 week ago

Almost there; I think the ISO format that I use in Python is not good (missing the Z at the end), this is why MongoDB does not recognise it well:

better_batch> db["ExaminationRecord"].find({"identifier.value": "ExaminationRecord/823"})
[
  {
    _id: ObjectId('6670598e96d8fcc097255897'),
    subject: { reference: 'Patient/-5.74936744641016e+18', type: 'Patient' },
    basedOn: { reference: 'Sample/20LD811196', type: 'Sample' },
    instantiate: { reference: { value: 'Examination/72' }, type: 'Examination' },
    recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
    identifier: { value: 'ExaminationRecord/823' },
    insertedAt: { '$date': '2024-06-26T17:43:10.269558' },
    resourceType: 'ExaminationRecord',
    value: { '$date': '2021-12-15T00:00:00' }
  }
]
Nelly-Barret commented 1 week ago

It is known that Python datetime does not include timezone information, thus the missing Z at the end o the string (stating that we are talking about UTC time): https://stackoverflow.com/questions/19654578/python-utc-datetime-objects-iso-format-doesnt-include-z-zulu-or-zero-offset

  1. One workaround is to append a Z at the end of the string
  2. Or to use strftime but this requires to specify the initial date format.

We can use 2. because datetime always format the string in the same format (the ISO one).

Nelly-Barret commented 1 week ago

I coded option 2 (see above) but I still have the { "$date": "..." } instead of an ISODate when looking at the inserted instances:

better_batch> db["ExaminationRecord"].find({"identifier.value": "ExaminationRecord/823"})
[
  {
    _id: ObjectId('6671457a96d8fcc097257a93'),
    basedOn: { reference: 'Sample/20LD811196', type: 'Sample' },
    instantiate: { reference: { value: 'Examination/72' }, type: 'Examination' },
    recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
    subject: { reference: 'Patient/-5.74936744641016e+18', type: 'Patient' },
    identifier: { value: 'ExaminationRecord/823' },
    insertedAt: { '$date': '2024-06-21T10:29:46Z' },
    resourceType: 'ExaminationRecord',
    value: { '$date': '2021-12-15T00:00:00Z' }
  }
]

One of the (test) resource which could be inserted successfully as an ISODate looked like this:

{
    "name": "David", 
    "born": {"$date": "2019-12-04T21:07:07.111Z"}
}

If I try to insert manually the above ExaminationRecord resource instance, the date is not interpreted correctly. Similarly, if I insert within mongosh one JSON object of the test file (with the four people and that I wrote above), the date is not interpreted:

mydb> db["mycollection"].insertOne({
...     "name": "Alice",
...     "born": {"$date": "2018-11-10T22:26:12.111Z"}
... })

  {
    _id: ObjectId('667147511235ad6658c836f1'),
    name: 'Alice',
    born: { '$date': '2018-11-10T22:26:12.111Z' }
  }

Therefore, this is probably the mongo import which interprets the dates

Nelly-Barret commented 1 week ago

If I import the JSON files with mongoimport in a new table, the dates are correctly interpreted as dates 🤔

[
  {
    _id: ObjectId('66714ddadb0fabbbff1ddde9'),
    identifier: { value: 'ExaminationRecord/823' },
    resourceType: 'ExaminationRecord',
    value: ISODate('2021-12-15T00:00:00.000Z'),
    subject: { reference: 'Patient/-5.74936744641016e+18', type: 'Patient' },
    recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
    instantiate: { reference: { value: 'Examination/72' }, type: 'Examination' },
    basedOn: { reference: 'Sample/20LD811196', type: 'Sample' }
  }
]

But if I insert it by hand, it is not recognised as an ISODate 👹

[
  {
    _id: ObjectId('66714e9e1235ad6658c836f6'),
    identifier: { value: 'ExaminationRecord/823' },
    resourceType: 'ExaminationRecord',
    value: { '$date': '2021-12-15T00:00:00Z' },
    subject: { reference: 'Patient/-5.74936744641016e+18', type: 'Patient' },
    recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
    instantiate: { reference: { value: 'Examination/72' }, type: 'Examination' },
    basedOn: { reference: 'Sample/20LD811196', type: 'Sample' }
  }
]
Nelly-Barret commented 1 week ago

If I insert one tuple with datetime format from Python (without writing it to json), it works:

>>> from pymongo.mongo_client import MongoClient
>>> import datetime
>>> d = datetime.datetime.strptime("2017-10-13T10:53:53.000Z", "%Y-%m-%dT%H:%M:%S.000Z")
>>> with MongoClient() as mongo:
...   db = mongo.get_database("test")
...   db['dates'].insert_one({"date" : d})
...
InsertOneResult(ObjectId('6671504f23027456542e0a51'), acknowledged=True)

Shows:

test> db["dates"].find({})
[
  {
    _id: ObjectId('6671504f23027456542e0a51'),
    date: ISODate('2017-10-13T10:53:53.000Z')
  }
]
Nelly-Barret commented 1 week ago

But this:

from pymongo.mongo_client import MongoClient
import datetime
import json
d = datetime.datetime.strptime("2017-10-13T10:53:53.000Z", "%Y-%m-%dT%H:%M:%S.000Z")
with MongoClient() as mongo:
  db = mongo.get_database("test")
  with open("myfile.json", "w") as data_file:
    json.dump([{"date" : d.isoformat() + "Z"}], data_file)
  with open("myfile.json", "r") as data_file:
    tuples = json.load(data_file)
    db["dates"].insert_many(tuples, ordered=False)

Shows:

[
  {
    _id: ObjectId('66715311a10589c0e3b99d2a'),
    date: '2017-10-13T10:53:53Z'
  }
]
Nelly-Barret commented 1 week ago

If we insert (from Python, but without a JSON file) my_tuple = {"date" : d} that works, but if we insert my_tuple = {"date" : d.isoformat() + "Z"} that does not work

Nelly-Barret commented 1 week ago

In this post https://www.mongodb.com/community/forums/t/inserting-field-with-date-datetype-is-working-with-mongoimport-and-not-with-insert-one-in-pymongo/205915/2, the workaround is to insert JSON documents directly from Python and do not write them in JSON files

Nelly-Barret commented 1 week ago

As a very last resort, I could transform strings into dates when asking something to MongoDB using the operator dateFromString: https://www.mongodb.com/docs/manual/reference/operator/aggregation/dateFromString/

Nelly-Barret commented 1 week ago

This person ingests his data with mongoimport directly in JS... This works, but that's ugly and does not feel safe: https://stackoverflow.com/questions/44198126/insert-a-date-object-from-json-file-in-mongodb

Nelly-Barret commented 1 week ago

A single post to StackOverflow and a guy gave me the answer: https://stackoverflow.com/questions/78637529/isodates-in-json-file-for-mongodb

We need to use bson instead of json.

That works!

Nelly-Barret commented 1 week ago

Merged with https://github.com/Nelly-Barret/BETTER-fairificator/pull/63