Nelly-Barret / BETTER-fairificator

The fairification tools for BETTER project.
https://www.better-health-project.eu/

Cast int and float values #49

Closed Nelly-Barret closed 1 week ago

Nelly-Barret commented 2 weeks ago

Seen today while investigating #41:

{
    _id: ObjectId('666beccccf94003d0a36a762'),
    subject: { reference: 'Patient/6.73808832627831e+18', type: 'Patient' },
    instantiate: { reference: { value: 'Examination/6' }, type: 'Examination' },
    recordedBy: { reference: { value: 'Hospital/1' }, type: 'Hospital' },
    basedOn: { reference: 'Sample/20LD811192', type: 'Sample' },
    identifier: { value: 'ExaminationRecord/99' },
    insertedAt: '06/23/2024, 09:10:04',
    resourceType: 'ExaminationRecord',
    value: '0,32'
  },

Values containing commas are quoted so that they are read correctly from the CSV. However, they should be converted to int/float values when they are inserted into the database.

Nelly-Barret commented 2 weeks ago

A good test would be to list all values that are not int/float/boolean, to see what remains as strings.
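A minimal sketch of such a check, assuming the records are available as plain Python dicts (for instance after being read back from the database). The function name `iter_string_leaves` and the sample record are illustrative only and are not taken from the repository:

```python
from typing import Any, Iterator

def iter_string_leaves(doc: Any, path: str = "") -> Iterator[tuple[str, str]]:
    """Yield (dotted_path, value) for every leaf that is still a string."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            child = f"{path}.{key}" if path else key
            yield from iter_string_leaves(value, child)
    elif isinstance(doc, list):
        for index, value in enumerate(doc):
            yield from iter_string_leaves(value, f"{path}[{index}]")
    elif isinstance(doc, str):
        yield path, doc
    # int, float, bool and None leaves are skipped: they are already typed.

# Sample record, trimmed from the document shown at the top of this issue.
record = {
    "subject": {"reference": "Patient/6.73808832627831e+18", "type": "Patient"},
    "identifier": {"value": "ExaminationRecord/99"},
    "resourceType": "ExaminationRecord",
    "value": "0,32",
}

remaining = dict(iter_string_leaves(record))
for field_path, text in remaining.items():
    print(f"{field_path}: {text!r}")
```

Any numeric-looking string in the output, such as the `value` field here, is a candidate for casting.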

Nelly-Barret commented 1 week ago

To cast strings to int/float values, we cannot simply do:

def cast_value(my_value):
    try:
        return float(my_value)
    except (TypeError, ValueError):
        # not parseable as a float: keep the original value
        return my_value

because it will not correctly process numbers that are not written using the 🇬🇧 convention, i.e., with a . as the decimal separator and a , as the thousands separator.

Instead, we need to use a locale set to the country the data comes from, e.g., 🇮🇹 for Buzzi, 🇪🇸 for lafe, etc.
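A rough sketch of locale-aware casting. Python's standard `locale` module offers `locale.setlocale` and `locale.atof` for this, but they depend on the locales installed on the host, so this sketch swaps in an explicit separator table instead. The table, the locale names and the `cast_number` function are assumptions made for illustration, not the code that was merged:

```python
import math
from typing import Union

# Hypothetical table: locale name -> (decimal separator, thousands separator).
SEPARATORS = {
    "en_GB": (".", ","),
    "it_IT": (",", "."),
    "es_ES": (",", "."),
}

def cast_number(raw: str, locale_name: str = "en_GB") -> Union[int, float, str]:
    """Cast raw to int or float using the given convention, else return raw."""
    decimal_sep, thousands_sep = SEPARATORS[locale_name]
    # Drop thousands separators first, then normalise the decimal separator.
    normalised = raw.strip().replace(thousands_sep, "").replace(decimal_sep, ".")
    try:
        return int(normalised)
    except ValueError:
        pass
    try:
        number = float(normalised)
    except ValueError:
        return raw  # not a number: keep the original string
    # Reject "nan" / "inf", which float() accepts but are not real measurements.
    return number if math.isfinite(number) else raw

print(cast_number("0,32", "it_IT"))
print(cast_number("1.234,5", "es_ES"))
print(cast_number("20LD811192", "it_IT"))
```

Trying `int` before `float` keeps whole numbers as integers. Identifiers such as `20LD811192` fail both conversions and are returned unchanged.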

I have added the locale setting to the ETL script. I also defined the locale of each medical center; this can be overridden to use the 🇬🇧 convention with the parameter --use_en_locale=True
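A possible shape for the per-center locale and its override, as a sketch only. The `--use_en_locale` parameter comes from the comment above; the `--center` option, the `CENTER_LOCALES` mapping, the locale names and both helper functions are hypothetical and may differ from the merged code:

```python
import argparse

# Hypothetical mapping; locale names are inferred from the flags in this thread.
CENTER_LOCALES = {"buzzi": "it_IT", "lafe": "es_ES"}

def parse_bool(text: str) -> bool:
    """Turn command-line text such as 'True' or 'false' into a bool."""
    lowered = text.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {text!r}")

def resolve_locale(center: str, use_en_locale: bool) -> str:
    """Return the locale used for number parsing for the given center."""
    if use_en_locale:
        return "en_GB"  # the override wins over the center's own locale
    return CENTER_LOCALES[center.lower()]

parser = argparse.ArgumentParser()
parser.add_argument("--center", default="buzzi")  # hypothetical option
parser.add_argument("--use_en_locale", type=parse_bool, default=False)

# An explicit argument list is passed here so the sketch runs without a shell.
args = parser.parse_args(["--center", "lafe", "--use_en_locale=True"])
print(resolve_locale(args.center, args.use_en_locale))
```

An explicit `parse_bool` is used because argparse's `type=bool` would treat any non-empty string, including `False`, as true.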

Nelly-Barret commented 1 week ago

Merged at https://github.com/Nelly-Barret/BETTER-fairificator/commit/30f15a02134ae505e38f4bb9a9a6d0c35188b4c7