FINNGEN / kanta_lab_preprocessing

Repo for kanta lab QC
MIT License

Data columns #5

Open piotor87 opened 3 months ago

piotor87 commented 3 months ago

In the config file we specify which columns to include, how to rename them and what the output names should be.

Some of them are simple one-to-one mappings, while others need to be manipulated and possibly combined (e.g. lab IDs).

As per the README, my current understanding of the output format is as follows:

| # | Column Name | Easy Description | General Notes | Technical Notes |
|---|---|---|---|---|
| 1 | FINREGISTRYID | Pseudonymized IDs | | |
| 2 | LAB_DATE_TIME | Date and time of lab measurement | | |
| 3 | LAB_SERVICE_PROVIDER | Service provider | This is NOT the lab performing the test but the service provider that ordered the test; see Section Important Notes for more details. | The original data uses OIDs (OID-yksilöintitunnukset). These were mapped to a readable string based on the city where the service provider is registered, i.e. HUS is mapped to Helsinki_1301. The map is in the data folder and based on THL - SOTE-organisaatiorekisteri 2008. |
| 4 | LAB_ID | National (THL) or local lab ID of the measurement | | |
| 5 | LAB_ID_SOURCE | Source of the lab ID | 0: local and 1: national (THL) | |
| 6 | LAB_ABBREVIATION | Laboratory abbreviation of the measurement from the data (local) or mapped using the THL map (national) | | The map for the national (THL) IDs is in the data folder, and was downloaded from Kuntaliitto - Laboratoriotutkimusnimikkeistö |
| 7 | LAB_VALUE | The value of the laboratory measurement | | |
| 8 | LAB_UNIT | The unit of the laboratory measurement from the data | | |
| 9 | LAB_ABNORMALITY | Abnormality of the lab measurement | Describes whether the test result is normal or abnormal, i.e. too high or too low, based on the laboratory's reference values. This is not a quality-control variable; put simply (and inaccurately), it denotes whether the patient is healthy or not. See AR/LABRA - Poikkeustilanneviestit for the meanings of the abbreviations. | The column contains a lot of missingness. |
| 10 | MEASUREMENT_STATUS | The measurement status | The final data contains only C - corrected results or F - final result | See Koodistopalvelu - AR/LABRA - Tutkimusvastauksien tulkintakoodit 1997 |
| 11 | REFERENCE_VALUE_TEXT | The reference values for the measurement in text form | This can be used to define the lab abnormality with better coverage using regular expressions (to be implemented for the whole data). | |

Some of them are direct mappings, while others (LAB_ID, LAB_ID_SOURCE) are generated as described in the notes above, so the mapping is as follows:

| Column in raw file | Column in clean file | Comment |
|---|---|---|
| potilashenkilotunnus | FINREGISTRYID | |
| tutkimusaika | LAB_DATE_TIME | |
| palvelutuottaja_organisaatio | LAB_SERVICE_PROVIDER | |
| paikallinentutkimusnimikeid,laboratoriotutkimusnimikeid | LAB_ID | |
| paikallinentutkimusnimikeid,laboratoriotutkimusnimikeid | LAB_ID_SOURCE | |
| paikallinentutkimusnimike (ONLY IF LOCAL) | LAB_ABBREVIATION | |
| tutkimustulosarvo | LAB_VALUE | |
| tutkimustulosyksikko | LAB_UNIT | |
| tuloksenpoikkeavuus | LAB_ABNORMALITY | |
| viitevaliteksti | REFERENCE_VALUE_TEXT | |
| tutkimusvastauksentila | MEASUREMENT_STATUS | |
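
To make the mapping above concrete, here is a minimal Python sketch of how one raw row could be turned into a clean row. It is not the repo's implementation: the function name `clean_row`, the `MISSING` markers, and the rule that the national ID takes precedence whenever it is present are all assumptions for illustration. The THL lookup for national abbreviations is left out.

```python
# Hypothetical sketch, not the repo's code. Assumptions: rows are dicts of
# strings, "" and "NA" mark missing values, and the national (THL) ID wins
# over the local ID whenever it is present.
RENAME = {
    "potilashenkilotunnus": "FINREGISTRYID",
    "tutkimusaika": "LAB_DATE_TIME",
    "palvelutuottaja_organisaatio": "LAB_SERVICE_PROVIDER",
    "tutkimustulosarvo": "LAB_VALUE",
    "tutkimustulosyksikko": "LAB_UNIT",
    "tuloksenpoikkeavuus": "LAB_ABNORMALITY",
    "viitevaliteksti": "REFERENCE_VALUE_TEXT",
    "tutkimusvastauksentila": "MEASUREMENT_STATUS",
}

MISSING = {"", "NA"}  # assumed missing-value markers


def clean_row(raw: dict) -> dict:
    """Rename the one-to-one columns and derive the combined lab ID columns."""
    out = {new: raw.get(old, "") for old, new in RENAME.items()}

    national = raw.get("laboratoriotutkimusnimikeid", "").strip()
    local = raw.get("paikallinentutkimusnimikeid", "").strip()

    if national not in MISSING:
        out["LAB_ID"] = national
        out["LAB_ID_SOURCE"] = "1"  # 1: national (THL)
        out["LAB_ABBREVIATION"] = ""  # would come from the THL map (not shown)
    else:
        out["LAB_ID"] = local
        out["LAB_ID_SOURCE"] = "0"  # 0: local
        out["LAB_ABBREVIATION"] = raw.get("paikallinentutkimusnimike", "")
    return out
```

Keeping the one-to-one renames in a single dictionary mirrors what the config file is meant to express, so only the combined columns need dedicated logic.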

Also, hetu_root is needed as an input column: rows where it differs from 1.2.246.21 carry manually assigned hetus, so it allows us to filter those out.
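
The hetu_root filter could look roughly like the sketch below. The helper name `has_official_hetu` and the sample rows are made up for illustration; only the root value 1.2.246.21 comes from this issue.

```python
# Hypothetical sketch: keep only rows whose hetu_root is the expected root.
VALID_HETU_ROOT = "1.2.246.21"


def has_official_hetu(raw: dict) -> bool:
    """True if the row's hetu_root matches the expected root exactly."""
    return raw.get("hetu_root", "").strip() == VALID_HETU_ROOT


# Made-up sample rows; the second root is a placeholder, not a real OID.
rows = [
    {"potilashenkilotunnus": "A", "hetu_root": "1.2.246.21"},
    {"potilashenkilotunnus": "B", "hetu_root": "9.9.9.9"},
    {"potilashenkilotunnus": "C"},  # missing hetu_root is treated as invalid
]
kept = [r for r in rows if has_official_hetu(r)]
```

Treating a missing hetu_root as invalid is a choice made here for the sketch; whether such rows should be dropped or kept is still to be decided.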

Open question: which other columns would be of interest? Assuming our data exactly matches Kira's in FinRegistry, the format should be:

  1. laboratoriotutkimusoid - Laboratory test OID
  2. asiakirjaoid - Document OID
  3. merkintaoid - Note OID
  4. entryoid - Entry OID
  5. potilashenkilotunnus - Finregistry ID
  6. palvelutapahtumatunnus - Service event ID
  7. tutkimuksennaytelaatu - Measurement
  8. tutkimuksentekotapa - Test method
  9. potilassyntymaaika_pvm - Patient birth date
  10. potilas_sukupuoli - Patient sex
  11. labooratoriotutkimusoid - Laboratory test OID
  12. tutkimusaika - Test time
  13. alkuperainenasiakirjaoid - Original document OID
  14. asiakirjaversio - Document version
  15. rekisterinpitaja_organisaatio_h - Registry controller organisation ID
  16. rekisterinpitaja_h - Registry organisation ID
  17. asiakirjavalistilapk - Document status
  18. marittelykokoelmaoid - Collection OID
  19. tietojarjestelanimi - Data system name
  20. tietojarjestelavalmistaja - Data system manufacturer
  21. tietojarjestelaversio - Data system version
  22. asiajirjaluontiaika - Document creation time
  23. pal_alkuperainenasiakirjaoid - Original document OID
  24. pal_asiakirjaversio - Document version
  25. pal_ariakirjaoid - Document OID
  26. pal_asiakirjaversio - Document version
  27. rekisterinpitaja_organisaatio - Registry controller organisation ID
  28. rekisterinpitaja - Registry organisation ID
  29. palveluntuottaja_organisaatio - Service provider organisation ID
  30. palveluisanta_organisaatio - Service host organisation ID
  31. hetu_root - Finregistry ID root
  32. paikallinentutkimusnimike - Local test name
  33. paikallinentutkimusnimikeid - Local test name ID
  34. tutkimuskoodistonjarjestelmaid - Test code system ID
  35. tutkimuksenvastauksentila - Test result status
  36. tutkimustulosarvo - Test result value
  37. tutkimustulosyksikkö - Test result unit
  38. tuloksenpoikkeavuus - Test result abnormality
  39. tuloksenvalmistumisaika - Test result time
  40. viitearvoryhma - Reference value group
  41. viitevalialkuarvo - Reference value lower limit
  42. viitevalialkuarvoyksikko - Reference value lower limit unit
  43. viitevaliloppuarvo - Reference value upper limit
  44. viitevaliloppuarvoyksikko - Reference value upper limit unit
  45. viitearvoteksti - Reference value text
  46. erikoisalalyhenne - Speciality abbreviation

piotor87 commented 3 months ago

@vincent-octo this should sum up all we discussed. Feel free to tag others in the convo.

vincent-octo commented 3 months ago

This looks great, thanks! We will probably get some feedback on it in the coming weeks.