UCL-Chimera / bellerophon

Define the CCHIC data format #25

Closed docsteveharris closed 1 year ago

docsteveharris commented 1 year ago

As per note from Sarah

Option 1: GIANT CSV

    Patient  Visit  Age  Gender  Ethnicity  DateTime_measurement  Temp  DateTime_meas     pH
    1        1      80   M       W          2023-01-20 14:00      37    2023-01-20 14:00  6
    1        1      80   M       W          2023-01-21 14:00      37    2023-01-23 12:00  5
    1        1      80   M       W          2023-01-22 14:00      37
    2        1      60   F       B          2023-01-19 14:00      37    2023-01-19 09:00  6

Option 2: CSV PER TYPE

  e.g. temperature.csv

    Patient  Visit  Age  Gender  Ethnicity  DateTime_measurement  Temperature
    1        1      80   M       W          2023-01-20 14:00      37
    1        1      80   M       W          2023-01-21 14:00      37
    1        1      80   M       W          2023-01-22 14:00      37
    2        1      60   F       B          2023-01-19 14:00      37

  pH.csv

    Patient  Visit  Age  Gender  Ethnicity  DateTime_measurement  pH
    1        1      80   M       W          2023-01-20 14:00      6
    1        1      80   M       W          2023-01-23 12:00      5
    2        1      60   F       B          2023-01-19 09:00      6

  ...

Option 3: FOLDER PER PATIENT WITH

  patient.csv

    Patient  Visit  Age  Gender  Ethnicity
    1        1      80   M       W

  temperature.csv

    DateTime_measurement  Temperature
    2023-01-20 14:00      37
    2023-01-21 14:00      37
    2023-01-22 14:00      37

  pH.csv

    DateTime_measurement  pH
    2023-01-20 14:00      6
    2023-01-23 12:00      5

docsteveharris commented 1 year ago

I'd recommend a single giant file, since it saves the user having to coordinate and manage different files. Perhaps not CSV though. Most people I know seem to be using Parquet now? It plays nicely with Spark/R/Python etc., and it should save us tripping up over typing issues.
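
To illustrate the typing concern, here is a minimal sketch using only the Python standard library. It shows that a CSV read returns every field as a string, so each reader has to re-declare the types by hand. The column names follow the Option 2 table above; the `SCHEMA` mapping is a hypothetical helper for illustration, not part of any agreed CCHIC format.

```python
import csv
import io
from datetime import datetime

# Two rows in the shape of temperature.csv from Option 2.
raw = (
    "Patient,Visit,Age,Gender,Ethnicity,DateTime_measurement,Temperature\n"
    "1,1,80,M,W,2023-01-20 14:00,37\n"
    "2,1,60,F,B,2023-01-19 14:00,37\n"
)

rows = list(csv.DictReader(io.StringIO(raw)))

# CSV carries no type information: every value comes back as str.
all_strings = all(isinstance(v, str) for row in rows for v in row.values())

# Hypothetical schema that each reader would have to supply to recover types.
SCHEMA = {
    "Patient": int,
    "Visit": int,
    "Age": int,
    "Gender": str,
    "Ethnicity": str,
    "DateTime_measurement": lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M"),
    "Temperature": float,
}

typed_rows = [{k: SCHEMA[k](v) for k, v in row.items()} for row in rows]
```

A format such as Parquet stores the column types in the file itself, so this conversion step would not need to be repeated by every consumer.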

clairejblack commented 1 year ago

One big file is better than separate ones, but perhaps one row per observation, rather than one column per observation?

skeating commented 1 year ago

So:

    Meas_Type  Patient  Visit  Age  Gender  Ethnicity  DateTime_measurement  Value
    Temp       1        1      80   M       W          2023-01-20 14:00      37
    Temp       1        1      80   M       W          2023-01-21 14:00      37
    Temp       1        1      80   M       W          2023-01-22 14:00      37
    pH         1        1      80   M       W          2023-01-20 14:00      6
    pH         1        1      80   M       W          2023-01-23 12:00      5
    Temp       2        1      60   F       B          2023-01-19 14:00      37
    pH         2        1      60   F       B          2023-01-19 09:00      6

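
For anyone wanting to see how the one-row-per-observation layout relates to Option 1, the following is a rough sketch of the reshaping in plain Python. It assumes the wide file is comma-separated with the header names from the Option 1 table; the `MEASUREMENTS` mapping and the `wide_to_long` function are hypothetical names introduced for illustration.

```python
import csv
import io

# The Option 1 example rows, written out as comma-separated text (assumed layout).
WIDE = """Patient,Visit,Age,Gender,Ethnicity,DateTime_measurement,Temp,DateTime_meas,pH
1,1,80,M,W,2023-01-20 14:00,37,2023-01-20 14:00,6
1,1,80,M,W,2023-01-21 14:00,37,2023-01-23 12:00,5
1,1,80,M,W,2023-01-22 14:00,37,,
2,1,60,F,B,2023-01-19 14:00,37,2023-01-19 09:00,6
"""

ID_COLS = ["Patient", "Visit", "Age", "Gender", "Ethnicity"]

# (timestamp column, value column, Meas_Type label) for each wide measurement.
MEASUREMENTS = [
    ("DateTime_measurement", "Temp", "Temp"),
    ("DateTime_meas", "pH", "pH"),
]


def wide_to_long(wide_csv):
    """Emit one output row per non-empty measurement in each wide row."""
    long_rows = []
    for row in csv.DictReader(io.StringIO(wide_csv)):
        for time_col, value_col, label in MEASUREMENTS:
            if not row[value_col]:
                continue  # skip measurements missing from this wide row
            long_row = {"Meas_Type": label}
            long_row.update({c: row[c] for c in ID_COLS})
            long_row["DateTime_measurement"] = row[time_col]
            long_row["Value"] = row[value_col]
            long_rows.append(long_row)
    return long_rows


long_rows = wide_to_long(WIDE)
```

The result holds the same seven observations as the table above, although in a different row order, and the empty pH cells from the wide layout simply disappear.
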
docsteveharris commented 1 year ago

Presumably we'll need the standard arrangement of a separate column for string/numeric/datetime measurements etc. (as per visit_observation in EMAP or any of the OMOP tables).
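
A rough sketch of what that arrangement could look like in Python: each raw value is routed into exactly one typed column and the others are left empty. The column names `value_numeric`, `value_datetime` and `value_string` are placeholders chosen for illustration, not the actual EMAP or OMOP column names, and the timestamp format is assumed from the examples in this thread.

```python
from datetime import datetime


def typed_value(raw):
    """Fill exactly one of three hypothetical typed value columns."""
    out = {"value_numeric": None, "value_datetime": None, "value_string": None}
    try:
        out["value_numeric"] = float(raw)
        return out
    except ValueError:
        pass  # not a number, try a timestamp next
    try:
        out["value_datetime"] = datetime.strptime(raw, "%Y-%m-%d %H:%M")
        return out
    except ValueError:
        pass  # not a timestamp either, keep it as free text
    out["value_string"] = raw
    return out


numeric = typed_value("37")
stamp = typed_value("2023-01-20 14:00")
text = typed_value("clear")  # hypothetical free-text observation
```

In practice the type would more likely be declared per measurement type than guessed per value, but the output shape is the same.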

docsteveharris commented 1 year ago

closing; will plan to move a readme with the data when it arrives in the DSH