NDJSON test data doesn't contain variable names

nicholas-masel commented 2 months ago

The NDJSON data doesn't contain variable names in each row, only values.

For example:

With variable names: {"name": "Leandro","lastName": "Shokida"} {"name": "Mariano","lastName": "De Achaval"}

Without variable names: {"Leandro", "Shokida"} {"Mariano", "De Achaval"}

From what I can tell, we can either:

- Update the NDJSON test data so that each row contains variable names. We can then stream this directly to a data frame.
- If this is intentional, read the rows in as a list of lists, bind the rows, and convert to a data frame.
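
A minimal sketch of the second option, in base R only. The `rows` and `var_names` objects are made up for illustration: they stand in for the parsed value-only rows and for variable names obtained from somewhere else, and nothing here is taken from the package itself.

```r
# Hypothetical rows, as they might look after parsing value-only NDJSON lines
rows <- list(
  list("Leandro", "Shokida"),
  list("Mariano", "De Achaval")
)

# Assumed: the variable names are known from elsewhere
var_names <- c("name", "lastName")

# Turn each row into a one-row data frame, then bind the rows together
df <- do.call(
  rbind,
  lapply(rows, function(r) {
    as.data.frame(setNames(r, var_names), stringsAsFactors = FALSE)
  })
)

print(df)
```

Binding one-row data frames is simple but slow for large files; building columns directly would likely scale better.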

nicholas-masel commented 2 months ago

@mstackhouse Do you know, or could you check with Sam or Lex, whether the NDJSON test data is valid?

mstackhouse commented 2 months ago

@nicholas-masel are you talking about the row-level data itself, i.e. the data records, or about the variable-level metadata? The same is true of the non-NDJSON data too:

From here

{
  "datasetJSONCreationDateTime": "2023-06-28T15:38:43",
  "datasetJSONVersion": "1.1.0",
  "fileOID": "www.sponsor.xyz.org.project123.final",
  "dbLastModifiedDateTime": "2023-05-31T00:00:00",
  "originator": "Sponsor XYZ",
  "sourceSystem": {
      "name": "Software ABC",
      "version": "1.0.0"
  },
  "studyOID": "cdisc.com.CDISCPILOT01",
  "metaDataVersionOID": "MDV.MSGv2.0.SDTMIG.3.3.SDTM.1.7",
  "metaDataRef": "https://metadata.location.org/CDISCPILOT01/define.xml",
  "itemGroupOID": "IG.DM",
  "isReferenceData": false,
  "records": 18,
  "name": "DM",
  "label": "Demographics",
  "columns": [
      {"itemOID": "ITEMGROUPDATASEQ", "name": "ITEMGROUPDATASEQ", "label": "Record Identifier", "dataType": "integer"},
      {"itemOID": "IT.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12, "keySequence": 1},
      {"itemOID": "IT.DOMAIN", "name": "DOMAIN", "label": "Domain Abbreviation", "dataType": "string", "length": 2},
      {"itemOID": "IT.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 8, "keySequence": 2},
      {"itemOID": "IT.AGE", "name": "AGE", "label": "Age", "dataType": "integer"}
  ],
  "rows": [
      [1, "CDISCPILOT01", "DM", "CDISC001", 84],
      [2, "CDISCPILOT01", "DM", "CDISC002", 76],
      [3, "CDISCPILOT01", "DM", "CDISC003", 61],
      ...
  ]
}

The only change for NDJSON is that the elements of rows are instead on their own lines of the file:

{
  "datasetJSONCreationDateTime": "2023-06-28T15:38:43",
  "datasetJSONVersion": "1.1.0",
  "fileOID": "www.sponsor.xyz.org.project123.final",
  "dbLastModifiedDateTime": "2023-05-31T00:00:00",
  "originator": "Sponsor XYZ",
  "sourceSystem": {
      "name": "Software ABC",
      "version": "1.0.0"
  },
  "studyOID": "cdisc.com.CDISCPILOT01",
  "metaDataVersionOID": "MDV.MSGv2.0.SDTMIG.3.3.SDTM.1.7",
  "metaDataRef": "https://metadata.location.org/CDISCPILOT01/define.xml",
  "itemGroupOID": "IG.DM",
  "isReferenceData": false,
  "records": 18,
  "name": "DM",
  "label": "Demographics",
  "columns": [
      {"itemOID": "ITEMGROUPDATASEQ", "name": "ITEMGROUPDATASEQ", "label": "Record Identifier", "dataType": "integer"},
      {"itemOID": "IT.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12, "keySequence": 1},
      {"itemOID": "IT.DOMAIN", "name": "DOMAIN", "label": "Domain Abbreviation", "dataType": "string", "length": 2},
      {"itemOID": "IT.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 8, "keySequence": 2},
      {"itemOID": "IT.AGE", "name": "AGE", "label": "Age", "dataType": "integer"}
  ]
}
[1, "CDISCPILOT01", "DM", "CDISC001", 84]
[2, "CDISCPILOT01", "DM", "CDISC002", 76]
[3, "CDISCPILOT01", "DM", "CDISC003", 61]
...
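
As a rough illustration of this layout (not the package's implementation), the sketch below treats the first line as the metadata and every later line as one row array. It assumes the metadata object occupies a single line in the actual file, uses jsonlite instead of yyjsonr for parsing, and trims the columns to three for brevity. `ndjson_lines` is a made-up stand-in for the file's contents.

```r
library(jsonlite)

# Hypothetical NDJSON content: line 1 is metadata, the rest are row arrays
ndjson_lines <- c(
  '{"name": "DM", "columns": [{"name": "ITEMGROUPDATASEQ", "dataType": "integer"}, {"name": "STUDYID", "dataType": "string"}, {"name": "AGE", "dataType": "integer"}]}',
  '[1, "CDISCPILOT01", 84]',
  '[2, "CDISCPILOT01", 76]'
)

# Parse the metadata line and pull the variable names out of "columns"
meta <- fromJSON(ndjson_lines[1], simplifyVector = FALSE)
col_names <- vapply(meta$columns, function(col) col$name, character(1))

# Parse each remaining line into a list of values
rows <- lapply(ndjson_lines[-1], fromJSON, simplifyVector = FALSE)

# Attach the variable names from the metadata to every row
named_rows <- lapply(rows, setNames, col_names)

str(named_rows[[1]])
```

The variable names are stored once in the metadata and reattached at read time, so nothing is lost by leaving them out of the rows.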

nicholas-masel commented 2 months ago

Yeah, I was talking about variable names on the row-level data. I reached out to Sam and he confirmed that the variable names were left out of the rows to keep the file size down.

I am trying out reading the rows as a list instead of a df. It seems to work, but it causes some type issues downstream that didn't appear when reading directly to a df.

# Skip the first line (the metadata) and read each remaining row as a list
yyjsonr::read_ndjson_str(
  file,
  type = "list",
  nskip = 1,
  opts = json_opts
)
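
The thread does not say which type issues came up. One plausible source, offered only as an assumption, is that list output leaves every column to be assembled and coerced by hand, and a null value in a row has no type of its own. The base R sketch below builds each column from the row lists and coerces it using the `dataType` declared in the metadata. `rows`, `col_names`, `col_types` and `converters` are illustrative names, not part of yyjsonr or datasetjson.

```r
# Hypothetical parsed rows; the third value of the second row is a JSON null
rows <- list(
  list(1L, "CDISCPILOT01", 84L),
  list(2L, "CDISCPILOT01", NULL)
)

# Assumed to come from the "columns" metadata (name and dataType)
col_names <- c("ITEMGROUPDATASEQ", "STUDYID", "AGE")
col_types <- c("integer", "string", "integer")

# Map each declared dataType to an R coercion function (only two shown)
converters <- list(integer = as.integer, string = as.character)

# Build each column: replace NULL with NA, flatten, then coerce
cols <- lapply(seq_along(col_names), function(i) {
  values <- lapply(rows, function(r) if (is.null(r[[i]])) NA else r[[i]])
  converters[[col_types[i]]](unlist(values))
})

df <- as.data.frame(setNames(cols, col_names), stringsAsFactors = FALSE)

str(df)
```

Coercing from the declared `dataType` keeps column types stable even when a column holds only nulls in the rows that were read.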

atorus-research / datasetjson

NDJSON test data doesn't contain variable names #49