Open nicholas-masel opened 2 months ago
@mstackhouse Do you know, or can you check with Sam or Lex, whether the NDJSON test data is valid?
@nicholas-masel are you talking about the row-level data itself (the data records), or about the variable-level metadata? The non-NDJSON data is structured the same way:
From here
{
"datasetJSONCreationDateTime": "2023-06-28T15:38:43",
"datasetJSONVersion": "1.1.0",
"fileOID": "www.sponsor.xyz.org.project123.final",
"dbLastModifiedDateTime": "2023-05-31T00:00:00",
"originator": "Sponsor XYZ",
"sourceSystem": {
"name": "Software ABC",
"version": "1.0.0"
},
"studyOID": "cdisc.com.CDISCPILOT01",
"metaDataVersionOID": "MDV.MSGv2.0.SDTMIG.3.3.SDTM.1.7",
"metaDataRef": "https://metadata.location.org/CDISCPILOT01/define.xml",
"itemGroupOID": "IG.DM",
"isReferenceData": false,
"records": 18,
"name": "DM",
"label": "Demographics",
"columns": [
{"itemOID": "ITEMGROUPDATASEQ", "name": "ITEMGROUPDATASEQ", "label": "Record Identifier", "dataType": "integer"},
{"itemOID": "IT.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12, "keySequence": 1},
{"itemOID": "IT.DOMAIN", "name": "DOMAIN", "label": "Domain Abbreviation", "dataType": "string", "length": 2},
{"itemOID": "IT.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 8, "keySequence": 2},
{"itemOID": "IT.AGE", "name": "AGE", "label": "Age", "dataType": "integer"}
],
"rows": [
[1, "CDISCPILOT01", "DM", "CDISC001", 84],
[2, "CDISCPILOT01", "DM", "CDISC002", 76],
[3, "CDISCPILOT01", "DM", "CDISC003", 61],
...
]
}
The only change for NDJSON is that the elements of rows each go on their own line of the file, after the metadata (shown pretty-printed here for readability):
{
"datasetJSONCreationDateTime": "2023-06-28T15:38:43",
"datasetJSONVersion": "1.1.0",
"fileOID": "www.sponsor.xyz.org.project123.final",
"dbLastModifiedDateTime": "2023-05-31T00:00:00",
"originator": "Sponsor XYZ",
"sourceSystem": {
"name": "Software ABC",
"version": "1.0.0"
},
"studyOID": "cdisc.com.CDISCPILOT01",
"metaDataVersionOID": "MDV.MSGv2.0.SDTMIG.3.3.SDTM.1.7",
"metaDataRef": "https://metadata.location.org/CDISCPILOT01/define.xml",
"itemGroupOID": "IG.DM",
"isReferenceData": false,
"records": 18,
"name": "DM",
"label": "Demographics",
"columns": [
{"itemOID": "ITEMGROUPDATASEQ", "name": "ITEMGROUPDATASEQ", "label": "Record Identifier", "dataType": "integer"},
{"itemOID": "IT.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12, "keySequence": 1},
{"itemOID": "IT.DOMAIN", "name": "DOMAIN", "label": "Domain Abbreviation", "dataType": "string", "length": 2},
{"itemOID": "IT.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 8, "keySequence": 2},
{"itemOID": "IT.AGE", "name": "AGE", "label": "Age", "dataType": "integer"}
]
}
[1, "CDISCPILOT01", "DM", "CDISC001", 84]
[2, "CDISCPILOT01", "DM", "CDISC002", 76]
[3, "CDISCPILOT01", "DM", "CDISC003", 61]
...
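Because the rows are bare arrays, the variable names have to be looked up by position in `columns`. A minimal sketch of that lookup, assuming the metadata occupies the first line of the file; it uses jsonlite rather than yyjsonr purely for illustration, and `ndjson_lines` is hypothetical sample data trimmed down from the example above:

```r
library(jsonlite)

# Hypothetical two-column sample: line 1 is metadata, the rest are rows.
ndjson_lines <- c(
  '{"name": "DM", "columns": [{"name": "USUBJID", "dataType": "string"}, {"name": "AGE", "dataType": "integer"}]}',
  '["CDISC001", 84]',
  '["CDISC002", 76]'
)

# Parse the metadata line without simplification so `columns` stays a list.
meta <- fromJSON(ndjson_lines[1], simplifyVector = FALSE)
col_names <- vapply(meta$columns, function(col) col$name, character(1))

# Parse the first data row and label its values by position.
first_row <- fromJSON(ndjson_lines[2], simplifyVector = FALSE)
names(first_row) <- col_names
str(first_row)
```

The row itself never carries the names; they only exist once it is paired with the metadata line.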
Yeah, I was talking about variable names on the row-level data. I reached out to Sam and he confirmed they were left out to keep the file size down.
I am trying out reading the rows as a list instead of a data frame. It seems to work, but it causes some type issues downstream that didn't appear when reading directly into a data frame.
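yyjsonr::read_ndjson_str(
  file,
  type = "list",   # return each row as a list rather than building a data frame
  nskip = 1,       # skip the first line, which holds the metadata
  opts = json_opts
)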
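One possible way to deal with the downstream type issues is to build each column from the list of rows and coerce it using the `dataType` declared in the metadata. This is only a sketch, not how the package does it: `rows`, `columns`, `coerce`, and `build_column` are hypothetical names, the mapping covers only the two types seen in the example, and representing a JSON null as `NULL` is an assumption about the list output.

```r
# Hypothetical inputs: rows as read with type = "list", columns from the metadata.
rows <- list(
  list(1L, "CDISC001", 84L),
  list(2L, "CDISC002", NULL),  # assumption: NULL stands in for a JSON null
  list(3L, "CDISC003", 61L)
)
columns <- list(
  list(name = "ITEMGROUPDATASEQ", dataType = "integer"),
  list(name = "USUBJID", dataType = "string"),
  list(name = "AGE", dataType = "integer")
)

# Assumed mapping from dataType to an R coercion; extend for other types.
coerce <- list(
  string  = as.character,
  integer = as.integer
)

build_column <- function(j) {
  # Pull the j-th value from every row, turning NULL into NA.
  values <- lapply(rows, function(r) if (is.null(r[[j]])) NA else r[[j]])
  coerce[[columns[[j]]$dataType]](unlist(values))
}

col_names <- vapply(columns, function(col) col$name, character(1))
df <- as.data.frame(
  setNames(lapply(seq_along(columns), build_column), col_names),
  stringsAsFactors = FALSE
)
str(df)
```

Coercing per column keeps a missing value from changing the type of the whole column, which is one plausible source of the mismatch against the direct data frame read.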
The NDJSON data doesn't contain variable names in each row, only values.
For example:
With variable names:
{"name": "Leandro", "lastName": "Shokida"}
{"name": "Mariano", "lastName": "De Achaval"}
Without variable names:
["Leandro", "Shokida"]
["Mariano", "De Achaval"]
From what I can tell we can: