ImagingDataCommons / ETL

(CORE REPO)
Apache License 2.0

[clinical] Support NLST clinical data #28

Closed fedorov closed 1 year ago

fedorov commented 2 years ago

Currently, it is captured in the nlst_clinical_* tables, with the dictionaries in RTF format linked here: https://learn.canceridc.dev/data/organization-of-data/files-and-metadata#nlst. I don't know if they are already handled, but I did not find the dictionaries/tables in idc-dev-etl:clinical.

G-White-ISB commented 2 years ago

I have not started on parsing the RTF files

G-White-ISB commented 2 years ago

This issue is still outstanding

fedorov commented 1 year ago

@G-White-ISB I made Excel spreadsheets for each of the dictionaries, see here:

NLST_dicts_xls.zip

The original RTF files, for convenience, are here:

nlst780.idc.delivery.052821.zip

Can you please review and let me know whether you would organize it differently or whether it is OK? If it is OK, I will send the files to TCIA and ask them to post them on the wiki, so you can ingest them from there and incorporate them into your workflow. Would it be possible to add this to v15?

G-White-ISB commented 1 year ago

I can definitely work with these Excel files. We'll get this in for v15. I don't know if we need to bother TCIA. After the release I'll see if I can do the RTF to Excel conversion programmatically.

fedorov commented 1 year ago

> I don't know if we need to bother TCIA.

I do! If I put effort into this, and I believe it can help someone, I want it to be available, and ideally at a central place. I will take care of this.

> After the release I'll see if I can do the RTF to Excel conversion programmatically

I do not think this is worth the effort. I don't think we can expect those files to be updated, and this is not a common representation, so we do the conversion once and forget about it until the next time (if the next time ever comes).

fedorov commented 1 year ago

@G-White-ISB I was reviewing this, and I am having trouble understanding the BQ content.

I selected column metadata using this query:

```sql
SELECT
  *
FROM
  `bigquery-public-data.idc_v15_clinical.column_metadata`
WHERE
  collection_id="nlst"
  AND table_name="bigquery-public-data.idc_v15_clinical.nlst_prsn"
ORDER BY
  column_label
```

G-White-ISB commented 1 year ago

The source data for nlst_prsn is all in one CSV file with all 30+ columns. The accompanying RTF document, which was used to create the Excel spreadsheet, explains different sets of columns on different pages.

Some columns were missed because the column name does not literally appear in the dictionary. Columns scr_iso0, scr_iso1, scr_iso2 are apparently covered by the single entry scr_iso0-2 in the dictionary.
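
As a rough illustration of the range handling described above, here is a minimal Python sketch that expands a range-style dictionary entry such as scr_iso0-2 into literal column names and then lists data columns that no dictionary entry covers. The `<prefix><start>-<end>` pattern is inferred from the single example in this thread, and the function names and the sample column `pid` are hypothetical; this is not the actual parsing script.

```python
import re

# Assumed pattern: an entry ending in "<start>-<end>", e.g. "scr_iso0-2".
_RANGE_RE = re.compile(r"(.*?)(\d+)-(\d+)")


def expand_column_range(entry):
    """Expand 'scr_iso0-2' into ['scr_iso0', 'scr_iso1', 'scr_iso2'].

    Entries that do not look like a range are returned unchanged.
    """
    match = _RANGE_RE.fullmatch(entry)
    if match is None:
        return [entry]
    prefix = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if end < start:
        # Not a sensible range; treat the entry as a literal name.
        return [entry]
    return [f"{prefix}{i}" for i in range(start, end + 1)]


def undocumented_columns(data_columns, dictionary_entries):
    """Return data columns that no (expanded) dictionary entry covers."""
    documented = set()
    for entry in dictionary_entries:
        documented.update(expand_column_range(entry))
    return [col for col in data_columns if col not in documented]


if __name__ == "__main__":
    # Hypothetical sample: a few CSV column names and dictionary entries.
    columns = ["pid", "scr_iso0", "scr_iso1", "scr_iso2"]
    entries = ["pid", "scr_iso0-2"]
    print(expand_column_range("scr_iso0-2"))
    print(undocumented_columns(columns, entries))
```

Running a check like `undocumented_columns` against the CSV header before loading the metadata would flag any column whose name still has no dictionary entry after expansion.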

fedorov commented 1 year ago

This needs to be addressed in the custom parsing script. The dictionary should contain actual values for meaning/labels.

G-White-ISB commented 1 year ago

The column_metadata table in the pdp_staging dataset has been updated to include the column labels and options for the scr_iso0..scr_iso2 and scr_days0..scr_days2 columns, as parsed from the dictionary. The table still needs to be updated in the public dataset.