ImagingDataCommons / ETL

(CORE REPO)
Apache License 2.0

Inconsistencies identified for the ISPY1 clinical table #38

Open fedorov opened 2 years ago

fedorov commented 2 years ago

I see ispy1_clinical race_id options all have option_description set to None.

The source table has the actual race values as the option descriptions.

Looks like a bug in the parser?

fedorov commented 2 years ago

It could be that the parser expects the second column to contain only options, but in this collection that column contains a row with the column description, followed by the options. That column description is not propagated into the column metadata either.
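To make the suspected failure mode concrete, here is a minimal sketch of a parser that tolerates a leading description row. It is not the ETL parser: the function name `parse_dictionary_rows`, the `(column_name, second_cell)` row layout, the `code = label` option syntax, and the sample rows are all assumptions made for illustration. Only the field names `option_code` and `option_description` come from this thread.

```python
def parse_dictionary_rows(rows):
    """Group dictionary rows by column (hypothetical layout).

    Each row is a (column_name, second_cell) pair. column_name is empty on
    continuation rows. The second cell holds either a free-text column
    description or an option written as "code = label" (assumed syntax).
    """
    columns = {}
    current = None
    for name, cell in rows:
        name = (name or "").strip()
        cell = (cell or "").strip()
        if name:
            # A non-empty name starts (or resumes) a column entry.
            current = columns.setdefault(
                name, {"column_description": None, "options": []}
            )
        if current is None or not cell:
            continue
        code, sep, label = cell.partition("=")
        if sep and code.strip():
            current["options"].append(
                {"option_code": code.strip(), "option_description": label.strip()}
            )
        elif current["column_description"] is None and not current["options"]:
            # First cell without "=" and before any option: treat it as the
            # column description instead of silently dropping it.
            current["column_description"] = cell
    return columns


# Made-up sample rows, not the real ISPY1 dictionary content.
sample_rows = [
    ("race_id", "Hypothetical description of the column"),
    ("", "1 = Label A"),
    ("", "2 = Label B"),
    ("age", "Another hypothetical description"),
]
parsed = parse_dictionary_rows(sample_rows)
```

A description that itself contains `=` would be misread as an option under this scheme, so a real parser would need a stricter rule for telling the two apart.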

There are similar problems with parsing value option descriptions for other columns.

G-White-ISB commented 2 years ago

Unfortunately, we are not yet using this dictionary. Except for the ACRIN collections, none of the dictionaries in the other collections have a standard format. They require custom parsers, or maybe some ML method to 'discover' the format, but that is not in my wheelhouse. Fortunately, most of the collections do not have or need a dictionary. So far I have custom dictionary parsers for lidc, ispy2, and hcc_tace. When I'm not using a dictionary, I obtain 'option_codes' by reading the unique values in the clinical tables.
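The fallback described above can be sketched roughly as follows. This is an illustration under assumptions, not the ETL code: the function name `option_codes_from_table` and the list-of-dicts table representation are hypothetical. It also shows why `option_description` ends up as None when no dictionary is used, since the table itself carries only the codes.

```python
def option_codes_from_table(rows, column):
    """Derive option codes from the distinct values of one column.

    rows is a list of dicts (one per table row); missing and empty values
    are skipped. No dictionary is consulted, so there is nothing to put
    into option_description.
    """
    seen = []
    for row in rows:
        value = row.get(column)
        if value is None or value == "":
            continue
        if value not in seen:
            seen.append(value)
    # Sort by string form so mixed int/str codes do not raise on comparison.
    return [
        {"option_code": value, "option_description": None}
        for value in sorted(seen, key=str)
    ]


# Made-up rows standing in for a clinical table.
clinical_rows = [
    {"race_id": 3},
    {"race_id": 1},
    {"race_id": 3},
    {"race_id": None},
]
race_options = option_codes_from_table(clinical_rows, "race_id")
```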

fedorov commented 2 years ago

Thanks for the explanation. Since we already have custom parsers for a few other collections, I think it makes sense to add one for ISPY1, as it is an important collection. I understand it is extra work, but I think it is worth it. We could even define a JSON format for the dictionaries and copy-paste the content manually whenever that is more practical than writing a parser. But this can be addressed after v11.
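As a starting point for discussion, a JSON dictionary format could look something like the sketch below. Nothing here is an agreed schema: the keys `collection_id`, `table`, `columns`, `column`, and `column_description`, the loader name `load_dictionary`, and all description and label values are placeholders. Only `option_code` and `option_description` mirror the fields mentioned in this thread.

```python
import json

# Hypothetical hand-maintained dictionary file content; values are placeholders.
DICTIONARY_JSON = """
{
  "collection_id": "ispy1",
  "table": "ispy1_clinical",
  "columns": [
    {
      "column": "race_id",
      "column_description": "placeholder column description",
      "options": [
        {"option_code": "1", "option_description": "placeholder label A"},
        {"option_code": "2", "option_description": "placeholder label B"}
      ]
    }
  ]
}
"""


def load_dictionary(text):
    """Parse the JSON text into {column: {column_description, options}}.

    Raises ValueError if an option has no description, so that an empty
    description is caught at load time rather than showing up later as None.
    """
    data = json.loads(text)
    lookup = {}
    for col in data["columns"]:
        options = {}
        for opt in col.get("options", []):
            if not opt.get("option_description"):
                raise ValueError(
                    f"missing option_description for {col['column']}:"
                    f"{opt.get('option_code')}"
                )
            options[opt["option_code"]] = opt["option_description"]
        lookup[col["column"]] = {
            "column_description": col.get("column_description"),
            "options": options,
        }
    return lookup


dictionary = load_dictionary(DICTIONARY_JSON)
```

Keeping codes as strings in the file avoids guessing types during manual copy-paste; any conversion to the table's actual types would happen in the loader.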

> So far I [only] have custom dictionary parsers for lidc, ispy2, and hcc_tace.

This should definitely be mentioned in the data release notes for v11!

G-White-ISB commented 2 years ago

I've added the information from the dictionaries for ISPY1 and COVID-19-NY-SBU. This covers all dictionaries for the collections we currently have.