Open fedorov opened 2 years ago
It could be that the parser expects the second column to contain only options, but in this collection that column contains a row with the column description, followed by the options. That column description is not propagated into the column metadata either.
There are similar problems with parsing value option descriptions for other columns:
Unfortunately, we are not yet using this dictionary. Except for the ACRIN collections, none of the dictionaries in other collections have a standard format. They require custom parsers, or maybe some ML method to 'discover' the format, but that is not in my wheelhouse. Fortunately, most of the collections do not have or need a dictionary. So far I have custom dictionary parsers for lidc, ispy2, and hcc_tace. When I'm not using a dictionary, I obtain 'option_codes' by reading the unique values in the clinical tables.
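For readers following along, the fallback described above (deriving 'option_codes' from the unique values in a clinical table) might look roughly like the sketch below. This is not the actual ETL code: the function name `option_codes_from_rows`, the record layout, and the sample rows are all hypothetical, and the table is assumed to be available as a list of dictionaries.

```python
def option_codes_from_rows(rows, column):
    """Collect the distinct non-empty values of `column`, in first-seen order.

    Hypothetical helper: without a dictionary there is nothing to describe
    the codes, so every option_description is left as None.
    """
    seen = {}
    for row in rows:
        value = row.get(column)
        if value is None or value == "":
            continue  # skip missing values rather than treating them as a code
        seen.setdefault(
            value, {"option_code": value, "option_description": None}
        )
    return list(seen.values())


# Made-up sample rows, only to show the shape of the input.
rows = [
    {"race_id": "1", "age": "45"},
    {"race_id": "3", "age": "51"},
    {"race_id": "1", "age": "38"},
    {"race_id": "", "age": "60"},
]
codes = option_codes_from_rows(rows, "race_id")
```

With this approach the codes are discovered, but their descriptions cannot be; those would have to come from a dictionary.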
Thanks for the explanation. As we do have custom parsers for a few other collections, I think it does make sense to add one for ISPY1, since it is an important collection. I understand it is extra work, but I think it is worth it. We could even define some JSON format for the dictionaries, and copy-paste content manually whenever that is more practical than writing a parser. But this can be addressed after v11.
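As a starting point for discussion, one possible shape for such a JSON dictionary is sketched below, together with a small loader. Nothing here is an agreed format: the keys `collection`, `table`, `columns`, `name`, `description`, and `options`, and the function `load_dictionary`, are assumptions made for illustration, and the angle-bracket values are placeholders to be filled in from the source dictionary.

```python
import json

# Hypothetical layout: one entry per column, each carrying its own
# description plus a list of coded options.
DICTIONARY_JSON = """
{
  "collection": "ispy1",
  "table": "ispy1_clinical",
  "columns": [
    {
      "name": "race_id",
      "description": "<column description from the source dictionary>",
      "options": [
        {"option_code": "<code>", "option_description": "<description>"}
      ]
    }
  ]
}
"""


def load_dictionary(text):
    """Parse the JSON text into {column name: {description, options}}."""
    data = json.loads(text)
    return {
        col["name"]: {
            "description": col.get("description"),
            # Map each option code to its description (None if absent).
            "options": {
                opt["option_code"]: opt.get("option_description")
                for opt in col.get("options", [])
            },
        }
        for col in data["columns"]
    }


dictionary = load_dictionary(DICTIONARY_JSON)
```

Keeping the column description separate from the option list would avoid the ambiguity described at the top of this issue, where a description row sits in the same column as the options.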
So far I [only] have custom dictionary parsers for lidc, ispy2, and hcc_tace.
This should definitely be mentioned in the data release notes for v11!
I've added the information from dictionaries for ISPY1 and COVID-19-NY-SBU. This covers all dictionaries for the collections we currently have.
I see that in ispy1_clinical, the race_id options all have option_description set to None. The source table has the actual race for the option descriptions.
Looks like a bug in the parser?