Open fedorov opened 1 year ago
TL;DR: After reviewing this, thinking about possible solutions, and discussing with @vkt1414, I decided to revert the updates to the queries done in #64 and proceed with v13.
The original queries that correspond to bigquery-public-data.idc_current.qualitative_measurements and bigquery-public-data.idc_current.quantitative_measurements were written to flatten the content of the TID 1500 SRs that we had at the time. In all of those SRs, both quantitative and qualitative measurements accompanied image regions defined by segmentations, such as the example below (section of the output of dsrdump for the file in gs://idc-dev-open/c5dd463f-7740-47da-80d3-e6114904e5c3.dcm):
<contains CONTAINER:(,,"Imaging Measurements")=SEPARATE>
<contains CONTAINER:(,,"Measurement Group")=SEPARATE>
<has obs context TEXT:(,,"Activity Session")="1">
<has obs context TEXT:(,,"Tracking Identifier")="Nodule 1">
<has obs context UIDREF:(,,"Tracking Unique Identifier")="2.25.84572801268285922663419591960434030454640929448094786485074">
<contains CODE:(,,"Finding")=(M-03010,SRT,"Nodule")>
<has obs context TEXT:(,,"Time Point")="1">
<contains IMAGE:(,,"Referenced Segment")=(SG image,,1)>
<contains UIDREF:(,,"Source series for segmentation")="1.3.6.1.4.1.14519.5.2.1.6279.6001.179049373636438705059720603192">
<has concept mod CODE:(,,"Finding Site")=(T-28000,SRT,"Lung")>
<contains NUM:(,,"Volume")="6.594475E+03" (mm3,UCUM,"cubic millimeter")>
<has concept mod TEXT:(,,"Algorithm Name")="pylidc">
<has concept mod TEXT:(,,"Algorithm Version")="0.2.0">
<contains NUM:(,,"Diameter")="3.195933E+01" (mm,UCUM,"millimeter")>
<has concept mod TEXT:(,,"Algorithm Name")="pylidc">
<has concept mod TEXT:(,,"Algorithm Version")="0.2.0">
<contains NUM:(,,"Surface area of mesh")="2.392704E+03" (mm2,UCUM,"square millimeter")>
<has concept mod TEXT:(,,"Algorithm Name")="pylidc">
<has concept mod TEXT:(,,"Algorithm Version")="0.2.0">
<contains CODE:(,,"Subtlety score")=(105,99LIDCQIICR,"5 out of 5 (Obvious)")>
<contains CODE:(,,"Internal structure")=(C12471,NCIt,"Soft tissue")>
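To make the "flattening" concrete, here is a minimal Python sketch that turns the measurement group above into one row per NUM item and one row per CODE item. The dictionary layout, the `flatten` helper, and the `Quantity`/`Value`/`Units` column names are assumptions made for illustration; the actual queries run in BigQuery SQL against a different representation.

```python
# Hypothetical in-memory form of the measurement group shown above.
# Illustration only: this is not the layout of the BigQuery tables.
group = {
    "trackingIdentifier": "Nodule 1",
    "finding": "Nodule",
    "findingSite": "Lung",
    "items": [
        {"type": "NUM", "concept": "Volume", "value": 6594.475, "units": "mm3"},
        {"type": "NUM", "concept": "Diameter", "value": 31.95933, "units": "mm"},
        {"type": "NUM", "concept": "Surface area of mesh", "value": 2392.704, "units": "mm2"},
        {"type": "CODE", "concept": "Subtlety score", "value": "5 out of 5 (Obvious)"},
        {"type": "CODE", "concept": "Internal structure", "value": "Soft tissue"},
    ],
}


def flatten(group, value_type):
    """Return one row per content item of the given value type (NUM or CODE)."""
    # Group-level context is repeated on every row of the flattened table.
    context = {k: group[k] for k in ("trackingIdentifier", "finding", "findingSite")}
    return [
        {
            **context,
            "Quantity": item["concept"],
            "Value": item["value"],
            "Units": item.get("units"),  # None for CODE items
        }
        for item in group["items"]
        if item["type"] == value_type
    ]


quantitative_rows = flatten(group, "NUM")
qualitative_rows = flatten(group, "CODE")
```

In this sketch every row carries the same finding and finding site, which works only because each group describes a single segmented region.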
For all of those SRs, the following assumptions were valid:
NUM content items with the quantitative measurements
CODE content items with the qualitative measurements/assessments, none of which uses "Finding" or "Finding site" concepts, and so can be distinguished from the items in 1 above.
With those assumptions, the result of "flattening" was the following table schema (for the qualitative measurements):
Now, the new dataset has SRs that:
<contains CONTAINER:(,,"Measurement Group")=CONTINUOUS>
<has obs context TEXT:(,,"Tracking Identifier")="Annotations group 162">
<has obs context UIDREF:(,,"Tracking Unique Identifier")="1.2.826.0.1.3680043.8.498.20536057271431471310083689490807745912">
<has concept mod CODE:(,,"Finding Site")=(45048000,SCT,"Neck")>
<contains IMAGE:(,,"Source")=(CT image,)>
It is not clear to me what the expected behavior should be when flattening the measurement groups above into the table schema we established, or whether doing so would make sense at all.
We could, arguably, put "Finding site" into the findingSite column, and have one row for each "Finding site" content item. But, in my opinion, this would be confusing, since the actual values in the Quantity/Value columns would have to either replicate the findingSite column or be left blank. The query would also get quite complex, since we would probably need to detect measurement groups that do not accompany segmentations and process those differently. Yet another idea would be to use a concept different from "Finding site" for those annotations (I was reluctant to use that concept from the start, as noted in https://github.com/ImagingDataCommons/IDC-ProjectManagement/issues/1218#issuecomment-1372254742, anticipating problems due to the clash over that concept).
Alternatively, we could have a completely separate query that would handle evaluations that are not derived from segmentations. I think this would be easier for the user to understand. I think for v13 we should do just that, and use that query in the notebooks and other materials accompanying the new nnU-Net-BPR-annotations collection.
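As a rough sketch of what the separate-query approach amounts to, the Python below partitions measurement groups by whether they contain a "Referenced Segment" item, then emits one row per "Finding Site" item for the groups that do not. The dictionary layout and the helper names (`references_segment`, `image_level_rows`) are hypothetical; a real implementation would be a BigQuery SQL query, which is not shown here.

```python
# Hypothetical, simplified measurement groups; only the fields needed here.
groups = [
    {
        "trackingIdentifier": "Nodule 1",
        "items": [
            {"type": "IMAGE", "concept": "Referenced Segment"},
            {"type": "CODE", "concept": "Finding Site", "value": "Lung"},
            {"type": "NUM", "concept": "Volume", "value": 6594.475},
        ],
    },
    {
        "trackingIdentifier": "Annotations group 162",
        "items": [
            {"type": "CODE", "concept": "Finding Site", "value": "Neck"},
            {"type": "IMAGE", "concept": "Source"},
        ],
    },
]


def references_segment(group):
    """True if the group accompanies a segmentation."""
    return any(item["concept"] == "Referenced Segment" for item in group["items"])


# Groups for the existing queries vs. groups for the separate query.
segmentation_groups = [g for g in groups if references_segment(g)]
image_level_groups = [g for g in groups if not references_segment(g)]


def image_level_rows(group):
    """One row per 'Finding Site' item; no Quantity/Value columns are needed."""
    return [
        {"trackingIdentifier": group["trackingIdentifier"], "findingSite": item["value"]}
        for item in group["items"]
        if item["type"] == "CODE" and item["concept"] == "Finding Site"
    ]


rows = [row for g in image_level_groups for row in image_level_rows(g)]
```

Keeping the two kinds of groups in separate outputs avoids the blank or duplicated Quantity/Value columns discussed above.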
The issues below were reported by @deepakri201 via Discord. Need to investigate.
Using this query: https://github.com/ImagingDataCommons/etl_flow/pull/69. I think similar issues are occurring as before, when a slice has more than one body part region assigned to it, or more than one landmark assigned to it. For example:
For the case of multiple regions per slice -- If we take PatientID="LUNG1-002", and check where trackingIdentifier="Annotations group 14", we should only get 2 rows corresponding to Abdomen and Chest regions, but we get 4 rows.
For the case of multiple landmarks per slice -- If we take PatientID="LUNG1-001", and check where trackingIdentifier="Annotations group landmarks 1", we should only get 2 rows corresponding to Kidney + Bottom, and L2 vertebra + Center, but we get 4 rows.
However, using the query here: https://github.com/vkt1414/etl_flow/blob/dde527d1e3ad85fcabe3571a66468f69c387a033/bq/derived_table_creation/BQ_Table_Building/derived_data_views/sql/qualitative_measurements.sql, the regions and landmarks are correct. Andrey, I think you may have worked from a slightly older version of Vamsi's query where he fixed these problems.
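One way to check for this kind of row inflation once query results are loaded into Python is to compare, for each (PatientID, trackingIdentifier) pair, the total row count against the number of distinct values. The sketch below is only an illustration: the rows are made up to mimic the reported symptom (the real rows may differ, and may not be exact duplicates), and the `findingSite` value column and the `groups_with_extra_rows` helper are assumptions.

```python
from collections import Counter

# Made-up rows that mimic the reported symptom (4 rows where 2 are expected).
rows = [
    {"PatientID": "LUNG1-002", "trackingIdentifier": "Annotations group 14", "findingSite": "Abdomen"},
    {"PatientID": "LUNG1-002", "trackingIdentifier": "Annotations group 14", "findingSite": "Chest"},
    {"PatientID": "LUNG1-002", "trackingIdentifier": "Annotations group 14", "findingSite": "Abdomen"},
    {"PatientID": "LUNG1-002", "trackingIdentifier": "Annotations group 14", "findingSite": "Chest"},
]


def groups_with_extra_rows(rows, value_column="findingSite"):
    """Map each group key to (total rows, distinct values) when total exceeds distinct."""
    total = Counter()
    distinct = {}
    for row in rows:
        key = (row["PatientID"], row["trackingIdentifier"])
        total[key] += 1
        distinct.setdefault(key, set()).add(row[value_column])
    return {
        key: (total[key], len(distinct[key]))
        for key in total
        if total[key] > len(distinct[key])
    }


suspicious = groups_with_extra_rows(rows)
```

Any key present in `suspicious` is a candidate for closer inspection; a group with no repeated values does not appear in the result.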