ImagingDataCommons / etl_flow

(CORE REPO)
Apache License 2.0

Duplicate entries in qualitative measurements table #70

Open fedorov opened 1 year ago

fedorov commented 1 year ago

The issues below were reported by @deepakri201 via discord. Need to investigate.

Using the query from https://github.com/ImagingDataCommons/etl_flow/pull/69, I think issues similar to the earlier ones are occurring: rows are duplicated when a slice has more than one body part region or more than one landmark assigned to it. For example:

  1. For the case of multiple regions per slice -- If we take PatientID="LUNG1-002", and check where trackingIdentifier="Annotations group 14", we should only get 2 rows corresponding to Abdomen and Chest regions, but we get 4 rows.

  2. For the case of multiple landmarks per slice -- If we take PatientID="LUNG1-001", and check where trackingIdentifier="Annotations group landmarks 1", we should only get 2 rows corresponding to Kidney + Bottom, and L2 vertebra + Center, but we get 4 rows.

However, using the query here: https://github.com/vkt1414/etl_flow/blob/dde527d1e3ad85fcabe3571a66468f69c387a033/bq/derived_table_creation/BQ_Table_Building/derived_data_views/sql/qualitative_measurements.sql, the regions and landmarks are correct. Andrey, I think you may have worked from a slightly older version of Vamsi's query, from before he fixed these problems.
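A quick way to confirm the duplication described above is to count how often each combination of patient, tracking identifier, and value occurs. The sketch below does this in plain Python on hand-made rows. The column names (`PatientID`, `trackingIdentifier`, `Value`) and the sample rows are illustrative assumptions that mimic the reported symptom; they are not the actual table contents.

```python
from collections import Counter


def find_duplicates(rows, key_fields):
    """Return {key: count} for every key that occurs more than once."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical rows: 4 rows returned where only 2 (Abdomen, Chest) are expected.
rows = [
    {"PatientID": "LUNG1-002", "trackingIdentifier": "Annotations group 14", "Value": "Abdomen"},
    {"PatientID": "LUNG1-002", "trackingIdentifier": "Annotations group 14", "Value": "Abdomen"},
    {"PatientID": "LUNG1-002", "trackingIdentifier": "Annotations group 14", "Value": "Chest"},
    {"PatientID": "LUNG1-002", "trackingIdentifier": "Annotations group 14", "Value": "Chest"},
]

duplicates = find_duplicates(rows, ["PatientID", "trackingIdentifier", "Value"])
for key, n in duplicates.items():
    print(key, "appears", n, "times")
```

The same grouping could be expressed in SQL with `GROUP BY ... HAVING COUNT(*) > 1` against the real table.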

fedorov commented 1 year ago

TL;DR: After reviewing this, thinking about possible solutions, and discussing with @vkt1414, I decided to revert the updates to the queries done in #64 and proceed with v13.


The original queries that correspond to bigquery-public-data.idc_current.qualitative_measurements and bigquery-public-data.idc_current.quantitative_measurements were written to flatten the content of TID 1500 SRs that we had at the time. In all of those SRs, both quantitative and qualitative measurements were accompanying image regions defined by segmentations, such as the example below (section of the output of dsrdump for the file in gs://idc-dev-open/c5dd463f-7740-47da-80d3-e6114904e5c3.dcm):

 <contains CONTAINER:(,,"Imaging Measurements")=SEPARATE>
    <contains CONTAINER:(,,"Measurement Group")=SEPARATE>
      <has obs context TEXT:(,,"Activity Session")="1">
      <has obs context TEXT:(,,"Tracking Identifier")="Nodule 1">
      <has obs context UIDREF:(,,"Tracking Unique Identifier")="2.25.84572801268285922663419591960434030454640929448094786485074">
      <contains CODE:(,,"Finding")=(M-03010,SRT,"Nodule")>
      <has obs context TEXT:(,,"Time Point")="1">
      <contains IMAGE:(,,"Referenced Segment")=(SG image,,1)>
      <contains UIDREF:(,,"Source series for segmentation")="1.3.6.1.4.1.14519.5.2.1.6279.6001.179049373636438705059720603192">
      <has concept mod CODE:(,,"Finding Site")=(T-28000,SRT,"Lung")>
      <contains NUM:(,,"Volume")="6.594475E+03" (mm3,UCUM,"cubic millimeter")>
        <has concept mod TEXT:(,,"Algorithm Name")="pylidc">
        <has concept mod TEXT:(,,"Algorithm Version")="0.2.0">
      <contains NUM:(,,"Diameter")="3.195933E+01" (mm,UCUM,"millimeter")>
        <has concept mod TEXT:(,,"Algorithm Name")="pylidc">
        <has concept mod TEXT:(,,"Algorithm Version")="0.2.0">
      <contains NUM:(,,"Surface area of mesh")="2.392704E+03" (mm2,UCUM,"square millimeter")>
        <has concept mod TEXT:(,,"Algorithm Name")="pylidc">
        <has concept mod TEXT:(,,"Algorithm Version")="0.2.0">
      <contains CODE:(,,"Subtlety score")=(105,99LIDCQIICR,"5 out of 5 (Obvious)")>
      <contains CODE:(,,"Internal structure")=(C12471,NCIt,"Soft tissue")>

For all of those SRs, the following assumptions were valid:

  1. each measurement group contains one and only one "Finding" concept and one and only one "Finding site" concept
  2. each measurement group contains one or more NUM content items with the quantitative measurements
  3. each measurement group contains one or more CODE content items with the qualitative measurements/assessments, none of which uses "Finding" or "Finding site" concepts, and so can be distinguished from the items in 1 above.
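To make the role of these assumptions concrete, here is a minimal Python sketch of flattening one measurement group into qualitative rows. It is not the actual BigQuery query. The dictionary layout of a group (a list of `(concept, value)` pairs holding only CODE items) and the `finding` column name are assumptions made for illustration; `findingSite`, `Quantity`, and `Value` follow the column names mentioned in this thread.

```python
def flatten_group(group):
    """Flatten one measurement group into qualitative-measurement rows.

    Relies on the assumptions above: exactly one "Finding", exactly one
    "Finding Site", and assessments stored as CODE items that use neither concept.
    """
    items = group["items"]
    findings = [value for concept, value in items if concept == "Finding"]
    sites = [value for concept, value in items if concept == "Finding Site"]
    if len(findings) != 1 or len(sites) != 1:
        raise ValueError("group violates the one Finding / one Finding Site assumption")
    return [
        {
            "trackingIdentifier": group["trackingIdentifier"],
            "finding": findings[0],
            "findingSite": sites[0],
            "Quantity": concept,
            "Value": value,
        }
        for concept, value in items
        if concept not in ("Finding", "Finding Site")
    ]


# CODE items taken from the "Nodule 1" measurement group shown above.
nodule_group = {
    "trackingIdentifier": "Nodule 1",
    "items": [
        ("Finding", "Nodule"),
        ("Finding Site", "Lung"),
        ("Subtlety score", "5 out of 5 (Obvious)"),
        ("Internal structure", "Soft tissue"),
    ],
}

rows = flatten_group(nodule_group)
for row in rows:
    print(row)
```

A group that has no "Finding" item, or more than one "Finding Site" item, makes this sketch raise `ValueError`, which is the situation the rest of this comment deals with.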

With those assumptions, the result of "flattening" was the following table schema (for the qualitative measurements):

[image: screenshot of the resulting table schema]

Now, the new dataset has SRs that:

  1. Contain only qualitative measurements.
  2. Use "Finding site" concept to describe the actual qualitative assessment, and not the location of the segmented region, with multiple "Finding site" content items allowable within the same measurement group.
    <contains CONTAINER:(,,"Measurement Group")=CONTINUOUS>
      <has obs context TEXT:(,,"Tracking Identifier")="Annotations group 162">
      <has obs context UIDREF:(,,"Tracking Unique Identifier")="1.2.826.0.1.3680043.8.498.20536057271431471310083689490807745912">
      <has concept mod CODE:(,,"Finding Site")=(45048000,SCT,"Neck")>
      <contains IMAGE:(,,"Source")=(CT image,)>

It is not clear to me what the expected behavior should be when flattening measurement groups like the one above into the table schema we established, or whether doing so would make any sense at all.

We could, arguably, put "Finding site" into the findingSite column and have one row for each "Finding site" content item. But, in my opinion, this would be confusing, since the actual values in the Quantity/Value columns would have to either replicate the findingSite column or be left blank. The query would also get quite complex, since we would probably need to detect measurement groups that do not accompany segmentations and process those differently. Yet another idea would be to use a concept different from "Finding site" for those annotations (I was reluctant to use that concept from the start, as noted in https://github.com/ImagingDataCommons/IDC-ProjectManagement/issues/1218#issuecomment-1372254742, anticipating problems due to the clash of meanings for that concept).

Alternatively, we could have a completely separate query that would handle evaluations that are not derived from segmentations. I think this would be easier to understand for the user. I think for v13 we should do just that, and use that query in the notebooks and other materials accompanying the new nnU-Net-BPR-annotations collection.
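As a rough illustration of what such separate handling could look like, the Python sketch below routes measurement groups by whether they contain a "Referenced Segment" item and, for groups that do not, emits one row per "Finding Site" item. This is only a sketch of the idea, not the query that will be used. The group layout, the function names, and the output column `annotation` are all hypothetical, and the two-region sample group is constructed from the Abdomen/Chest example reported above.

```python
def is_segmentation_based(group):
    """True if the group references a segment, as in the original TID 1500 SRs."""
    return any(concept == "Referenced Segment" for concept, _ in group["items"])


def flatten_annotation_group(group):
    """Emit one row per "Finding Site" item for a group without a segmentation."""
    return [
        {"trackingIdentifier": group["trackingIdentifier"], "annotation": value}
        for concept, value in group["items"]
        if concept == "Finding Site"
    ]


# Hypothetical group with two regions assigned to the same slice.
annotation_group = {
    "trackingIdentifier": "Annotations group 14",
    "items": [
        ("Finding Site", "Abdomen"),
        ("Finding Site", "Chest"),
        ("Source", "CT image"),
    ],
}

if not is_segmentation_based(annotation_group):
    annotation_rows = flatten_annotation_group(annotation_group)
    for row in annotation_rows:
        print(row)
```

Keeping this path separate means the existing flattening logic does not need special cases for groups that lack a "Finding" item or carry several "Finding Site" items.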