datapythonista / dicom_reader

Future API sample #3

Open datapythonista opened 1 year ago

datapythonista commented 1 year ago

The idea here is to create a sample pipeline to illustrate / discuss a possible API for the dicom library.

import polars
import polars_dicom

# Proposed API only: polars_dicom is not implemented yet, so this does not run.
(
    polars_dicom.scan_directory('data/tciaDownload', recursive=True, with_metadata=True)
    .filter(
        (polars.col('Patient ID') == 123)
        & (polars.col('Slice Thickness') < 1.25)
        & (polars.col('Pixel Spacing') < 1)
    )
    .select(
        image_for_ml=polars.col('pixeldata')
        .dicom.resample(new_spacing=0.5)
        .dicom.convert_hu()
        .dicom.to_png()
    )
    .dicom.sink_directory('data/processed')
)

@sparalic can you check if this makes sense? This would be a Polars pipeline using our library polars_dicom, which would add a dicom accessor to Polars, equivalent to the pandas .str accessor for strings or the .dt accessor for dates, but for our dicom data type.

If you're not too familiar with Polars, here you have a sample pipeline with real estate data that already works (ours won't until we implement our library):

[Image: sample Polars pipeline with real estate data]

sparalic commented 1 year ago

This is a great start @datapythonista. The following would be more typical (biased toward my workflow):

Note: There are certainly more ways to do this, most of which depend on your end goal. But these are the things you will have to do regardless of the end goal to get the images ready for any downstream analysis.

This part is more for exploration, similar to what you would do with structured data. The first couple of things I would want to check about a new imaging dataset are:

Exploration after loading into a data structure:

  1. What is the range of the slice thickness? --> It is a proxy for image quality and needs to be uniform for downstream analysis.
  2. The user may want to see what the pixel spacing is and resample it to some value --> This is important as we often merge datasets from different sources.
  3. Count how many slices each patient has in a single volume -->
import polars
import polars_dicom

# What is the min slice thickness? (We want both min and max; only min is shown here.)
(
    polars_dicom.scan_directory('data/tciaDownload', recursive=True, with_metadata=True)
    .filter(polars.col('Modality') == 'CT')
    .select(min_slice_thickness=polars.col('Slice Thickness').min())
)

# What is the min slice spacing? (Use .max() in the same way for the max.)
(
    polars_dicom.scan_directory('data/tciaDownload', recursive=True, with_metadata=True)
    .filter(polars.col('Modality') == 'CT')
    .select(min_slice_spacing=polars.col('Slice Spacing').min())
)

# How many slices does each patient have?
(
    polars_dicom.scan_directory('data/tciaDownload', recursive=True, with_metadata=True)
    .filter(polars.col('Modality') == 'CT')
    .group_by('Patient ID')
    .agg(n_slices=polars.len())  # assuming 2D slices; for a 3D volume it would be .shape[0] --> [N, R, C]
)
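
To make the intent of these three queries concrete without relying on the unimplemented library, here is a plain-Python sketch over a handful of metadata records. The records and their values are invented for illustration; only the column names come from the queries above.

```python
from collections import Counter

# Invented metadata records, one per 2D slice.
records = [
    {"Patient ID": "A", "Modality": "CT", "Slice Thickness": 1.0},
    {"Patient ID": "A", "Modality": "CT", "Slice Thickness": 1.0},
    {"Patient ID": "B", "Modality": "CT", "Slice Thickness": 2.5},
    {"Patient ID": "B", "Modality": "SEG", "Slice Thickness": 2.5},
]

# Equivalent of .filter(polars.col('Modality') == 'CT')
ct = [r for r in records if r["Modality"] == "CT"]

# Range of the slice thickness (min and max).
thicknesses = [r["Slice Thickness"] for r in ct]
thickness_range = (min(thicknesses), max(thicknesses))

# Number of CT slices per patient.
slices_per_patient = Counter(r["Patient ID"] for r in ct)
```

A thickness range that is not a single value, as in this toy data, is the signal that the volumes would need to be made uniform before downstream analysis.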

Image pre-processing: The queries above are just for exploratory data analysis. In this dataset, each scan has an accompanying SEG ground-truth mask, to which we do not apply any preprocessing. It will be in the same directory, though, and it is also a .dcm file:

  1. Filter for CT scans only (.dcm)
  2. Select the pixel data and apply general image preprocessing tasks in the following order
  3. Save to some file format in some dir
import polars
import polars_dicom

# WW (window width), WL (window level), image and original_spacing are
# placeholders here; they are not defined in this sketch.
(
    polars_dicom.scan_directory('data/tciaDownload', recursive=True, with_metadata=True)
    .filter(polars.col('Modality') == 'CT')
    .select(
        image_for_ml=polars.col('pixeldata')
        .dicom.convert_hu('Rescale Slope', 'Rescale Intercept')
        .dicom.window(window=[WW, WL])
        .dicom.resample(image, original_spacing, new_spacing=0.5)
        .dicom.to_png()
    )
    .dicom.sink_directory('data/processed')
)
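
As a rough illustration of what the three preprocessing steps compute, here is a plain-Python sketch on a tiny made-up 2x2 image. The function names are hypothetical and the pixel values are invented. The HU step is the usual linear rescale (pixel * Rescale Slope + Rescale Intercept). The windowing is the common simplified clip-and-scale to 0-255. The resample step only derives the target shape and leaves out the actual interpolation.

```python
def convert_hu(pixels, slope, intercept):
    """Linear rescale of raw pixel values to Hounsfield units."""
    return [[p * slope + intercept for p in row] for row in pixels]


def apply_window(hu, width, level):
    """Clip to [level - width/2, level + width/2] and scale to 0..255."""
    low = level - width / 2
    high = level + width / 2
    return [
        [round(255 * (min(max(v, low), high) - low) / (high - low)) for v in row]
        for row in hu
    ]


def resampled_shape(shape, spacing, new_spacing):
    """Target shape after resampling every axis to new_spacing."""
    return tuple(round(n * s / new_spacing) for n, s in zip(shape, spacing))


raw = [[0, 924], [1124, 3000]]  # invented raw pixel values
hu = convert_hu(raw, slope=1.0, intercept=-1024.0)
windowed = apply_window(hu, width=400, level=0)
new_shape = resampled_shape((100, 200), spacing=(1.0, 0.75), new_spacing=0.5)
```

Windowing before resampling, as in the pipeline above, means the interpolation runs on already-clipped values; whether that order or the reverse is preferable is a design choice worth settling for the API.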

Some more examples here: https://www.kaggle.com/code/gzuidhof/full-preprocessing-tutorial