Open datapythonista opened 1 year ago
This is a great start @datapythonista; this would be more typical (biased toward my workflow):
Note: There are certainly more ways to do this, most of which depend on your end goal. But these are the things you will have to do regardless of end goal to get the images ready for any downstream analysis.
This is more for exploration, similar to what you would do with structured data. So the first couple of things I would want to check about a new imaging dataset are:
Exploration after you load into data structure:
```python
import polars
import polars_dicom

# What are the min and max slice thicknesses? (min shown; swap in .max() for max)
(
    polars_dicom.scan_directory('data/tciaDownload', recursive=True, with_metadata=True)
    .filter(polars.col('Modality') == 'CT')
    .select(min_slice_thickness=polars.col('Slice Thickness').min())
)

# What are the min and max slice spacings?
(
    polars_dicom.scan_directory('data/tciaDownload', recursive=True, with_metadata=True)
    .filter(polars.col('Modality') == 'CT')
    .select(min_slice_spacing=polars.col('Slice Spacing').min())
)

# How many slices do we have?
(
    polars_dicom.scan_directory('data/tciaDownload', recursive=True, with_metadata=True)
    .filter(polars.col('Modality') == 'CT')
    .select(n_slices=polars.col('Patient ID').count())  # assuming 2D slices; if 3D it will be .shape[0] --> [N, R, C]
)
```
Image pre-processing:
So these are just queries for exploratory data analysis. In the case of this dataset, each scan has an accompanying SEG ground-truth mask, to which we do not apply any preprocessing. But we will have it in the same directory, and it's also a `.dcm` file:
```python
import polars
import polars_dicom

(
    polars_dicom.scan_directory('data/tciaDownload', recursive=True, with_metadata=True)
    .filter(polars.col('Modality') == 'CT')
    .select(
        image_for_ml=polars.col('pixeldata')
        .dicom.convert_hu('Rescale Slope', 'Rescale Intercept')
        .dicom.window(window=[WW, WL])
        .dicom.resample(original_spacing, new_spacing=0.5)
        .dicom.to_png()
    )
    .dicom.sink_directory('data/processed')
)
```
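The `convert_hu` and `window` steps in the sketch above are plain array math: Hounsfield conversion is `pixel * Rescale Slope + Rescale Intercept`, and windowing clips to `[WL - WW/2, WL + WW/2]` before rescaling. A minimal NumPy sketch (function names are mine, not the proposed `polars_dicom` API):

```python
import numpy as np

def convert_hu(pixels, slope, intercept):
    # Map raw stored values to Hounsfield units: HU = pixel * slope + intercept
    return pixels * slope + intercept

def window(hu, ww, wl):
    # Clip to the window [WL - WW/2, WL + WW/2], then rescale to [0, 1]
    lo, hi = wl - ww / 2, wl + ww / 2
    return (np.clip(hu, lo, hi) - lo) / (hi - lo)

raw = np.array([0.0, 1000.0, 2000.0])
hu = convert_hu(raw, slope=1.0, intercept=-1024.0)  # -> [-1024., -24., 976.]
img = window(hu, ww=400, wl=40)                     # a typical soft-tissue window
```

The resample and PNG-encoding steps would similarly wrap existing tools (e.g. SciPy interpolation and an image codec) behind the `dicom` accessor.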
Some more examples here: https://www.kaggle.com/code/gzuidhof/full-preprocessing-tutorial
The idea here is to create a sample pipeline to illustrate / discuss a possible API for the dicom library.
@sparalic can you check if this makes sense? This would be a Polars pipeline using our library `polars_dicom`, which would add to Polars a `dicom` accessor equivalent to the `.str` pandas accessor for strings or the `dt` one for dates, but for our `dicom` data type. If you're not too familiar with Polars, here you have a sample pipeline with real estate data that already works (ours won't until we implement our library):