update to libdicom v0.4

jcupitt commented 1 year ago

This PR splits the DICOM loader into two parts: a generic DICOM parser, and a layer that uses the parser to build the DICOM tree in memory. This change lets the loader only load the parts of the file that are relevant and can save a huge amount of memory and time.

For example, here's git main libdicom dumping the highest resolution layer of a large (2GB) Leica DICOM file:

$ /usr/bin/time -f %M:%e time dcm-dump 1.3.6.1.4.1.36533.21716722913121316773164147287613518920414871.dcm > log
2.33user 1.37system 0:03.71elapsed 99%CPU (0avgtext+0avgdata 601600maxresident)k
0inputs+210024outputs (0major+150078minor)pagefaults 0swaps
601600:3.72

So it used 600MB of memory and took 3.7s.

Here's the same file with this PR:

$ /usr/bin/time -f %M:%e time dcm-dump 1.3.6.1.4.1.36533.21716722913121316773164147287613518920414871.dcm > log
0.00user 0.01system 0:00.01elapsed 100%CPU (0avgtext+0avgdata 28928maxresident)k
0inputs+24outputs (0major+6875minor)pagefaults 0swaps
28928:0.01

It used 28mb of memory and took 0.01s.

It's a large PR, I'll write a few posts introducing the new design.

jcupitt commented 1 year ago

Here's vipsdisp showing that horrible DICOM:

I have an openslide PR ready to go that updates it for this revised libdicom API.

jcupitt commented 1 year ago

The loader now looks like this internally:

DcmIO (see dicom-io.c) is our virtual IO layer that provides a nice posix-like abstraction over memory, file and other IO sources (openslide uses it to attach libdicom to the openslide IO system).
DcmParse (see dicom-parse.c) is a new thing that uses DcmIO to walk any DICOM file. It triggers a set of callbacks when it recognises objects, so there are methods like dataset_begin(), dataset_end(), element_create(), etc.
DcmFilehandle (see dicom-file.c) is the thing we had before, it's an API for loading DICOMs into memory as big trees of objects you can then examine. It uses DcmParse to walk DICOM files and then calls the tree builders from the callbacks if it decides that part of the DICOM is relevant.

Here's what the new DcmParse thing looks like:

typedef struct _DcmParse {
    bool (*dataset_begin)(DcmError **, void *client);
    bool (*dataset_end)(DcmError **, void *client);

    bool (*sequence_begin)(DcmError **, void *client);
    bool (*sequence_end)(DcmError **,
                         void *client,
                         uint32_t tag,
                         DcmVR vr,
                         uint32_t length);

    bool (*element_create)(DcmError **,
                           void *client,
                           uint32_t tag,
                           DcmVR vr,
                           char *value,
                           uint32_t length);

    bool (*stop)(void *client,
                 uint32_t tag,
                 DcmVR vr,
                 uint32_t length);
} DcmParse;

So a set of six functions which will be called by the parser when it sees a structure. You parse a chunk of file with this:

DCM_EXTERN
bool dcm_parse_group(DcmError **error,
                     DcmIO *io,
                     bool implicit,
                     bool byteswap,
                     const DcmParse *parse,
                     void *client);

You seek to the group you want to parse, then call that thing and the various callbacks will run as the parser scans the group.

dicom-file.c uses that to load file meta like this:

https://github.com/jcupitt/libdicom/blob/add-parser/src/dicom-file.c#L447-L456

So it makes all the callbacks, checks the DICOM header and preamble, makes a DcmSequence to hold the meta, and sets the parser going. The callbacks above that create the tree and link elements into the sequence.

After dcm_parse_group() returns, it extracts the dataset that was constructed, does a little extra checking, and returns.

There are a few other things in dicom-parse.c.

There's something to very quickly scan PerFrameFunctionalGroupSequence in Leica DICOMs and get just the (x, y) position of each tile. The great thing about this new parser design is that we can walk the sequence and build the array in one pass. We don't need to load the tree to memory and then scan that to make the frame index.
There's a thing to scan PixelData quickly to load or build the BOT.
There's something to load a single frame.

All the high level logic remains in dicom-file.c.

jcupitt commented 1 year ago

The user-facing API of libdicom has the following changes:

I bumped the version to 0.4.
I removed DcmBOT, this is now handled automatically.
There's a new API call bool dcm_filehandle_read_pixeldata(DcmError **error, DcmFilehandle *filehandle) which you can call after loading the image metadata. This does all the scanning of the file necessary for frame get to work. If you don't call it, it will be called for you on the first frame get.
Get frame is simpler -- it's now just: DcmFrame *dcm_filehandle_read_frame(DcmError **error, DcmFilehandle *filehandle, uint32_t frame_number). It handles the BOT for you, and also handles sparse tiles, so there's no need to scan PerFrameFunctionalGroupSequence.
dcm-dump no longer displays PerFrameFunctionalGroupSequence.

Questions to consider:

Is removing DcmBOT a good idea? It feels like a useful simplification to me.
Is automatically remapping frames with PerFrameFunctionalGroupSequence a good idea? Again, this feels like a useful simplification to me. It does mean that dcm-getframe no longer takes a frame number (ie. index in PixelData), but instead takes a frame index, meaning row-major position from the top left.
dcm-dump should probably have a -a (meaning all) flag to dump all fields in a DICOM, even the boring ones.
We could expose the new parser in the user-facing API, though I don't know how useful it would be. It would be tricky to call from python, for example. Maybe a flag to dcm_filehandle_read_metadata() meaning "load all metadata you can find, not just the useful ones" would be better.

I'm inclined to address these issues in a follow-up PR.

jcupitt commented 1 year ago

OK, ready for review @hackermd and @bgilbert !

bgilbert commented 1 year ago

(FWIW, I'm not planning to do detailed code reviews of changes in this repo.)

Is automatically remapping frames with PerFrameFunctionalGroupSequence a good idea? Again, this feels like a useful simplification to me. It does mean that dcm-getframe no longer takes a frame number (ie. index in PixelData), but instead takes a frame index, meaning row-major position from the top left.

Yeah, I was noticing that in https://github.com/openslide/openslide/pull/454. It seems like useful functionality, but I could also imagine a caller wanting direct access to a physical frame number. Would it make sense to add a higher-level getframe function that does the mapping and reads the tile, and maybe a helper API function for debugging that only does the mapping? Those functions could even take (row, col) coordinate parameters so the caller doesn't have to do that math.

jcupitt commented 1 year ago

Here's the matching PR in openslide:

https://github.com/openslide/openslide/pull/454

Although tests are failing there. libdicom assumes that TotalMatrixPixelColumns is present, but the sample DICOM openslide is using does not have that tag. Does anyone know if WSI DICOMs always have that tag defined, or is it optional?

jcupitt commented 1 year ago

Would it make sense to add a higher-level getframe function that does the mapping and reads the tile, and maybe a helper API function for debugging that only does the mapping? Those functions could even take (row, col) coordinate parameters so the caller doesn't have to do that math.

Good idea. Perhaps we could add this API in a follow-up PR?

bgilbert commented 1 year ago

libdicom assumes that TotalMatrixPixelColumns is present, but the sample DICOM openslide is using does not have that tag. Does anyone know if WSI DICOMs always have that tag defined, or is it optional?

The DICOM sample in the synthetic test isn't a WSI DICOM; it uses the Multi-frame True Color Secondary Capture Image Storage SOP. We're just using it to test that our linkage to libdicom works correctly.

Good idea. Perhaps we could add this API in a follow-up PR?

This PR removes the caller's ability to read a frame from the DICOM by physical frame number, right? If we're planning to undo that, I think it makes sense to do so in this PR.

jcupitt commented 1 year ago

That's fair. OK, read_frame() now takes a frame number again, as before, while there's a new API call:

DcmFrame *dcm_filehandle_read_frame_position(DcmError **error,
                                             DcmFilehandle *filehandle,
                                             uint32_t column,
                                             uint32_t row)

Which fetches a tile at an (x, y) position, numbering from zero and taking account of PerFrameFunctionalGroupSequence, if necessary.

I'll update the openslide PR to use this new thing.

jcupitt commented 1 year ago

OK, it's now working with openslide, libvips, vipsdisp, and all my test DICOMs. I think this is ready to merge.

jcupitt commented 1 year ago

This is now tested on all platforms. I've merged this to main on my jcupitt/libdicom fork and openslide has been updated to use that.

I'll leave this PR open for reference and in case any review is needed later.

jcupitt commented 1 year ago

This PR has been superseded by https://github.com/ImagingDataCommons/libdicom/pull/56

ImagingDataCommons / libdicom

update to libdicom v0.4 #55