ImagingDataCommons / libdicom

C library for reading DICOM files
https://libdicom.readthedocs.io
MIT License
16 stars 7 forks source link

Add getframe #37

Closed jcupitt closed 1 year ago

jcupitt commented 1 year ago

This PR depends on the new IO system -- that should be merged first.

This PR adds a new command-line too, dcm-getframe, which reads a single frame out of a WSI DICOM. This is supposed to be useful for testing DcmBOT and libdicom's ability to handle the various flavours of WSI that are being used.

Sample output with the largest level in the 3D Histech sample WSI DICOM (this file is 1.4GB and contains over 10,000 frames):

$ time dcm-getframe -o x.jpg Histech\^Theresa\ \[1473843\]/2.25.111454577819918515926489701058493557552.dcm 5000

real    0m0.011s
user    0m0.000s
sys 0m0.004s

So it's able to parse the DICOM file, extract and parse the basic offset table, fetch the tile, and write it to disk in 11ms. Pretty good I think!

If you use -v (verbose) it dumps some extra extra information:

$ time dcm-getframe -v -o x.jpg Histech\^Theresa\ \[1473843\]/2.25.111454577819918515926489701058493557552.dcm 7777
INFO     [Tue Feb 14 13:57:19 2023] - Read file 'Histech^Theresa [1473843]/2.25.111454577819918515926489701058493557552.dcm'
INFO     [Tue Feb 14 13:57:19 2023] - Read metadata
INFO     [Tue Feb 14 13:57:19 2023] - Read BOT
INFO     [Tue Feb 14 13:57:19 2023] - Read Basic Offset Table value.
INFO     [Tue Feb 14 13:57:19 2023] - Read frame 7777
INFO     [Tue Feb 14 13:57:19 2023] - frame number = 7777
INFO     [Tue Feb 14 13:57:19 2023] - length = 220848 bytes
INFO     [Tue Feb 14 13:57:19 2023] - rows = 1024
INFO     [Tue Feb 14 13:57:19 2023] - columns = 1024
INFO     [Tue Feb 14 13:57:19 2023] - samples per pixel = 3
INFO     [Tue Feb 14 13:57:19 2023] - bits allocated = 8
INFO     [Tue Feb 14 13:57:19 2023] - bits stored = 8
INFO     [Tue Feb 14 13:57:19 2023] - high bit = 7
INFO     [Tue Feb 14 13:57:19 2023] - pixel representation = 0
INFO     [Tue Feb 14 13:57:19 2023] - planar configuration = 0
INFO     [Tue Feb 14 13:57:19 2023] - photometric interpretation = YBR_FULL_422
INFO     [Tue Feb 14 13:57:19 2023] - transfer syntax uid = 1.2.840.10008.1.2.4.50

real    0m0.006s
user    0m0.006s
sys 0m0.000s

The written JPEG files seem correct.

It fails for the sample Leica DICOMs:

$ dcm-getframe -o x.jpg ~/Leica_Dicom.dcm/1.3.6.1.4.1.36533.992031277215326518414019311377114211255162.dcm 1
ERROR    [Tue Feb 14 14:08:14 2023] - Parse error: Reading of Data Element failed - Encountered unexpected Value Multiplicity 4 for Data Element '0040A160'

They don't use TILED_FULL so I guess the issues is in there somewhere. I noticed what look like some problems in extended BOT handling, which could also be the issue. But we can come back to that.

With libdicom reading frames from at least some slides nicely, I'll move on to adding a DICOM vendor to openslide and circle back to libdicom later.

jcupitt commented 1 year ago

@hackermd you might be interested in the timings above ^^

hackermd commented 1 year ago

@hackermd you might be interested in the timings above ^^

Very impressive! Glad to see that the implementation is that fast. Great work @jcupitt!!

hackermd commented 1 year ago

It fails for the sample Leica DICOMs:

dcm-getframe -o x.jpg ~/Leica_Dicom.dcm/1.3.6.1.4.1.36533.992031277215326518414019311377114211255162.dcm 1
ERROR    [Tue Feb 14 14:08:14 2023] - Parse error: Reading of Data Element failed - Encountered unexpected Value Multiplicity 4 for Data Element '0040A160'

They don't use TILED_FULL so I guess the issues is in there somewhere. I noticed what look like some problems in extended BOT handling, which could also be the issue. But we can come back to that.

The error appears to be due to a standard conformity issue. The attribute indeed has an incorrect Value Multiplicity.

@dclunie do you have an image with TILED_SPARSE dimension organization handy that is fully standard compliant and would be suitable for testing frame-level data access?

dclunie commented 1 year ago

(0040,A160) is Text Value, which has a VR of UT, which by definition does not have multiple values (any embedded backslash in the text is not a value delimiter), so whatever parser is saying "unexpected Value Multiplicity 4" is incorrect.

jcupitt commented 1 year ago

I think you're right David, it looks like UT is trying to parse the string value. I've made a separate issue for this:

https://github.com/ImagingDataCommons/libdicom/issues/39

jcupitt commented 1 year ago

... I fixed it on this branch. It now reads the Leica DICOMs correctly, though it can't fetch frames as there's no BOT.

With a largeish (710mb) Leica DICOM level I see:

$ time dcm-getframe -v -o x.jpg 1.3.6.1.4.1.36533.992031277215326518414019311377114211255162.dcm 1
INFO     [Thu Feb 16 10:27:48 2023] - Read file '1.3.6.1.4.1.36533.992031277215326518414019311377114211255162.dcm'
INFO     [Thu Feb 16 10:27:48 2023] - Read metadata
INFO     [Thu Feb 16 10:27:49 2023] - Read BOT
INFO     [Thu Feb 16 10:27:49 2023] - Basic Offset Table is empty.

real    0m1.017s
user    0m0.472s
sys 0m0.545s

Parsing the sparse tile metadata is extremely painful. We could probably speed the parser up a bit more.

dclunie commented 1 year ago

... I fixed it on this branch. It now reads the Leica DICOMs correctly, though it can't fetch frames as there's no BOT.

You can't depend on the presence of BOTs - you need to be able to build your own if absent (with lazy instantiation on the 1st get frame call if you want to defer the time it takes to do this unless it turns out to be necessary).

This is not too hard if there is one fragment per frame - just walk all the Items in the PixelData noting their offset as you go. If there is not a match between the NumberOfFrames and the number of items (+1 for the empty BOT item), then you have fragmentation (more than one item per frame) and you need to figure out which items are the start of frames (or which is an end item meaning the next is a start item - I do this for JPEG-family Transfer Syntaxes by walking back from the end of each item looking for JPEG EOI marker segments (accounting for trailing padding to an even byte length) rather than walking forward parsing the JPEG bitstream.

jcupitt commented 1 year ago

Thanks for the tips, David.

This is not too hard if there is one fragment per frame

Yes, I do this with fo-dicom to read Leica files, then cache the offset maps for later reuse.

I do this for JPEG-family Transfer Syntaxes by walking back from the end of each item looking for JPEG EOI marker segments

Ah good idea! Thanks.

hackermd commented 1 year ago

... I fixed it on this branch. It now reads the Leica DICOMs correctly, though it can't fetch frames as there's no BOT.

You can't depend on the presence of BOTs - you need to be able to build your own if absent (with lazy instantiation on the 1st get frame call if you want to defer the time it takes to do this unless it turns out to be necessary).

@jcupitt We have implemented the dcm_file_build_bot() for this purpose. The getframe program will need to check for the presence of a BOT and if none is found, try to build it instead of failing right away.

jcupitt commented 1 year ago

The getframe program will need to check for the presence of a BOT and if none is found, try to build it instead of failing right away.

Ah gotcha, I'll change it. There seem to be a few problems with build bot, I'll try to fix them too.

jcupitt commented 1 year ago

Oooof that was painful, but it seems to work now, with and without a BOT. On a TILED_FULL image with a BOT I see:

$ dcm-getframe -v -o x.jpg ../Histech\^Theresa\ \[1473843\]/2.25.231896281549030369487042280447616527925.dcm 2
INFO     [Sat Feb 18 17:22:48 2023] - Read file '../Histech^Theresa [1473843]/2.25.231896281549030369487042280447616527925.dcm'
INFO     [Sat Feb 18 17:22:48 2023] - Read metadata
INFO     [Sat Feb 18 17:22:48 2023] - Read BOT
INFO     [Sat Feb 18 17:22:48 2023] - Read Basic Offset Table value.
INFO     [Sat Feb 18 17:22:48 2023] - Read frame 2
INFO     [Sat Feb 18 17:22:48 2023] - frame number = 2
INFO     [Sat Feb 18 17:22:48 2023] - length = 221482 bytes
INFO     [Sat Feb 18 17:22:48 2023] - rows = 1024
INFO     [Sat Feb 18 17:22:48 2023] - columns = 1024
INFO     [Sat Feb 18 17:22:48 2023] - samples per pixel = 3
INFO     [Sat Feb 18 17:22:48 2023] - bits allocated = 8
INFO     [Sat Feb 18 17:22:48 2023] - bits stored = 8
INFO     [Sat Feb 18 17:22:48 2023] - high bit = 7
INFO     [Sat Feb 18 17:22:48 2023] - pixel representation = 0
INFO     [Sat Feb 18 17:22:48 2023] - planar configuration = 0
INFO     [Sat Feb 18 17:22:48 2023] - photometric interpretation = YBR_FULL_422
INFO     [Sat Feb 18 17:22:48 2023] - transfer syntax uid = 1.2.840.10008.1.2.4.50

And on a BOT-less Leica dcm I see:

$ dcm-getframe -v -o x.jpg 1.3.6.1.4.1.36533.992031277215326518414019311377114211255162.dcm 17
INFO     [Sat Feb 18 17:23:24 2023] - Read file '1.3.6.1.4.1.36533.992031277215326518414019311377114211255162.dcm'
INFO     [Sat Feb 18 17:23:24 2023] - Read metadata
INFO     [Sat Feb 18 17:23:25 2023] - Read BOT
INFO     [Sat Feb 18 17:23:25 2023] - Basic Offset Table is empty
INFO     [Sat Feb 18 17:23:25 2023] - Build BOT
INFO     [Sat Feb 18 17:23:25 2023] - Read frame 17
INFO     [Sat Feb 18 17:23:25 2023] - frame number = 17
INFO     [Sat Feb 18 17:23:25 2023] - length = 5506 bytes
INFO     [Sat Feb 18 17:23:25 2023] - rows = 256
INFO     [Sat Feb 18 17:23:25 2023] - columns = 256
INFO     [Sat Feb 18 17:23:25 2023] - samples per pixel = 3
INFO     [Sat Feb 18 17:23:25 2023] - bits allocated = 8
INFO     [Sat Feb 18 17:23:25 2023] - bits stored = 8
INFO     [Sat Feb 18 17:23:25 2023] - high bit = 7
INFO     [Sat Feb 18 17:23:25 2023] - pixel representation = 0
INFO     [Sat Feb 18 17:23:25 2023] - planar configuration = 0
INFO     [Sat Feb 18 17:23:25 2023] - photometric interpretation = YBR_FULL_422
INFO     [Sat Feb 18 17:23:25 2023] - transfer syntax uid = 1.2.840.10008.1.2.4.50
hackermd commented 1 year ago

Thanks for your efforts @jcupitt. I can image that it was painful, but am glad that you figured it out. At some point, we will need to take the extended offset table into account, too. However, we may want to leave that for another day.