Fix: reading sequences with undefined length items - length value of item delimitation tag and new item tag will be used to detect the length of the sequence more precise

kursawero commented 1 year ago

The problem:

A DICOM Data Element Value item with undefined length is closed by an Item Delimitation Tag at the end. The previous implementation (ValueRepresentation.js) uses just the tags (4 byte) to for a stack counter that increases by one with every new Item Tag and decreases by one with every Delimitation Tag. In some of our data sets, we have binary data that by coincidence have one or more similar byte sequences. In that case, the end of a sequences is not detected and the parsing fails, because the while loop runs until the end of the data stream.

The fix:

The nema standard defines fixed undefined length attributes after the new item tag and a four byte zero length attribute after the item delimitation tag. We are reading the length attribute in both cases and increasing/decreasing the stack only if the length attribute value is the same as in the standard. Comparing 8 byte instead of 4 byte is more precise and it is more unlikely that binary data have similar sequences by coincidence.

Nema standard definition:

https://dicom.nema.org/medical/dicom/current/output/chtml/part05/sect_7.5.2.html

I added two tests as propsals.

One is a unit test which is a little bit limited, because the readBytes method does a lot. If you do not like the constances, we can put the values directly into the DicomDataReadBufferStreamBuilder. I had in mind to reuse the constants in the code, but it seems a bit tricky, because you would need to cast them for using it in if statements for instance.

The other is a kind of integration test with a specific file that is already part of your reporitory. The "it" folder would be the place for integration tests. I'm open if you prefer another folder structure, but I think it would be good to separate the different tests a little bit.

kursawero commented 1 year ago

Thank you very much for the quick review. I tried to address your comments. Please feel free to resolve the items or if you want changes please reply.

I added the changes as fixups, means we need to squash the commits before it can be merged. I hope that helps to follow the changes easier and we can revert changes easier if necessary.

Regarding other projects. The Cornerstone Dicom Parser has a nice solution, which is using just the Sequence Delimitation Item.

Unfortunately, we would need to change the code a lot to come to that point and the chance to produce bugs is very high. The existing code here has the algorithm to count the open and close tags of elements with undefined length and if the balance is reached (that the sequence has finished) - then it uses the outer loop to find the Sequence Delimitation Tag. The items itself are parsed when the inner while loop is finished. The Cornerstone Dicom Parser reads the Elements immediately (sequences in sequences will be handled with another recursive call of the function). Reaching the Sequence Delimitation Item is the condition to return.

The proposed changes are from my perspective the best balance between fixing the existing code, keeping the risk of bugs low and allow to review the pull request with reasonable effort.

github-actions[bot] commented 1 year ago

:tada: This PR is included in version 0.29.6 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

dcmjs-org / dcmjs

Fix: reading sequences with undefined length items - length value of item delimitation tag and new item tag will be used to detect the length of the sequence more precise #346