dcmjs-org / dcmjs

Javascript implementation of DICOM manipulation
https://dcmjs.netlify.com/
MIT License
300 stars 112 forks source link

fix(lossles-round-trip): Account for padding byte in numeric strings with multiplicity #401

Closed craigberry1 closed 2 months ago

craigberry1 commented 2 months ago

Fixes issue during lossless read/write with numeric string value representations (IS/DS). This was caused by a string like: 0.99924236548978\-0.0322633220972\-0.0217663285287\0.02949870928067\0.99267261121054\-0.1171789789306 where the space following the final number is correctly added for evenness, but produced an error during re-write as the unformatted value exceeded the max length of the VR.

Solution was to remove this padding byte from _rawValue property during read if it exceeds the max length of the vr. This padding will be re-added by the writer if needed for evenness.

netlify[bot] commented 2 months ago

Deploy Preview for dcmjs2 ready!

Name Link
Latest commit 2b57f106c26e4e28d88bd20b6b459a16564cfdb7
Latest deploy log https://app.netlify.com/sites/dcmjs2/deploys/66d889e42998e4000820aa6b
Deploy Preview https://deploy-preview-401--dcmjs2.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

craigberry1 commented 2 months ago

This can be an issue for other VRs as well. Converting to draft while coming up with a more generalized solution.

craigberry1 commented 2 months ago

Fixed generically by add dropPadByte check when strings are split for multiplicity. Aims to track specific case when the last element of an multiple values data element has is exactly 1 byte over max length and that byte is the VR's pad byte. In this case we drop that character as it is not considered part of the original value but was just added for evenness. The writer will add the padding back if needed.

wayfarer3130 commented 2 months ago

This looks good to me, thanks for the work and the PR 👍

I'd like if we could get a second set of eyes on it, just to be sure Iso 'll request another reviewer to have a look.

I think always removing the pad byte from numeric strings is a more consistent solution than trying to get the exact correct length. It just seems like that leaves the possibility of errors when the length is that exact length as compared to shorter lengths or vice-versa.

craigberry1 commented 2 months ago

This looks good to me, thanks for the work and the PR 👍 I'd like if we could get a second set of eyes on it, just to be sure Iso 'll request another reviewer to have a look.

I think always removing the pad byte from numeric strings is a more consistent solution than trying to get the exact correct length. It just seems like that leaves the possibility of errors when the length is that exact length as compared to shorter lengths or vice-versa.

The issue I had was trying to determine whether the pad byte/space was significant or added for evenness. Many VR's (like the numeric strings) allow for whitespace so unless the parsed string exceeds the max length there is nothing saying the whitespace shouldn't be there.

This originally come out of work for lossless round trip processing where the goal has been to preserve exactly what is in the source DICOM file.

With that approach in mind my thinking was to only manipulate the original string in a very specific circumstance, in this case being the final element in the string containing a single byte longer than allowed and it being equal to the padding byte. In any other case I leave as-is and let the writer determine if the value is valid during write.

craigberry1 commented 2 months ago

Hey @wayfarer3130 I wanted to check back in. I agree with your sentiment re: I think the pad byte is an artifact of the binary encoding, and should always be stripped. The tricky part is identifying when a padding byte was actually added.

For example the spec for DS states: Decimal Strings may be padded with leading or trailing spaces. So we don't want to strip off a whitespace character that was intended to be there:

Example

In this case I don't think we can determine whether a padding SPACE was added for evenness or was part of the value, and don't think we should manipulate it.


This error I'm seeing is a very specific use case where a data element with multiplicity > 1 has a total encoded length that is odd, and the final value is already at max length.

The added padding byte achieves even length, but increases the final element above its individual max length:

Example 0.99924236548978\-0.0322633220972

Stripping the padding byte makes sense here IMO as the whitespace is not part of the original value, it was clearly added for padding to create evenness.


So I understand hesitation around parsing special logic though am thinking this is a way we can definitively detect a padding byte and not make other assumptions about the values.

But let me know what you think and I'll follow your direction!

wayfarer3130 commented 2 months ago

@craigberry1 - as long as you only remove the pad byte on the last item when it is at an odd position, you should be fine as it will then get re-added. I agree the standard is ambiguous there, but I don't quite see how to fix it at this point.

craigberry1 commented 2 months ago

@craigberry1 - as long as you only remove the pad byte on the last item when it is at an odd position, you should be fine as it will then get re-added. I agree the standard is ambiguous there, but I don't quite see how to fix it at this point.

Thanks for the response @wayfarer3130, I was away for a few days. I wrote up another comment but deleted it after some more thought because I think the odd length check you suggested is sufficient in this case. I just want to be careful not to regress on the lossless round trip work.

I'll update this PR shortly. Thanks for the help.

craigberry1 commented 2 months ago

@pieper @wayfarer3130 Ready for another look when you get a chance. Thanks!

pieper commented 2 months ago

I think that only stripping a space at the end of the string is the right thing. I don't think any reasonable implementation would add a byte anywhere else to make the buffer and even length. Sounds like we are all agreed then.