Thanks for reporting this 👍
I'm not sure what the best practice is in JS for handling precision in ASCII/float conversions. There was a long thread on a similar issue in C++ at dcmqi.
This experiment confirms the behavior: `String(Number("7.1945578383e-05")).length` returns 17.
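To spell the experiment out (standard `Number` behavior, nothing dcmjs-specific):

```js
// JavaScript's default toString only switches to exponential notation for
// magnitudes below 1e-6, so this value is expanded to fixed notation:
const v = Number("7.1945578383e-05");
console.log(String(v));          // "0.000071945578383"
console.log(String(v).length);   // 17 -> one more than the DS limit of 16
console.log(v.toExponential());  // "7.1945578383e-5" -> 15 chars, fits
```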
We could just truncate the string, but better would be to switch to scientific notation to preserve the significant digits along the lines discussed in the dcmqi issue.
Would appreciate a PR addressing this if you know the right answer. Otherwise we can ask around.
I'm struggling to come up with a nice solution that won't break existing functionality.
The issue as I see it is less to do with precision and more about formatting. To be able to associate a format with the values within a DecimalString means either changing the type of the values (which may break existing behaviour for some users) or associating a new piece of state with the values (which would require bigger changes to pass the state around).
One simple but slightly crude solution would be to check the length of the value when it's being written and use exponential notation if it is longer than the maxLength (see the sketch below). This would solve our problem of not being able to write the original DICOM, but could still mean that the format of values is modified between read and write (e.g. an exponential value that is shorter than 16 characters would be written in standard notation).
I can provide a PR for that solution, or does anyone else have other ideas?
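A minimal sketch of that fallback, using a hypothetical `toDecimalString` helper rather than the actual dcmjs writer code:

```js
// Hypothetical helper illustrating the approach above: keep standard
// notation when it fits, otherwise fall back to exponential, rounding
// (not truncating) away precision until the string fits in 16 chars.
function toDecimalString(value, maxLength = 16) {
  const plain = String(value);
  if (plain.length <= maxLength) return plain;
  for (let digits = 15; digits >= 0; digits--) {
    const exp = value.toExponential(digits); // rounds to digits+1 significant digits
    if (exp.length <= maxLength) return exp;
  }
  throw new Error(`Cannot encode ${value} as DS`);
}

console.log(toDecimalString(0.000071945578383)); // "7.19455783830e-5" (16 chars)
```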
I'd be curious how dcmtk and pydicom operate in this case. We could see if we can get the same round trip DS as they do.
We could try to save the original format (standard vs scientific), but in any case we'd need to handle the case where that same number was generated on the fly by some calculation, so we need arbitrary numbers to be converted to DS in the best form possible.
This basic topic came up in a discussion of the DICOM JSON Model being possibly lossy in some respects and I remember @dclunie telling me that the committee had decided that small floating point rounding issues should be ignored.
So I think we should try to find the js number to string formatting options that result in the right number of characters while maintaining as much precision as possible. Perhaps the approach you suggested @TimMullender will work well.
Since DICOM DS allows scientific notation, it seems to me that the precision should be preserved.
Otherwise 1e-16 represented as a DS (16 char max) in scientific notation would become zero in non-scientific notation, not .0000000000000001 (if I can count 0's correctly).
I guess we need to consider both string-to-binary conversions and vice versa, and consider that of the two different string representations, one (scientific) allows greater precision in the mantissa than the other (non-scientific), depending on the actual value.
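To illustrate the 1e-16 example above (plain `Number` methods):

```js
const tiny = 1e-16;
console.log(tiny.toString());   // "1e-16" -> 5 chars, fits easily
console.log(tiny.toFixed(16));  // "0.0000000000000001" -> 18 chars, over budget
// Truncating that fixed form to 16 chars gives "0.00000000000000",
// which reads back as zero: the value is lost entirely.
```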
JavaScript uses 64-bit floats internally, does it not? That means 52 bits of precision can be maintained, for which I understand 15 decimal digits are required [1], but unfortunately DICOM only allows 11 for scientific notation (if there is a decimal point, an 'e' and a two-digit exponent with a minus sign for the exponent), or 10 for negative values, and 15 for a positive value that "fits" without resorting to scientific notation, 14 if a sign is required.
See also [2] [3].
[3] contains a late comment that mentions [4] and the code [5], which seems to be the definitive answer on how to maintain round-trip binary/string full fidelity.
I also came across [6] for JavaScript.
Also, when length is an issue, one should round rather than truncate, propagating carries [4].
I should check my toolkits to see what they do, especially for extreme cases. Maybe we should make a list of challenging values to include in tests.
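To make the digit budget concrete (illustrative only, using standard `toExponential`):

```js
// A positive double in scientific notation with a two-digit negative exponent:
// "d.dddddddddde-dd" = 1 + 1 + 10 + 4 = 16 chars -> 11 significant digits.
const x = 3.141592653589793e-10;
console.log(x.toExponential(10)); // "3.1415926536e-10" -> exactly 16 chars
// A lossless round trip of a double can need up to 17 significant digits,
// so some precision is necessarily lost in this worst case.
```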
OK, seems that Gay dtoa is not the last word on this after all ... e.g.:
http://dl.acm.org/doi/10.1145/3360595
Has a video of the presentation at the end.
Thanks for your help with this @dclunie. Yes, JS uses 64-bit floating point for its Number type.
Yes, we should keep a list of challenging values like "7.1945578383e-05" and decide what the best encodings should be, given the formats available and the digits allowed. Then we can map those best practices into the language-specific implementations. The experience of @fedorov in the dcmqi issue linked above shows that this won't always be trivial.
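As a starting point, a hypothetical seed for that list, drawn only from values mentioned in this thread:

```js
// Candidate "challenging" DS strings for round-trip tests (a suggested
// starting point, not an exhaustive list):
const challengingValues = [
  "7.1945578383e-05",  // the original report: expands to 17 chars in JS
  "6.483748056e-006",  // three-digit exponent, reported later in this thread
  "1e-16",             // becomes zero if written in truncated fixed notation
  "-0.9999997019767",  // leading sign consumes one of the 16 characters
];
```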
Regarding maintaining the original format (scientific vs standard) for dcmjs objects, we could adopt a similar strategy as we use for VR of tags that aren't in the data dictionary. In that case we keep an _vrMap object to map from tag to VR so we can write them out again. In that spirit we could have a _formatMap that could be populated to track any numbers not in standard notation. Then when writing we would consult whether the map exists and write accordingly. This would also allow the creator of a new instance to provide a _formatMap if they wanted to control the format for some reason.
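A rough sketch of how that might look (the `_formatMap` shape here is purely illustrative, by analogy with `_vrMap`; it is not existing dcmjs API):

```js
// Illustrative only: track per-value formatting hints alongside the dataset,
// the way _vrMap tracks VRs for tags outside the data dictionary.
const naturalizedDataset = {
  ImageOrientationPatient: [-0.2437435686588, 7.1945578383e-5],
  _formatMap: {
    // keyword -> per-value hints captured at read time (null = default)
    ImageOrientationPatient: [null, "exponential"],
  },
};

function writeDS(value, hint) {
  return hint === "exponential" ? value.toExponential() : String(value);
}

const hints = naturalizedDataset._formatMap.ImageOrientationPatient;
console.log(
  naturalizedDataset.ImageOrientationPatient.map((v, i) => writeDS(v, hints[i]))
);
// -> [ "-0.2437435686588", "7.1945578383e-5" ]
```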
@TimMullender : do you know what software generated your example? It somehow decided that 4 zeros after the decimal point was enough to trigger a switch to scientific notation but 3 zeros was not. Maybe that's just regular %g printf.
Adding @rfloca and @kislinsk, since in dcmqi we ultimately went with this PR from @kislinsk: https://github.com/QIICR/dcmqi/pull/416. I think for C++ it is more difficult since, as we discovered, some details of formatting scientific notation (i.e., the width of the exponent) are platform-specific and cannot be controlled through the API. I believe the MITK developers have a lot more expertise with C++ than me, and I wanted to make sure that whatever we do in dcmqi keeps the MITK devs happy, so I agreed with the solution they proposed.
Once we figured out the maximum possible string lengths from the C++ standard library and IEEE 754 that preserve the very same value between serialization and deserialization, the fix in C++ was easy because the 32-bit floating point numbers were guaranteed to always fit into the DICOM format.
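For comparison, the same guarantee can be demonstrated from JS, assuming `Math.fround` as a stand-in for float32:

```js
// FLT_DECIMAL_DIG = 9: nine significant digits round-trip any 32-bit float,
// and the worst-case string "-d.dddddddde-dd" is 15 chars, under the 16 limit.
const f = Math.fround(7.1945578383e-5);    // nearest float32 value as a double
const s = f.toExponential(8);              // 9 significant digits
console.log(s.length <= 16);               // true: the 9-digit form always fits
console.log(Math.fround(Number(s)) === f); // true: the value round-trips exactly
```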
It's a different story for 64-bit floating point numbers in JS. There's no way to store the full 64-bit precision for all cases in the maximum number of characters DICOM provides, even with superior algorithms like Ryū or Dragonbox, I guess; but to be honest I doubt that one would need that much precision for real-world values in this context anyway. Rounding seems to be a feasible solution.
Hi, as a short addition to @kislinsk: apart from the fact that the precision of the C++ code base decided the issue on a technical level, our assumptions regarding the needed precision were made only with the serialization of geometric information (origin, spacing, rotation matrix) in classical medical imaging scenarios in mind, together with the point at which we assume values to be practically equal (implicitly answering the question of how much precision we have to capture in order to detect inequality). If the same functionality/code is used to serialize any DS, this assumption might not hold, because those values may have another range.
> @TimMullender : do you know what software generated your example? It somehow decided that 4 zeros after the decimal point was enough to trigger a switch to scientific notation but 3 zeros was not. Maybe that's just regular %g printf.
The example DICOM that is attached is a modified version of a publicly available example, so there was no risk of uploading private data. I don't have too much information about the original, as it was scrubbed before we got to see it; I can tell you it was an MRI scan from a Philips MR Imaging device. I assumed the decision was somewhat based on what would fit without losing precision.
> I'd be curious how dcmtk and pydicom operate in this case
From what I can see, when reading, pydicom stores the original string value and writes it back if the value has not been changed.
I've linked a PR for the simple approach. I can look at the bigger change of being able to specify a format, and setting the format when reading a value if that's wanted, but it may take me some time.
There's been more progress on the pydicom related issue.
@TimMullender should we merge your PR #176 as is?
I recently came across the same issue, the value in my case being `6.483748056e-006`, which was converted to `0.000006483748056` (length = 17). The value was present in the Image Orientation Patient array. The above PR fixes the bug in my case. As I am dealing with reading, anonymizing and writing DICOM files, I think this change will be sufficient. Considering user-created values, you could argue that creating a valid decimal is the responsibility of the user.
As far as I see, we can merge the PR. I hope this gives a positive bump.
Thanks for the report @rvantklooster. I just re-read (some of) the long threads on this topic and I think you are right, we're better off merging #176 and then further optimizing later if the need arises. We can re-examine what's being done in C++ and python for interoperability (make sure we're not more lossy than they are).
Unfortunately #176 developed conflicts with master.
Resolved conflicts and merged - hopefully no mistakes were made.
:tada: This issue has been resolved in version 0.19.1 :tada:
Your semantic-release bot :package::rocket:
Given a DICOM that contains a DS value with exponential notation, e.g.
DS: [-0.2437435686588, 0.96983969211578, 7.1945578383e-05, 0.00072092906339, 0.00025536940665, -0.9999997019767]
When I run DicomMessage.readFile and write
Then the write should complete without any errors

Currently the write fails with the following error:

    Error: Value exceeds max length, vr: DS, value: 0.000071945578383, length: 17
        at DecimalString.writeBytes
Example code:
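The original snippet isn't preserved here; a minimal reproduction sketch, assuming the documented `dcmjs.data.DicomMessage` API, Node's `fs`, and a hypothetical extracted file name, would be along these lines:

```js
// Reproduction sketch (file name assumed; the attached example is a zip):
const fs = require("fs");
const dcmjs = require("dcmjs");

// Copy the file bytes into a standalone ArrayBuffer (Node Buffers may share
// a larger pooled backing buffer, so slice to the exact byte range):
const fileBuffer = fs.readFileSync("exponentialNotation.dcm");
const arrayBuffer = fileBuffer.buffer.slice(
  fileBuffer.byteOffset,
  fileBuffer.byteOffset + fileBuffer.byteLength
);

const dicomDict = dcmjs.data.DicomMessage.readFile(arrayBuffer);
dicomDict.write(); // throws: Error: Value exceeds max length, vr: DS, ...
```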
Example DICOM: exponentialNotation.zip
dcmjs version: 0.16.6 (I've also tried the latest)
node version: v12.6.2