Loss of precision when serializing Decimal String (DS) and Integer String (IS)

craigberry1 commented 3 months ago

When reading a DICOM file, data elements with VR IS or DS are currently serialized by reading as ASCII Strings, applying custom formatting, and then converting to Number.

Examples:

1.0000 drops padded zeroes and yields 1
+1.234 drops the + sign
9007199254740993 is read as 9007199254740992 due to floating point precision
456 drops whitespace to 456

My read from the spec is that these are all considered valid values:

A string of characters representing either a fixed point number or a floating point number. A fixed point number shall contain only the characters 0-9 with an optional leading "+" or "-" and an optional "." to mark the decimal point. A floating point number shall be conveyed as defined in ANSI X3.9, with an "E" or "e" to indicate the start of the exponent. Decimal Strings may be padded with leading or trailing spaces. Embedded spaces are not allowed.

This leads to fundamental changes in the data element value by simply reading and then writing a DICOM file.

Example

When reading in a file containing the data elements:

(0x0018,0x1041) DS Contrast/Bolus Volume     VR=<DS>   VL=<0x0010>  <9007199254740993>
(0x0018,0x1190) DS Focal Spot(s)     VR=<DS>   VL=<0x0006>  <00.000>

A read & write of it using dcm-js with no other modifications yields:

(0x0018,0x1041) DS Contrast/Bolus Volume     VR=<DS>   VL=<0x0010>  <9007199254740992> // modified value
(0x0018,0x1190) DS Focal Spot(s)     VR=<DS>   VL=<0x0002>  <0 > // loss of significant digits

Assuming that these are in fact valid values according to the spec, why does the library try to apply custom formatting to the value like dropping whitespace and the + sign, or bother converting to a Number which is subject to Floating Point rounding issues? Why not leave as a String and let clients decide when/how to parse as a number?

pieper commented 3 months ago

There's been discussion of this in the past. See: https://github.com/dcmjs-org/dcmjs/issues/53, https://github.com/dcmjs-org/dcmjs/pull/97, https://github.com/dcmjs-org/dcmjs/pull/99

Issues haven't seemed to have come up in practice, but if you find that these are real use cases where lost data has an impact then it would make sense to implement a lossless round trip strategy. A workable solution would be to save the string representation in a local spot like _originalVR and use it for writing if the Number version of the data hasn't changed since reading. I think it's more 'natural' to have the values as Numbers for processing but keeping them lossless for IO certainly makes sense.

It would be interesting to see if pydicom or other tools handle this.

craigberry1 commented 3 months ago

I've done some testing with dcm4che and it appears it follows a "lossless" approach. It may apply some opinionated formatting when explicitly writing/overwriting tag values into its dataset, but it will leave any existing unaltered values as-is.

My understanding is it does this by maintaining each data element value in its original form as a ByteArray and simply concatenating that onto the output stream at write time. It provides an interface (getString(int tag), getDouble(int tag), etc.) for rendering tags for processing, but its up to the user of the library to decide if/when to invoke this API.

I am interested in implementing some form of lossless round trip parsing in dcm-js. I am working on an anonymizer and ran into this issue in real files where precision from decimal strings was being lost (specifically numbers like 1.000 or 0.000), but in general the wider issue to me is that tags I did not anonymize are being altered at all.

Do you have any guidance on implementing this in such a way that would be accepted as a contribution? My thinking was to add read/write flags that would opt-out to any formatting applied past the initial serialization from the data view.

I could add an _originalVR property as you mentioned though there would need to be state/logic that determines whether the value has actually been changed or not at write time.

BTW its a smaller issue but one other manifestation of this that I have seen is around trimming whitespace. It appears many VR's run either a trim() or rtrim() post-processing on parsed Strings in cases where the spec calls out treating whitespace as significant. So I would want to handle this case as well.

craigberry1 commented 3 months ago

One more use case I want to highlight that is similar but might warrant a separate issue/discussion.

I followed the recent change to implement ParsedUnknownValue VR in https://github.com/dcmjs-org/dcmjs/pull/379, which will read an unknown vrType if the meta dictionary is aware of the tag.

There is a similar "error" here where reading/writing a DICOM file will change an unknown/unimplemented VR into UN when written back to the file.

There are few VR's that I don't believe have been implemented yet (OV, OL, SV, UV) but I thought as long as the meta dictionary contained the tags themselves the data element (tag, vr, length, value) would be unchanged after read/write.

So I am thinking about ways to preserve the vrType on write even if it is written using UN implementation.

pieper commented 3 months ago

I like the idea of being lossless wherever possible, and supporting all valid VRs.

I don't think we'd need to track the modified state of any elements, just keep the _originalVR along with the 'Number' version of it, and then when writing if parsing the _originalVR is the same as the Number just write out the _orignalVR. This wouldn't add much overhead in space or time.

Regarding whitespace I guess the same could be done. I think trailing spaces are only added when the value needs to be an even length, but again in the spirit of losslessness we could try to keep track of when the spaces are stripped and write out the original if trimming it results in the same string as the exposed value. In both these cases I'd try to avoid changing the current API behavior.

Regarding UN that may be handled in part by the _vrMap field at the top level of the datasets or maybe it needs to be extended, I haven't looked in detail for a while.

craigberry1 commented 3 months ago

The problem in my mind would be that due to the applied formatting when converting to Number we would always have a difference on the _originalVR (assuming this represents the original raw ascii string) in the cases I listed regardless of whether it was the value was actually manipulated or not.

So a file with value:

(0x0018,0x1190) DS Focal Spot(s)     VR=<DS>   VL=<0x0006>  <00.000>

Would be read as:

{
  "meta": {
    // ...
  },
  "dict": {
    "00181190": {
      "vr": "DS",
      "Value": [0],
      "_originalVR": ["00.000"]
    }
  }
}

pieper commented 3 months ago

What I meant was that if you run Number(0) and it's the same as Number("00.000") then you would deduce that it hadn't been changed externally and would write out "00.000" . I think this would work the same in the other cases you listed and wouldn't result in losing any intentional changes in the values.

craigberry1 commented 3 months ago

I see what you mean. The formatting that is applied during each value representation read would need to be extracted so it could be re-applied onto _originalVR and then compared with Value. I will look into this and aim to put out a PR for lossless reads and writes. At minimum with fixes for the cases here (numeric strings, whitespace, unknown/unimplemented VRs)

pieper commented 3 months ago

formatting that is applied during each value representation read would need to be extracted

Exactly - ideally there would be a parsing function for each VR that would be called in both places (reading and before writing).

dcmjs-org / dcmjs

Loss of precision when serializing Decimal String (DS) and Integer String (IS) #398