Closed: craigberry1 closed this 3 months ago
There's been discussion of this in the past. See: https://github.com/dcmjs-org/dcmjs/issues/53, https://github.com/dcmjs-org/dcmjs/pull/97, https://github.com/dcmjs-org/dcmjs/pull/99
Issues haven't seemed to come up in practice, but if you find real use cases where the lost data has an impact, then it would make sense to implement a lossless round-trip strategy. A workable solution would be to save the string representation in a local spot like `_originalVR` and use it for writing if the `Number` version of the data hasn't changed since reading. I think it's more 'natural' to have the values as `Number`s for processing, but keeping them lossless for IO certainly makes sense.
It would be interesting to see if pydicom or other tools handle this.
It would be interesting to see if pydicom or other tools handle this.
I've done some testing with dcm4che and it appears it follows a "lossless" approach. It may apply some opinionated formatting when explicitly writing/overwriting tag values into its dataset, but it will leave any existing unaltered values as-is.
My understanding is it does this by maintaining each data element value in its original form as a `ByteArray` and simply concatenating that onto the output stream at write time. It provides an interface (`getString(int tag)`, `getDouble(int tag)`, etc.) for rendering tags for processing, but it's up to the user of the library to decide if/when to invoke this API.
I am interested in implementing some form of lossless round-trip parsing in dcmjs. I am working on an anonymizer and ran into this issue in real files where precision from decimal strings was being lost (specifically numbers like `1.000` or `0.000`), but in general the wider issue to me is that tags I did not anonymize are being altered at all.
Do you have any guidance on implementing this in such a way that it would be accepted as a contribution? My thinking was to add read/write flags that would opt out of any formatting applied past the initial serialization from the data view.
I could add an `_originalVR` property as you mentioned, though there would need to be state/logic that determines at write time whether the value has actually been changed.
BTW, it's a smaller issue, but one other manifestation of this that I have seen is around trimming whitespace. It appears many VRs run either a `trim()` or `rtrim()` post-processing step on parsed strings, even in cases where the spec calls out treating whitespace as significant. So I would want to handle this case as well.
One more use case I want to highlight that is similar but might warrant a separate issue/discussion.
I followed the recent change to implement the `ParsedUnknownValue` VR in https://github.com/dcmjs-org/dcmjs/pull/379, which will read an unknown `vrType` if the meta dictionary is aware of the tag.
There is a similar "error" here where reading/writing a DICOM file will change an unknown/unimplemented VR into `UN` when written back to the file.
There are a few VRs that I don't believe have been implemented yet (`OV`, `OL`, `SV`, `UV`), but I thought that as long as the meta dictionary contained the tags themselves, the data element (tag, vr, length, value) would be unchanged after read/write.
So I am thinking about ways to preserve the `vrType` on write even if it is written using the `UN` implementation.
I like the idea of being lossless wherever possible, and supporting all valid VRs.
I don't think we'd need to track the modified state of any elements, just keep the `_originalVR` along with the `Number` version of it, and then when writing, if parsing the `_originalVR` gives the same result as the `Number`, just write out the `_originalVR`. This wouldn't add much overhead in space or time.
Regarding whitespace, I guess the same could be done. I think trailing spaces are only added when the value needs to be an even length, but again, in the spirit of losslessness, we could keep track of when spaces are stripped and write out the original if trimming it results in the same string as the exposed value. In both cases I'd try to avoid changing the current API behavior.
Regarding `UN`, that may be handled in part by the `_vrMap` field at the top level of the datasets, or maybe it needs to be extended; I haven't looked in detail for a while.
The problem in my mind is that, due to the formatting applied when converting to `Number`, we would always have a difference on the `_originalVR` (assuming this represents the original raw ASCII string) in the cases I listed, regardless of whether the value was actually manipulated or not.
So a file with value:

```
(0x0018,0x1190) DS Focal Spot(s) VR=<DS> VL=<0x0006> <00.000>
```

Would be read as:

```js
{
  "meta": {
    // ...
  },
  "dict": {
    "00181190": {
      "vr": "DS",
      "Value": [0],
      "_originalVR": ["00.000"]
    }
  }
}
```
What I meant was that if you run `Number(0)` and it's the same as `Number("00.000")`, then you would deduce that it hadn't been changed externally and would write out `"00.000"`. I think this would work the same in the other cases you listed and wouldn't result in losing any intentional changes in the values.
I see what you mean. The formatting that is applied during each value representation read would need to be extracted so it could be re-applied onto `_originalVR` and then compared with `Value`. I will look into this and aim to put out a PR for lossless reads and writes, at minimum with fixes for the cases here (numeric strings, whitespace, unknown/unimplemented VRs).
> formatting that is applied during each value representation read would need to be extracted
Exactly - ideally there would be a parsing function for each VR that would be called in both places (reading and before writing).
When reading a DICOM file, data elements with VR `IS` or `DS` are currently serialized by reading them as ASCII strings, applying custom formatting, and then converting to `Number`. Examples:

- `1.0000` drops padded zeroes and yields `1`
- `+1.234` drops the `+` sign
- `9007199254740993` is read as `9007199254740992` due to floating point precision
- `456 ` (with padding whitespace) drops the whitespace and yields `456`
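These behaviors follow directly from JavaScript's `Number` semantics, as a quick demonstration in plain JavaScript shows:

```javascript
// Converting DICOM DS/IS strings through Number is lossy:
console.log(String(Number("1.0000"))); // "1": padded zeroes dropped
console.log(String(Number("+1.234"))); // "1.234": "+" sign dropped
console.log(Number("9007199254740993")); // 9007199254740992: above 2^53
console.log(Number.MAX_SAFE_INTEGER); // 9007199254740991
console.log(" 456 ".trim()); // "456": padding whitespace dropped
```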
My read from the spec is that these are all considered valid values:
This leads to fundamental changes in the data element value by simply reading and then writing a DICOM file.
Example
When reading in a file containing the data elements:
A read & write of it using dcmjs with no other modifications yields:
Assuming that these are in fact valid values according to the spec, why does the library try to apply custom formatting to the value, like dropping whitespace and the `+` sign, or bother converting to a `Number`, which is subject to floating point rounding issues? Why not leave the value as a String and let clients decide when/how to parse it as a number?