image.Header(): UTF decoding error in JSON library

Lestropie commented 6 years ago

As reported on forum here.

Somehow a non-UTF8 byte is being exported by the mrinfo -json_all option. I'm guessing it's something in a comment field or the like; maybe a MGZ that was converted prior to #970, or something like that. Will know more with sample data.

civier commented 6 years ago

I think I experience a non-UTF8 in dcminfo -all with data collected at the Florey. I'm not familiar with the underlying code, but may it be related?

These are the problematic fields:

[DCM] 0008 0016 UI       26      104   SOPClassUID                            [ 1.2.840.10008.5.1.4.1.1.4^@ ]
[DCM] 0008 1140 SQ      306      656 > ReferencedImageSequence                
[DCM] FFFE E000 UN       94      664   - Item                                 
[DCM] 0008 1150 UI       26      672     ReferencedSOPClassUID                [ 1.2.840.10008.5.1.4.1.1.4^@ ]
[DCM] 0008 1155 UI       52      706     ReferencedSOPInstanceUID             [ 1.3.12.2.1107.5.2.19.45056.201712051133052344508765^@ ]
[DCM] FFFE E000 UN       94      766   - Item                                 
[DCM] 0008 1150 UI       26      774     ReferencedSOPClassUID                [ 1.2.840.10008.5.1.4.1.1.4^@ ]
[DCM] 0008 1155 UI       52      808     ReferencedSOPInstanceUID             [ 1.3.12.2.1107.5.2.19.45056.2017120511331296101608773 ]
[DCM] FFFE E000 UN       94      868   - Item                                 
[DCM] 0008 1150 UI       26      876     ReferencedSOPClassUID                [ 1.2.840.10008.5.1.4.1.1.4^@ ]
[DCM] 0008 1155 UI       52      910     ReferencedSOPInstanceUID             [ 1.3.12.2.1107.5.2.19.45056.2017120511331689997308777 ]

Lestropie commented 6 years ago

dcminfo -all only ever shows the result of handling by the C++ code; which admittedly there are sometimes fields stored within DICOM to indicate the use of alternative encodings, and in those situations our DICOM handling may not be ideal, but it's not something that's ever become sufficiently problematic for us to address.

The issue with mrinfo -json_all is specifically because it's a trick I developed to import header information from any image format supported by MRtrix3 into Python, without having to explicitly code support for all image formats in Python. But it relies on the Python json module to parse the resulting file, which is where (at least for Python 2.7) there appears to be a stumbling point. The fields you refer to there in the DICOM are not ever stored in the image header, and therefore should never be contained in the mrinfo -json_all output. But something is making it in there; my guess is it's a subject name with funky encoding that our C++ DICOM code is essentially importing byte-wise and then dumping into the JSON file.

civier commented 6 years ago

For me, the "dcminfo -all" problem was quite probelmatic because I fed it into grep (which didn't work as expected), but I figured out I can use "grep -a" to solve it. So as long as it doesn't bother others, I'm ok. I'll put myself a note to clarify in the dcminfo documentation that it won't handle alternative encodings and non-ASCII output is to be expected sometimes. Worse is better.

Lestropie commented 6 years ago

If it's causing problems for you, it's worth adding to the issue list; but it should be listed as a separate GitHub issue, since the required modifications are quite separate from the issue originally listed here.

We have discussed DICOM encoding either here or on the forum, but I don't seem to be able to find it right now. But previously when we've discussed this, it's been regarding subject names with e.g. accent characters; whereas the example you've provided there is very much not that. It would be worth double-checking that there isn't anything particularly unusual about the encoding of those specific fields that our code isn't aware of, that there isn't any kind of buffer overrun occurring, and that your data aren't just plain corrupt; whatever it is, we need to confirm the source of the issue before doing anything about it.

jdtournier commented 6 years ago

@lestropie, I think the issues you were looking for were #1296 and #953.

I think we have a potential solution for non-ASCII encodings (which is most likely to be the windows cp1252) within the DICOM handling - but that's unlikely to be the issue reported here with dcminfo -all: only the last character looks badly interpreted in those fields, and there's no reason for these UID fields to contain non-ASCII characters. I often find non-ASCII in my dcminfo -all output, and resort to grep -a for the same reason...

The reason for that last mangled character is not entirely clear to me, but it's hard to rule out a slightly buggy DICOM encoder here. This likely relates to how DICOM entries are stored: the number of bytes for each DICOM field is explicitly provided in the stream (for example, the first one has length 26 bytes - 5th column in the output). But the strings stored have odd lengths (I count 25 characters for the first one). The DICOM standard used to state that fields should have even length, presumably to ensure alignment at 16 bit boundaries (although I've had to relax our DICOM checks on that since I've encountered data that doesn't honour that - pretty sure it was Siemens too). So the question is what should be stored in that last byte? If that was a null character, I don't think that would cause any trouble, since that's the end of string delimiter anyway. But it looks like Siemens (used to) store some other value in there. I'll have a look at some of the data I have from the Florey Trio, see if there's a simple fix...

This won't solve the json issues though...

jdtournier commented 6 years ago

OK, turns out those dangling characters were indeed null... I've pushed a fix in #1349, which seems to sort out the issue - no more Binary file (standard input) matches when feeding dcminfo -all output to grep...

MRtrix3 / mrtrix3

image.Header(): UTF decoding error in JSON library #1347