Escape sequences decoding

Efferent-Health / HL7-V2

Lightweight HL7 V2 parser and composer for modern C# / .NET applications

MIT License

Escape sequences decoding #3

Closed piglione73 closed 3 months ago

piglione73 commented 3 months ago

Hi, the HL7Encoding.Decode function apparently misinterprets the "X" escape sequences.

I have a message that contains \X00E0\, which should be decoded as "à".

However, the Decode function is like this:

default:
    if (seq.StartsWith("X", StringComparison.Ordinal))
    {
        byte[] bytes = Enumerable.Range(0, seq.Length - 1)
            .Where(x => x % 2 == 0)
            .Select(x => Convert.ToByte(seq.Substring(x + 1, 2), 16))
            .ToArray();
        result.Append(Encoding.UTF8.GetString(bytes));
    }
    else
    {
        result.Append(seq);
    }
    break;

So it reads the hexadecimal digits two at a time and then decodes the resulting two bytes (0x00 and 0xE0) as UTF-8, which yields two wrong characters, the first of which is \0. The expected result is the single character "à".
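For anyone who wants to reproduce this in isolation, here is a minimal sketch that copies only the quoted branch into a standalone method. The class and method names (XEscapeRepro, DecodeAsQuoted) are made up for illustration and are not part of the library.

```csharp
using System;
using System.Linq;
using System.Text;

public static class XEscapeRepro
{
    // Hypothetical wrapper around the quoted branch; it is NOT the
    // library's full Decode method, only the "X" case for one sequence.
    public static string DecodeAsQuoted(string seq)
    {
        // For "X00E0" this yields the two bytes 0x00 and 0xE0.
        byte[] bytes = Enumerable.Range(0, seq.Length - 1)
            .Where(x => x % 2 == 0)
            .Select(x => Convert.ToByte(seq.Substring(x + 1, 2), 16))
            .ToArray();

        // 0x00 decodes to '\0'; a lone 0xE0 is an incomplete UTF-8 sequence,
        // so the decoder substitutes a replacement character for it.
        return Encoding.UTF8.GetString(bytes);
    }
}
```

Calling XEscapeRepro.DecodeAsQuoted("X00E0") returns a two-character string starting with \0 rather than "à".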

Shouldn't it decode the four hex digits as a single Unicode character, without involving UTF-8 at all?
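A rough sketch of what I mean, assuming every group of four hex digits is one UTF-16 code unit. The names HexEscapeSketch and DecodeHexEscape are hypothetical, and this is only an illustration of the idea, not a proposal for the exact code the library should use.

```csharp
using System;
using System.Text;

public static class HexEscapeSketch
{
    // Hypothetical helper, not part of the HL7-V2 library.
    // seq is the text between the escape characters, e.g. "X00E0".
    public static string DecodeHexEscape(string seq)
    {
        string hex = seq.Substring(1); // drop the leading "X"

        // Assumption: only whole groups of four digits are handled here;
        // anything else is returned unchanged.
        if (hex.Length == 0 || hex.Length % 4 != 0)
            return seq;

        var sb = new StringBuilder(hex.Length / 4);
        for (int i = 0; i < hex.Length; i += 4)
        {
            // Throws FormatException if the group is not valid hexadecimal.
            ushort unit = Convert.ToUInt16(hex.Substring(i, 4), 16);
            sb.Append((char)unit);
        }
        return sb.ToString();
    }
}
```

With this, HexEscapeSketch.DecodeHexEscape("X00E0") gives "à", and no byte-level encoding is involved.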

jaime-olivares commented 3 months ago

Hi @piglione73

Please send us a sample message for testing purposes.

piglione73 commented 3 months ago

MSH|^~\&|HELIOS|DEDALUS|||20240609213244||ADT^A01|HL7Gtw018FFE7D23AC00|P|2.5|||AL|AL|D|8859/1
EVN|A01|||RO||20240609213200
PID|1||5928948^^^X1V1_MPI^PI~1053251221^^^HELIOS^ANT~757HA514^^^SS^SS~WLMHLP81R56Z209U^^^NNITA^NNITA||ANONYMIZED ANONYMIZED^ANONYMIZED ANONYMIZED^^^^^^^^^^^^3||19811016|F|||V. ANTONIO BAZZINI 9&V. ANTONIO BAZZINI&9^^ANONYMIZED^^20125^^H^^015146~V. ANTONIO ANONYMIZED 9&V. ANTONIO BAZZINI&9^^ANONYMIZED^^12345^^L^^015146^030~^^SRI LANKA^^^^BDL^^999311||^PRN^PH^^^^^^^^^3279945913|||2||ANONYMIZED^MEF^NNITA|757HA514|||||||^^311^SRI LANKA (gi\X00E0\ CEYLON)||||N
PV1|1|I|XOSTPIO^^^ICHPIO^^^^^ANONYMIZED|4|P2024126713||ANONYMIZED^ANONYMIZED^TOMMASO||||1HB^602^D02|||3|||ANONYMIZED^ANONYMIZED^ANONYMIZED||G2024005887^^^PRZ^VN|1||||||||||||||||||||||||20240609213200

The affected part should read like this when decoded:

SRI LANKA (già CEYLON)
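As a quick standalone check of that expectation, here is a sketch that unescapes just this field text. It assumes the escape character is the backslash (as declared in MSH-2 of the sample) and that each four-digit group is one UTF-16 code unit. FieldUnescapeSketch and UnescapeHex are made-up names, and other escape sequences are ignored.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class FieldUnescapeSketch
{
    // Matches \Xhhhh...\ where the payload is one or more four-digit hex groups.
    private static readonly Regex HexEscape =
        new Regex(@"\\X((?:[0-9A-Fa-f]{4})+)\\");

    // Hypothetical helper: replaces only hex escapes and leaves the rest as-is.
    public static string UnescapeHex(string field)
    {
        return HexEscape.Replace(field, m =>
        {
            string hex = m.Groups[1].Value;
            var sb = new StringBuilder(hex.Length / 4);
            for (int i = 0; i < hex.Length; i += 4)
                sb.Append((char)Convert.ToUInt16(hex.Substring(i, 4), 16));
            return sb.ToString();
        });
    }
}
```

FieldUnescapeSketch.UnescapeHex(@"SRI LANKA (gi\X00E0\ CEYLON)") returns "SRI LANKA (già CEYLON)".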

jaime-olivares commented 3 months ago

Please try with the new version 3.0.3

piglione73 commented 2 months ago

Hi @jaime-olivares, I confirm it now works. Thanks!

patterson-philip commented 2 months ago

Thanks for your work on this issue! We ran into a problem yesterday where a few HL7 files could not be parsed, and it turned out to be caused by this encoding bug. Updating the library unblocked us.