coop-care / paid

PAID is a library for care billing with payers in Germany according to § 105 SGB XI and § 302 SGB V. The project name is an acronym that stands for "Pflegeabrechnung in Deutschland" (care billing in Germany).
GNU Lesser General Public License v3.0

Transcoding between UTF-8 and ISO-8859-1, DIN 66003 DRV, DIN 66303 (version of 1986-11) #12

michaelkamphausen closed 2 years ago

michaelkamphausen commented 3 years ago

UNOC specifies that all characters in the interchange are encoded in ISO 8859-1 (i.e. not UTF-8). This could cause problems with special characters. For example, it might already be enough for someone to have an "Ä" in their name. I don't know whether there is a simple way to transcode strings in JS.

Copied from the documentation:

| Syntax identifier | ISO standard | Languages |
|---|---|---|
| UNOA | 646 | |
| UNOB | 646 | |
| UNOC | 8859-1 | Danish, Dutch, English, Faeroese, Finnish, French, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish, Swedish |
| UNOD | 8859-2 | Albanian, Czech, English, Hungarian, Polish, Romanian, Serbo-Croatian, Slovak, Slovene |
| UNOE | 8859-5 | Bulgarian, Byelorussian, English, Macedonian, Russian, Serbo-Croatian, Ukrainian |
| UNOF | 8859-7 | Greek |

Originally posted by @westnordost in https://github.com/coop-care/paid/pull/11#r611511306

michaelkamphausen commented 3 years ago

Interesting finding. Transcoding could be possible using the TextEncoder / TextDecoder API.

I suppose transcoding from ISO-8859-1 to UTF-8 is not a problem, but the other direction certainly is. What will happen to special Turkish or Polish characters in names? We need to test that. I suggest we open an issue and come back to this later.
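One caveat about the API mentioned above: `TextEncoder` only ever produces UTF-8, so it cannot write ISO-8859-1 by itself. Because the first 256 Unicode code points coincide with ISO-8859-1, a manual conversion is short. The following is a minimal sketch, not code from PAID; the function names `encodeLatin1` and `decodeLatin1` are hypothetical, and replacing unencodable characters with "?" is an assumption made only for illustration.

```typescript
// Hypothetical helpers for illustration; not part of the PAID API.

/** Encode a string as ISO-8859-1 bytes and report characters that do not fit. */
function encodeLatin1(text: string): { bytes: Uint8Array; unencodable: string[] } {
  const chars = Array.from(text); // split by code point, not by UTF-16 unit
  const bytes = new Uint8Array(chars.length);
  const unencodable: string[] = [];
  chars.forEach((char, i) => {
    const codePoint = char.codePointAt(0)!;
    if (codePoint <= 0xff) {
      bytes[i] = codePoint; // U+0000..U+00FF map 1:1 onto ISO-8859-1
    } else {
      bytes[i] = 0x3f; // "?" as a placeholder (assumption)
      unencodable.push(char);
    }
  });
  return { bytes, unencodable };
}

/** Decode ISO-8859-1 bytes back into a string. */
function decodeLatin1(bytes: Uint8Array): string {
  return Array.from(bytes, (byte) => String.fromCharCode(byte)).join("");
}

const german = encodeLatin1("Ärger");   // every character fits
const polish = encodeLatin1("Łukasz");  // "Ł" is reported as unencodable
```

The `unencodable` list makes the open question concrete: German umlauts survive, while letters such as the Polish "Ł" are outside ISO-8859-1 and need separate handling.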

westnordost commented 3 years ago

So, the bills need to be transferred in the charset as specified by the data acceptance office (Datenannahmestelle).

The Datenannahmestellen have the following options:

- ISO-8859-1
- DIN 66003 DRV
- DIN 66303 (version of 1986-11)

See Anlage 15 for the description and char code tables.

So, we need to implement transcoding from UTF-8 to these three options.

westnordost commented 3 years ago

We asked the GKV-Spitzenverband what to do with names that use characters not in those tables. The answer was that in these cases, the names must be transliterated.

"In the cases you describe, the names must be transliterated."
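To illustrate what transliterating a Latin-script name could look like in JS/TS, here is a minimal sketch. It assumes the target charset is ISO-8859-1, keeps every character that already fits, and otherwise strips diacritics via Unicode decomposition. The function name `transliterateToLatin1` and the small `SPECIAL_CASES` table are hypothetical; which replacements the payers actually expect is not stated in this thread.

```typescript
// Hypothetical mapping for letters without a canonical decomposition (assumption).
const SPECIAL_CASES = new Map<string, string>([
  ["Ł", "L"], ["ł", "l"], // Polish
  ["ı", "i"],             // Turkish dotless i
  ["Đ", "D"], ["đ", "d"],
]);

function transliterateToLatin1(name: string): string {
  let result = "";
  for (const char of name.normalize("NFC")) {
    const special = SPECIAL_CASES.get(char);
    if (char.codePointAt(0)! <= 0xff) {
      result += char; // already representable, e.g. "ü" or "ß"
    } else if (special !== undefined) {
      result += special;
    } else {
      // Decompose into base letter plus combining marks, then drop the marks.
      result += char.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
    }
  }
  return result;
}

const polishName = transliterateToLatin1("Wałęsa");
const turkishName = transliterateToLatin1("Şahin");
```

Characters from non-Latin scripts pass through this sketch unchanged, so they would still have to be detected and handled separately.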

I don't think we need to go so far as to transliterate names from non-Latin script. I see a two-step process here:

westnordost commented 3 years ago

ISO 7-bit, code according to DIN 66003 DRV (Deutsche Referenzversion, the German reference version)

The bad news: it unfortunately really is 7-bit. There is no such thing as a Uint7Array in JavaScript. I am not sure if it is possible to implement it oneself; I think not.
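For what it is worth, 7-bit codes are commonly stored one character per byte with the high bit left at zero, so a `Uint8Array` may be sufficient. The sketch below assumes that this is what the Datenannahmestelle expects, which the thread does not confirm. The replacement table reflects the German ISO 646 variant as I understand it and should be checked against the code tables in Anlage 15; the function name `encodeDin66003` is hypothetical.

```typescript
// National replacement characters of DIN 66003 (to be verified against Anlage 15).
const DIN_66003_OVERRIDES = new Map<string, number>([
  ["§", 0x40], ["Ä", 0x5b], ["Ö", 0x5c], ["Ü", 0x5d],
  ["ä", 0x7b], ["ö", 0x7c], ["ü", 0x7d], ["ß", 0x7e],
]);

// ASCII characters whose code positions are taken by the replacements above.
const DISPLACED_ASCII = new Set(["@", "[", "\\", "]", "{", "|", "}", "~"]);

/** Encode text as DIN 66003, one 7-bit code per byte (high bit always 0). */
function encodeDin66003(text: string): Uint8Array {
  const chars = Array.from(text);
  const bytes = new Uint8Array(chars.length);
  chars.forEach((char, i) => {
    const override = DIN_66003_OVERRIDES.get(char);
    const codePoint = char.codePointAt(0)!;
    if (override !== undefined) {
      bytes[i] = override;
    } else if (codePoint <= 0x7f && !DISPLACED_ASCII.has(char)) {
      bytes[i] = codePoint; // unchanged ASCII position
    } else {
      throw new RangeError(`Character "${char}" is not representable in DIN 66003`);
    }
  });
  return bytes;
}

const encodedName = encodeDin66003("Jäger");
```

Throwing on unrepresentable characters is a design choice for this sketch; a real implementation might transliterate first and only then encode.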

The good news: It seems that Medent (the only Datenannahmestelle which uses that charset) is not linked by any Kostenträger. In other words, no one uses the services of that particular Datenannahmestelle anymore.

So, it looks like the only charset that needs to be supported right now is:

DIN 66303, version of 1986-11 (German reference version of the 8-bit code, DRV8)

michaelkamphausen commented 3 years ago

Thanks for your research! I find this very helpful.