Umlauts are encoded in the xml file

congressus / sepa-direct-debit

Create Direct Debit SEPA XML files

66 stars 35 forks source link

Umlauts are encoded in the xml file #14

Open binarious opened 10 years ago

binarious commented 10 years ago

I'm having a problem with the commits https://github.com/congressus/sepa-direct-debit/commit/54d0c427147c33d168c611204d2ed992df5af1e4 and https://github.com/congressus/sepa-direct-debit/commit/9ceabbf1100afd0da3e2a82c49ea49d609c7b114. They encode umlauts to html representations. Why is this function used? I assume to prevent XML injection, but that's covered by DOMDocument (or to be more accurate by libxml's xmlOutputBufferWriteEscape function which is used by libxml's xmlNodeDump which is used by DOMDocument's saveXML.) The htmlentities usages should be removed. I tested that against PHP 5.3 and 5.4 resulting in correctly written umlauts and escaped html/xml tags.

JelleM-Congressus commented 10 years ago

Had to dig into this a bit (we have just released our python version today, so we were kind of busy), but yes, I think removing this is a good idea. Also umlauts likely won't work, most (all?) banks restrict the character set to Latin-1 (even though the document is UTF-8):

a b c d e f g h i j k l m n op q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 ! # $ % * ~ / -? : ; = @ [ \ ] ^ _ { | } ( ) . , ‘ + Space

binarious commented 10 years ago

I think so, too. Do you have any reference for the character set restriction on the name values? I read that just the identifying elements (MessageId, PaymentInformationId, InstructionId and EndToEndId) need to be without special characters.

JelleM-Congressus commented 10 years ago

Well, there is a lot of documentation on SEPA (all banks have their own manual). Some saying ALL text should be Latin-1 some saying only the identifying elements. I DO know from experience that special characters are not allowed (even if they are htmlencoded) bij most banks and strongly urge you NOT to use them. Also read something about a german SUPA which does allow it but not quite sure on that one.

si4dev commented 8 years ago

This issue might be solves with the use of createTextNode() instead of htmlentities(). As currently with htmlentities it will encode all html advanced characters like umlauts. And with createTextNode it will only catch the XML exceptions and keep the document XML well formed. So it will result in less changes in the string and keep more of it's original.

I also understand the reaction that probably umlauts are not allowed or at least not for all banks. But it seems that for the issue owner binarious it worked without the htmlentities() conversion.

In another project I experienced myself that I couldn't add a string to the dom without replacing the XML reserved characters. Even with the second parameter on createElement(name, value) it will NOT do the XML escaping. It needs a second text node creation with createTextNode() to create a text node with escaping support.