TOPdesk / dart-junitreport

An application to generate JUnit XML reports from dart test runs.
https://pub.dartlang.org/packages/junitreport
MIT License

fix: :bug: Remove bad unicode for XML #23

Closed. Tokenyet closed this 2 years ago

Tokenyet commented 3 years ago

This PR fixes the following potential problem in the generated XML.

    <testcase classname="D:.Project.McDedicatedServer.mc_dedicated_server_go.test.app_tests.repository.ngrok_repository" name="Ngrok User start ngrok service." time="0.011">
      <system-out></system-out>
    </testcase>

There is a special Unicode character inside <system-out></system-out>. Some editors display it as a single space, but it is actually 0x01, SOH (Start of Heading). I have no idea why the test output contains this character, but the following post helped me fix the issue in this PR.
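Because the character is invisible in many editors, it can help to locate it programmatically. The following is a minimal Python sketch, not part of this PR or of the Dart code base; the function name `find_control_chars` and the sample string are hypothetical and only illustrate the idea.

```python
def find_control_chars(text):
    """Return (offset, code point) for each C0 control or DEL in text.

    Newline (0x0A) is skipped because it is expected in test output.
    """
    hits = []
    for offset, ch in enumerate(text):
        code = ord(ch)
        if (code < 0x20 or code == 0x7F) and ch != "\n":
            hits.append((offset, f"U+{code:04X}"))
    return hits


# Hypothetical sample: a system-out element containing SOH (0x01).
sample = "<system-out>\x01</system-out>"
for offset, code_point in find_control_chars(sample):
    print(offset, code_point)
```

Running this against the content of a report file shows where each hidden control character sits, which makes the problem easier to confirm than inspecting the file in an editor.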

UTF-8? Ah, welcome back to the 21st century. If you have a UTF-8 encoded string, then the /u modifier can be used on the regex

    $string = preg_replace('/[\x00-\x1F\x7F]/u', '', $string);

This just removes 0-31 and 127. This works in ASCII and UTF-8 because both share the same control set range (as noted by mgutt below). Strictly speaking, this would work without the /u modifier. But it makes life easier if you want to remove other chars...

If you're dealing with Unicode, there are potentially many non-printing elements, but let's consider a simple one: NO-BREAK SPACE (U+00A0)

In a UTF-8 string, this would be encoded as 0xC2A0. You could look for and remove that specific sequence, but with the /u modifier in place, you can simply add \xA0 to the character class:

    $string = preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $string);

In conclusion, if we don't handle this, the XML is reported as corrupted. I ran into this in my GitLab CI, so I made this PR. By the way, I deliberately kept \n in the output, since I think users like me would want to see line breaks directly 😄, and keeping it does not corrupt the XML.
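The approach described above (replace control characters but keep \n) can be sketched as follows. This is an illustrative Python version, not the Dart code from this PR; the name `replace_control_chars` and the default replacement of a single space are assumptions made for the example. Note that this character class also covers tab and carriage return.

```python
import re

# Assumed character class: C0 controls except LF (0x0A), plus DEL (0x7F).
_CONTROL_CHARS = re.compile(r"[\x00-\x09\x0B-\x1F\x7F]")


def replace_control_chars(text, replacement=" "):
    """Replace control characters in text, leaving newlines untouched."""
    return _CONTROL_CHARS.sub(replacement, text)


# Hypothetical test output containing SOH (0x01) and a newline.
cleaned = replace_control_chars("start\x01ed\nsecond line")
print(repr(cleaned))
```

Passing an empty string as `replacement` removes the characters instead of substituting a space, which matches the behaviour of the quoted `preg_replace` call.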

rspilker commented 2 years ago

I prefer that the xml package addresses this problem.

rspilker commented 2 years ago

Instead of replacing those control characters with a space, I used an XML Unicode escape so that the data is kept as is. See 09820c52c9c39c825567864b73d44402bc742b1b
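The escaping idea can be sketched like this. It is a hedged Python illustration only: the referenced commit is Dart and its exact output format is not shown in this thread, so the function name `escape_control_chars`, the hexadecimal `&#x…;` form, and the choice to leave tab, LF and CR unescaped are all assumptions.

```python
import re

# Assumed character class: C0 controls except tab, LF and CR, plus DEL.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]")


def escape_control_chars(text):
    """Replace each control character with a numeric character reference."""
    return _CONTROL_CHARS.sub(lambda m: f"&#x{ord(m.group()):X};", text)


# Hypothetical test output containing SOH (0x01).
print(escape_control_chars("start\x01ed"))
```

Unlike substitution with a space, this keeps the original code point recoverable from the report. Whether a consumer accepts such references depends on its parser: XML 1.0 does not allow character references to most C0 controls, while XML 1.1 does (except for U+0000).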