Closed sanderjo closed 1 year ago
https://en.wikipedia.org/wiki/Valid_characters_in_XML lists which Unicode characters are allowed. And I assume it means escaping, like U+E000.
But the header of the NZB does say <?xml version="1.0" encoding="utf-8"?>
... so that makes me think UTF-8 is allowed ... ?
Is the problem on SAB-side?
Putting your XML document through a validator reports that it's valid.
Unicode is valid in XML, and requires no escaping (though you can, of course, choose to). The utf-8 marker tells the XML parser to interpret the text as UTF-8, so it should be prepared to handle such characters.
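To illustrate the point, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`. The document below is a hypothetical, stripped-down NZB-like snippet (the element and attribute names are for illustration only, not taken from the actual NZB in question). It shows that raw UTF-8 text and numeric character references parse to the same string:

```python
import xml.etree.ElementTree as ET

# Hypothetical NZB-like snippet with unescaped (raw) Unicode in an attribute.
raw_doc = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<nzb><file subject="Hi Kingdom 你好世界.txt"/></nzb>'
).encode("utf-8")  # feed bytes, as a parser would read them from disk

# The same snippet with the four characters written as character references.
escaped_doc = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<nzb><file subject="Hi Kingdom &#x4F60;&#x597D;&#x4E16;&#x754C;.txt"/></nzb>'
)

raw_subject = ET.fromstring(raw_doc).find("file").get("subject")
escaped_subject = ET.fromstring(escaped_doc).find("file").get("subject")

print(raw_subject)
print(raw_subject == escaped_subject)
```

Both forms are well-formed XML, so a conforming parser should hand back the same filename either way.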
I'd imagine most XML parsers should handle the character encoding part without issue. What resulting filename do you actually get?
SAB result:
sander@X501A1:~/Downloads/complete/chinese_letters_in_post$ ll
total 12
drwxrwxr-x 2 sander sander 4096 feb 18 08:58 ./
drwxrwxr-x 30 sander sander 4096 feb 18 09:04 ../
-rw-rw-r-- 1 sander sander 6 feb 18 08:58 'Hi Kingdom ä½ å¥½ä¸'$'\302\226''ç'$'\302\225\302\214''.txt'
sander@X501A1:~/Downloads/complete/chinese_letters_in_post$ ls | hd
00000000 48 69 20 4b 69 6e 67 64 6f 6d 20 c3 a4 c2 bd c2 |Hi Kingdom .....|
00000010 a0 c3 a5 c2 a5 c2 bd c3 a4 c2 b8 c2 96 c3 a7 c2 |................|
00000020 95 c2 8c 2e 74 78 74 0a |....txt.|
00000028
So the four Chinese UTF-8 characters turned into 24 octets, i.e. 6 octets per character ...
For reference: manual creation goes OK, of course:
sander@X501A1:~/Downloads/complete/chinese_letters_in_post$ date > "Manually 你好世界.txt"
sander@X501A1:~/Downloads/complete/chinese_letters_in_post$ ll
total 16
drwxrwxr-x 2 sander sander 4096 feb 18 10:14 ./
drwxrwxr-x 30 sander sander 4096 feb 18 09:04 ../
-rw-rw-r-- 1 sander sander 6 feb 18 08:58 'Hi Kingdom ä½ å¥½ä¸'$'\302\226''ç'$'\302\225\302\214''.txt'
-rw-rw-r-- 1 sander sander 28 feb 18 10:14 'Manually 你好世界.txt'
sander@X501A1:~/Downloads/complete/chinese_letters_in_post$ ls Manually\ 你好世界.txt | hd
00000000 4d 61 6e 75 61 6c 6c 79 20 e4 bd a0 e5 a5 bd e4 |Manually .......|
00000010 b8 96 e7 95 8c 2e 74 78 74 0a |......txt.|
0000001a
Four Chinese characters are 12 octets, so 3 octets per Chinese character. Which is correct; for example, the UTF-8 encoding of 你 is 0xE4 0xBD 0xA0, so 3 octets.
So I will check what happens on SAB's side. This is the kind of problem that used to happen with Python 2 ... switching between ASCII, bytes and UTF-8
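The octet counts can be checked quickly in Python 3. This is just a sketch for verification; the variable names are arbitrary:

```python
name = "你好世界"

# Print each character with its UTF-8 octets and the octet count.
for ch in name:
    encoded = ch.encode("utf-8")
    print(ch, " ".join(f"{b:02x}" for b in encoded), len(encoded))

total_octets = len(name.encode("utf-8"))
print(total_octets)
```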
Thanks, @animetosho
If you take the UTF-8 representation of 你 and assume that it's actually latin1/ISO8859-1, you get "ä½ ".
If you then UTF-8 encode that, you get C3 A4 C2 BD C2 A0, which seems to match what you get.
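The round trip described above can be reproduced in a few lines of Python 3. This is only a sketch of the suspected mechanism, not code taken from SAB:

```python
original = "你好世界"

utf8_bytes = original.encode("utf-8")       # the correct 12 octets
misread = utf8_bytes.decode("latin-1")      # wrongly treat each octet as one latin1 character
double_encoded = misread.encode("utf-8")    # encode that misread text as UTF-8 again

# Hex dump of the result, for comparison with the `ls | hd` output.
print(" ".join(f"{b:02x}" for b in double_encoded))
print(len(utf8_bytes), len(double_encoded))
```

Every octet of the original is 0x80 or above, so each one becomes a two-octet UTF-8 sequence, which is why the length doubles from 12 to 24.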
My guess is that there's some bad UTF-8 -> latin1 casting going on (i.e. assuming UTF-8 text is actually latin1).
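If that guess is right, the damage is reversible. Below is a hedged sketch of a helper that undoes one round of this mistake; `undo_latin1_mojibake` is a hypothetical name, not an existing SAB function, and the proper fix would be to decode the data as UTF-8 in the first place:

```python
def undo_latin1_mojibake(name: str) -> str:
    """Reverse one round of 'UTF-8 octets misread as latin1'.

    Returns the input unchanged when it does not look like such mojibake.
    """
    try:
        # Map each character back to its single octet, then decode as UTF-8.
        return name.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Characters outside latin1, or octets that are not valid UTF-8:
        # the name was probably fine to begin with.
        return name


broken = "你好世界".encode("utf-8").decode("latin-1")
print(undo_latin1_mojibake("Hi Kingdom " + broken + ".txt"))
```

Note that this is a heuristic: a legitimate name that happens to consist of latin1 characters forming a valid UTF-8 sequence would be rewritten as well.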
Start: valid filename with unicode:
-rw-rw-r-- 1 sander sander 6 feb 18 08:52 'Hi Kingdom 你好世界.txt'
Post it with nyuu
So far so good.
But: the NZB contains unescaped Unicode, which AFAIK is illegal. Furthermore, SAB does accept the NZB, but the download results in a strange filename.