animetosho / Nyuu

Flexible usenet binary posting tool

nyuu accepts unicode in filename, and puts it unescaped in resulting NZB (file poster -> subject) ... which is illegal? #103

Closed sanderjo closed 1 year ago

sanderjo commented 1 year ago

Starting point: a valid filename containing Unicode:

-rw-rw-r-- 1 sander sander 6 feb 18 08:52 'Hi Kingdom 你好世界.txt'

Post it with nyuu:

$ nyuu -h upload.eweka.nl ... < credentials > ... -g alt.binaries.test -o my_cat.nzb -t Hi_Unicode   Hi\ Kingdom\ 你好世界.txt 

[INFO] Uploading 1 article(s) from 1 file(s) totalling 6 B
[INFO] Reading file Hi Kingdom 你好世界.txt...
[INFO] All file(s) read...
[INFO] Finished uploading 6 B in 00:00:00.474 (12.66 B/s). Network upload rate: 1255.81 B/s

So far so good.

But the NZB contains unescaped Unicode, which AFAIK is illegal. Furthermore, SAB does accept the NZB, but the download ends up with a strange filename.

$ cat my_cat.nzb 
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE nzb PUBLIC "-//newzBin//DTD NZB 1.1//EN" "http://www.newzbin.com/DTD/nzb/nzb-1.1.dtd">
<nzb xmlns="http://www.newzbin.com/DTD/2003/nzb">
    <file poster="blablamannetje &lt;blabla@example.com&gt;" date="1676706806" subject="Hi_Unicode &quot;Hi Kingdom 你好世界.txt&quot; yEnc (1/1) 6">
        <groups>
            <group>alt.binaries.test</group>
        </groups>
        <segments>
            <segment bytes="158" number="1">JgNaGfJoScMvUaLtShRnQeTe-1676706806181@nyuu</segment>
        </segments>
    </file>
</nzb>
sanderjo commented 1 year ago

https://en.wikipedia.org/wiki/Valid_characters_in_XML lists which Unicode characters are allowed, and I assume that means they must be escaped, like U+E000.

But the header of the NZB does say <?xml version="1.0" encoding="utf-8"?> ... so that makes me think UTF-8 is allowed ... ? Is the problem on SAB's side?

animetosho commented 1 year ago

Putting your XML document through a validator such as this says it's valid.
Unicode is valid in XML, and requires no escaping (though you can, of course, choose to). The utf-8 marker tells the XML parser to interpret the text as UTF-8, so it should be prepared to handle such characters.
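
A minimal Python sketch of this point, using only the standard library's xml.etree.ElementTree. The two documents below are cut-down, illustrative versions of the NZB above (the helper name subject_of is made up for this example): one carries the characters as raw UTF-8, the other as numeric character references. A conforming parser should hand back the same string for both.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <nzb> root element of the file above.
NS = "{http://www.newzbin.com/DTD/2003/nzb}"

# Illustrative, cut-down NZB with the filename as raw (unescaped) UTF-8.
raw = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<nzb xmlns="http://www.newzbin.com/DTD/2003/nzb">'
    '<file subject="Hi_Unicode &quot;Hi Kingdom 你好世界.txt&quot; yEnc (1/1) 6"/>'
    '</nzb>'
).encode("utf-8")

# Same document, but with the four characters written as numeric references.
escaped = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<nzb xmlns="http://www.newzbin.com/DTD/2003/nzb">'
    '<file subject="Hi_Unicode &quot;Hi Kingdom '
    '&#x4F60;&#x597D;&#x4E16;&#x754C;.txt&quot; yEnc (1/1) 6"/>'
    '</nzb>'
).encode("utf-8")


def subject_of(nzb_bytes):
    """Return the subject attribute of the first <file> element."""
    root = ET.fromstring(nzb_bytes)
    return root.find(NS + "file").get("subject")


print(subject_of(raw) == subject_of(escaped))
```

Both forms parse without error here, so escaping is a choice rather than a requirement.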

I'd imagine most XML parsers should handle the character encoding part without issue. What resulting filename do you actually get?

sanderjo commented 1 year ago

SAB result:

sander@X501A1:~/Downloads/complete/chinese_letters_in_post$ ll
total 12
drwxrwxr-x  2 sander sander 4096 feb 18 08:58  ./
drwxrwxr-x 30 sander sander 4096 feb 18 09:04  ../
-rw-rw-r--  1 sander sander    6 feb 18 08:58 'Hi Kingdom ä½ å¥½ä¸'$'\302\226''ç'$'\302\225\302\214''.txt'
sander@X501A1:~/Downloads/complete/chinese_letters_in_post$ ls | hd
00000000  48 69 20 4b 69 6e 67 64  6f 6d 20 c3 a4 c2 bd c2  |Hi Kingdom .....|
00000010  a0 c3 a5 c2 a5 c2 bd c3  a4 c2 b8 c2 96 c3 a7 c2  |................|
00000020  95 c2 8c 2e 74 78 74 0a                           |....txt.|
00000028

So the four Chinese characters turned into 24 octets of UTF-8, i.e. 6 octets per character ...

For reference: manual creation goes OK, of course:

sander@X501A1:~/Downloads/complete/chinese_letters_in_post$ date > "Manually 你好世界.txt"
sander@X501A1:~/Downloads/complete/chinese_letters_in_post$ ll
total 16
drwxrwxr-x  2 sander sander 4096 feb 18 10:14  ./
drwxrwxr-x 30 sander sander 4096 feb 18 09:04  ../
-rw-rw-r--  1 sander sander    6 feb 18 08:58 'Hi Kingdom ä½ å¥½ä¸'$'\302\226''ç'$'\302\225\302\214''.txt'
-rw-rw-r--  1 sander sander   28 feb 18 10:14 'Manually 你好世界.txt'
sander@X501A1:~/Downloads/complete/chinese_letters_in_post$ ls Manually\ 你好世界.txt  | hd
00000000  4d 61 6e 75 61 6c 6c 79  20 e4 bd a0 e5 a5 bd e4  |Manually .......|
00000010  b8 96 e7 95 8c 2e 74 78  74 0a                    |......txt.|
0000001a

Four Chinese characters are 12 octets, so 3 octets per Chinese character, which is correct: for example, the UTF-8 encoding of 你 is 0xE4 0xBD 0xA0, so 3 octets.
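
The same count can be checked in a few lines of Python (a small sketch, not part of the original report; the variable names are illustrative and bytes.hex with a separator needs Python 3.8 or later):

```python
name = "你好世界"

# Encode the four characters as UTF-8 and inspect the resulting bytes.
utf8 = name.encode("utf-8")

print(len(name), len(utf8))        # 4 12
print(utf8.hex(" "))               # e4 bd a0 e5 a5 bd e4 b8 96 e7 95 8c
print("你".encode("utf-8").hex(" "))  # e4 bd a0
```

The byte sequence matches the hex dump of the manually created file above.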

sanderjo commented 1 year ago

So I will check what happens on SAB's side. This is the kind of problem that used to happen with Python 2 ... switching between ASCII, bytes and UTF-8.

Thanks, @animetosho

animetosho commented 1 year ago

If you take the UTF-8 representation of 你 and assume that it's actually latin1/ISO8859-1, you get "ä½ ".
If you then UTF-8 encode that, you get C3 A4 C2 BD C2 A0, which seems to match what you get.

My guess is that there's some bad UTF-8 -> latin1 casting going on (i.e. assuming UTF-8 text is actually latin1).
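
A short Python sketch of that guess (it only reproduces the byte pattern; where such a conversion might happen inside SAB is not established here, and the variable names are illustrative):

```python
original = "你好世界"

# Step 1: take the UTF-8 bytes and wrongly interpret them as latin1.
mangled = original.encode("utf-8").decode("latin-1")

# Step 2: UTF-8 encode the misinterpreted text, as a filesystem write would.
mangled_bytes = mangled.encode("utf-8")

print(len(mangled_bytes))      # 24
print(mangled_bytes.hex(" "))
# c3 a4 c2 bd c2 a0 c3 a5 c2 a5 c2 bd c3 a4 c2 b8 c2 96 c3 a7 c2 95 c2 8c

# Reversing both steps recovers the original text, which confirms that the
# damage is exactly one extra latin1 -> UTF-8 round of encoding.
repaired = mangled_bytes.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired == original)    # True
```

The 24 bytes printed are the same ones that appear between "Hi Kingdom " and ".txt" in the hex dump of the SAB download, i.e. 6 octets per character.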