animetosho / Nyuu

Flexible usenet binary posting tool

Headers should be ASCII/latin1 only. #100

Closed (zkte closed this issue 1 year ago)

zkte commented 1 year ago

Usenet Indexers like binsearch, nzbindex, nzbking all assume headers are ASCII/latin1. Posting in UTF-8 results in garbage.

animetosho commented 1 year ago

Oops, looks like I forgot to expose the encoding option. Thanks for raising the issue!

There's now a --article-encoding option which can be set like --article-encoding latin1.

The default is still utf8, as RFC 3977 defines NNTP to use UTF-8. The original NNTP RFC (RFC 977) states 7-bit ASCII encoding, i.e. latin1 was never a valid encoding for NNTP.

thezoggy commented 1 year ago

is there any way for an indexer/app that generates or has the nzb to know what article-encoding option was used? stuff I see that people posted with this has the original unicode titles from torrents mangled by the time it ends up in the nzb / indexed

animetosho commented 1 year ago

I don't think there's any way unfortunately. The encoding specified in the NZB's XML tag refers to the NZB's encoding itself, not what's used for the article headers. If the uploader generates the NZB, the data there should be accurate, and I'd imagine that downloader applications would prefer the NZB info.

However, indexers don't really have a way to know what encoding was used (unless the uploader-generated NZB was handed to them), so NZBs generated by indexers could be suspect. According to RFC 3977, indexers should be assuming UTF-8, so there should be no ambiguity, but as OP points out, this doesn't always match reality.
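To illustrate why the NZB's XML declaration says nothing about the article headers, here is a minimal Python sketch. The stripped-down NZB (a single `file` element with no groups or segments), the poster address and the subject text are all made up for illustration. Both documents below are valid UTF-8 XML, yet only one carries the subject the uploader intended.

```python
import xml.etree.ElementTree as ET

NZB_NS = "{http://www.newzbin.com/DTD/2003/nzb}"


def build_nzb(subject: str) -> bytes:
    """Build a stripped-down, hypothetical NZB holding one subject."""
    doc = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<nzb xmlns="http://www.newzbin.com/DTD/2003/nzb">'
        f'<file poster="uploader@example.invalid" date="0" subject="{subject}"/>'
        "</nzb>"
    )
    # The declared encoding describes these bytes, i.e. the NZB file itself.
    return doc.encode("utf-8")


def read_subject(nzb: bytes) -> str:
    """Parse the NZB and return the subject attribute of its first file."""
    root = ET.fromstring(nzb)
    return root.find(NZB_NS + "file").get("subject")


intended = "Café"
# Subject as an indexer would see it after reading UTF-8 header bytes as latin1.
misread = intended.encode("utf-8").decode("latin-1")

uploader_nzb = build_nzb(intended)  # generated by the uploader
indexer_nzb = build_nzb(misread)    # generated from a misread header

print(read_subject(uploader_nzb))
print(read_subject(indexer_nzb))
```

Both files parse without error, so a downloader cannot tell from the XML alone whether the subject was decoded correctly before it was written.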

thezoggy commented 1 year ago

yeah, indexer is set to do everything in utf8. but trying to figure out if it's the poster or the indexer. per the poster name it's a bot, so figure it could just be a bad script/setting on that side that maybe is unknown. not normal content i look at, so i don't have historical info to know if it's a trend or what.

animetosho commented 1 year ago

Are you referring to files posted in the wild by Nyuu?
If so, keep in mind that this option was only added 2 weeks ago, isn't in any released version, and requires someone to know about the flag and set it that way. So if the files you're seeing were posted more than 2 weeks ago, it's not due to this flag.

Otherwise I can't really speculate what someone may be doing. It could be the fault of Nyuu, or it could be a fault elsewhere. If you can replicate the issue and tell me how you did it, I can investigate on Nyuu's side, otherwise your best bet is to ask whoever manages the bot to investigate.

thezoggy commented 1 year ago

Yes, something like: example or another indexer which has a bit of tweaks to base indexer software, example1

which goes with OP comment, as it appears on nzbindex like: example2

animetosho commented 1 year ago

For that one, the post is correctly using UTF-8, but as zkte says, those indexers are incorrectly interpreting it as latin1.
Nothing can be done about it on the poster or Nyuu's side - the only people who can do anything about it are the indexers.
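As a rough illustration of that misinterpretation, the following Python sketch encodes a made-up subject as UTF-8 and then decodes the bytes the way a latin1-assuming indexer would. The subject string is hypothetical; the effect is the same for any non-ASCII text.

```python
# Hypothetical subject line; any non-ASCII text behaves the same way.
subject = "Café"

# Bytes a UTF-8 poster puts on the wire: 'é' becomes the two bytes C3 A9.
wire_bytes = subject.encode("utf-8")

# An indexer assuming latin1 turns every byte into one character.
misread = wire_bytes.decode("latin-1")

# Reading the same bytes as UTF-8 recovers the original text.
correct = wire_bytes.decode("utf-8")

# The damage is reversible while the mangled string is intact,
# because latin1 maps each byte to exactly one code point.
repaired = misread.encode("latin-1").decode("utf-8")

print(misread)   # CafÃ©
print(correct)   # Café
print(repaired)  # Café
```

The posted bytes are identical in both cases; only the decoding step on the reader's side differs.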

NZBKing seems to do the correct thing with that post (though I have seen cases where it somewhat fails with characters outside the BMP).

thezoggy commented 1 year ago

i see other stuff on the site with unicode just fine, just not stuff from this bot, so figured it was the bot/app posting. example

maybe the indexer is doing some de-obfuscation method which is causing the encoding mangling. i'll pass this thread along so they can review

animetosho commented 1 year ago

Interesting. I'm guessing that indexer is getting the name from somewhere else (e.g. a submitted NZB).

If you want to give me more details, I can try looking into what they're doing.

thezoggy commented 1 year ago

To rule out indexers, I looked at the groups directly via newsbin, since I can't use a raw search site. I can confirm that the post from nyuu does look fine there.

so it really must be the indexer not handling it for whatever reason. while the indexer can do utf8 and some things are fine, something is going on where these releases are generally not handled. I relayed this post to nzedb/nn so they can look at the backend software and check whether the indexing software is the cause, or at least narrow down why a given indexer isn't handling it.

animetosho commented 1 year ago

Thanks.

Do you have a NZB (or similar) of the 'xpost' posts that look fine on the indexer?

thezoggy commented 1 year ago

they are the wtfnzb ones. you can get it from .su directly or i can share it with ya (once api limits reset, as i don't use this site usually). not sure if you're on any of the discords i'm on or if you have an email you want me to send it to

animetosho commented 1 year ago

I don't know how the title text is obtained (don't have the NZB), but I see that the files have the same problem, suggesting that the Subject is interpreted as latin1.

If you want to give the NZB and don't want to post it here, you can send a link here.

thezoggy commented 1 year ago

worked with nn team and believe we figured out what is going on.

the assumption is that things are utf8 when sometimes they aren't. the tentative fix is to test for utf8 first before trying to convert the encoding to utf8 to store in the db. still digging into some other stuff, but hopefully this should make things better for all nn indexers that eventually get the update. nzedb would then most likely apply a similar fix if all looks good
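The "test for utf8 first" idea can be sketched in Python roughly as follows. This is not the actual nn patch; `header_to_text` and its latin1 fallback are assumptions made for illustration.

```python
def header_to_text(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode raw header bytes, preferring UTF-8 (hypothetical helper)."""
    try:
        # Strict decoding raises if the bytes are not well-formed UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Only fall back when the UTF-8 test fails. latin1 never fails,
        # since every byte value maps to a code point.
        return raw.decode(fallback)


print(header_to_text("Café".encode("utf-8")))    # Café
print(header_to_text("Café".encode("latin-1")))  # Café
print(header_to_text(b"plain ASCII subject"))    # plain ASCII subject
```

Checking for valid UTF-8 before converting avoids re-encoding text that is already UTF-8, which is what produces the mangled titles. A short latin1 string can occasionally also be valid UTF-8, so the check is a heuristic rather than a guarantee.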

animetosho commented 1 year ago

Glad that got solved then, thanks for doing all that!
(if it's a MySQL DB, hopefully they're doing utf8mb4 instead of utf8)
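For context, MySQL's `utf8` character set is an alias for `utf8mb3`, which stores at most three bytes per character and so cannot hold characters outside the BMP, while `utf8mb4` allows four. A small Python sketch of a check for text that would need `utf8mb4` (the helper name `needs_utf8mb4` is made up for illustration):

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character lies outside the BMP (above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


# BMP characters need at most 3 bytes in UTF-8.
print(needs_utf8mb4("日本語"), len("語".encode("utf-8")))  # False 3

# Emoji and other supplementary-plane characters need 4 bytes.
print(needs_utf8mb4("😀"), len("😀".encode("utf-8")))      # True 4
```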

thezoggy commented 1 year ago

they use utf8. i passed this along to nudge them in the mb4 direction: https://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci