SNH48Live / KVM48

The Koudai48 VOD Manager
MIT License

codepage error #3

Closed. xyleey closed this issue 5 years ago.

xyleey commented 5 years ago

While kvm48 merges m3u8 URLs into the m3u8.txt file, the Chinese characters in the file names are in codepage windows-1254 instead of UTF-8. This causes an error when I feed m3u8.txt to caterpillar. Debug information is attached as a screenshot. Besides, when I create a new .txt file and type in those Chinese characters manually in UTF-8, the file works really well in caterpillar batch mode. Therefore, I guess this is an output error in kvm48. I would very much appreciate some professional help with this issue. Thanks!
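To illustrate the symptom described above, here is a minimal sketch of what happens when UTF-8 bytes are read as windows-1254 (`cp1254` in Python). The sample name is only an illustration, and the snippet does not claim to reproduce what kvm48 or caterpillar do internally.

```python
# Hypothetical sample: a file name containing Chinese characters.
name = "张语格"

# What a UTF-8 writer puts on disk.
utf8_bytes = name.encode("utf-8")

# What a reader sees if it wrongly assumes windows-1254 (cp1254).
garbled = utf8_bytes.decode("cp1254")

# The damage is reversible as long as no bytes were dropped along the way.
recovered = garbled.encode("cp1254").decode("utf-8")

print(repr(garbled))       # mojibake: one character per original byte
print(recovered == name)
```

If this is what is going on, the bytes on disk may be fine and only the reader's choice of encoding is wrong.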

zmwangx commented 5 years ago

Okay, apparently my decision to add chardet to caterpillar turned out to be a dumpster fire. I've used uchardet to great success in the past, but I guess for short strings it's almost as good as a PRNG. I realized this as I released the feature, to which end I quote:

Add encoding detection for batch manifest (with varying degrees of success due to sample size, better just use UTF-8).

In other words, it fucking, doesn't, work.

So I'm following my own advice now: just use UTF-8. v0.1.5 does just that. Hopefully it addresses the situation.
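As a rough sketch of what "just use UTF-8" can look like on the writing side (not KVM48's actual code), the key point is passing `encoding="utf-8"` explicitly instead of relying on the platform default. The helper name `write_manifest`, the tab-separated `url<TAB>filename` layout, and the example.com URL are assumptions made for illustration.

```python
import os
import tempfile


def write_manifest(path, entries):
    """Write one 'url<TAB>filename' line per entry, always as UTF-8.

    Hypothetical helper; the line layout is an assumption.
    """
    # An explicit encoding keeps the output independent of the Windows code page.
    with open(path, "w", encoding="utf-8", newline="\n") as fp:
        for url, filename in entries:
            fp.write(f"{url}\t{filename}\n")


entries = [("https://example.com/a.m3u8", "20190101 张语格.mp4")]

with tempfile.TemporaryDirectory() as tmpdir:
    manifest_path = os.path.join(tmpdir, "m3u8.txt")
    write_manifest(manifest_path, entries)
    with open(manifest_path, "rb") as fp:
        raw = fp.read()

# The bytes on disk decode cleanly as UTF-8 regardless of the system locale.
round_trip = raw.decode("utf-8")
```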


It would still help to see a sample though. I used to know Python 3's default encoding rules on Windows (yes, I wrestled with them long before this), but in practice they seem to be all over the place.

If you know how to upload m3u8.txt to a gist without butchering the encoding (e.g., with defunkt/gist), please share it as a gist; otherwise, an ephemeral file sharing site like file.io works for me too.

P.S. I've been very busy and off QQ for at least a month. I'll be back though... at some point.

xyleey commented 5 years ago

https://file.io/w66xtZ I feel so grateful and reassured that you are still here! As I don't really understand what you need for the gist, I just uploaded the original m3u8.txt to file.io; the link is above. For more information:

  1. I use Notepad++ as my default text editor, so the m3u8.txt was produced as a .txt under that software's settings (I don't know whether this may cause any problems... for now...)
  2. After some simple experiments, I found that the issue only affects the Chinese characters in the first two lines of my m3u8.txt, i.e.:
    [image] Apart from these two items, everything goes very well. So it is a really specific problem... I guess.

I have started trying to understand more of the descriptions and code in your repositories, and recently I have been trying to tackle my confusion myself with loony175's suggestions... and I'm updating files in the SNH48Live repo as well... manually, of course.. 0v0 Thanks again for your existence; I'm looking forward to seeing you back.

zmwangx commented 5 years ago

Okay, chardet is detecting perfect UTF-8 content as Turkish just because of an emoji. Behold (a screencast):

[asciicast]

Facepalm.

Also, TIL chardet/chardet is not a binding for uchardet, which detects this perfectly, as demonstrated above.

Therefore, caterpillar v0.8 didn't solve the encoding problem; it caused the encoding problem. v0.9 fixes it.
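For contrast with statistical detection, here is a minimal sketch of the "no guessing" approach on the reading side: decode strictly as UTF-8 and fail loudly otherwise. This is not caterpillar's actual code; `decode_manifest`, the sample line, and the choice of `utf-8-sig` (to tolerate a BOM that some Windows editors add) are assumptions.

```python
def decode_manifest(raw: bytes) -> str:
    """Decode manifest bytes as UTF-8 instead of guessing the encoding.

    Hypothetical helper. "utf-8-sig" behaves like UTF-8 but also strips
    a leading byte order mark if one is present.
    """
    return raw.decode("utf-8-sig")


# Hypothetical manifest line whose file name contains an emoji.
sample = "https://example.com/b.m3u8\t陈思 🎉.mp4\n"

plain = decode_manifest(sample.encode("utf-8"))
with_bom = decode_manifest(b"\xef\xbb\xbf" + sample.encode("utf-8"))

# Input that is not valid UTF-8 raises instead of being silently mis-decoded.
try:
    decode_manifest(b"\xff\xfe\x00")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

A strict decode is deterministic: an emoji is just another valid UTF-8 sequence, so it cannot tip the result toward a different code page.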

xyleey commented 5 years ago

Alright, it was negligence on my part that I had not tried an m3u8.txt containing only 张语格's URL. That works really nicely. So yes, this should be the emoji case. I've also just searched all VOD information since the start of the year, and found that only 陈思 used that emoji, in several m3u8 VODs. Those VODs did not work well in batch mode either. For instance: [image] (jesus, why did I never realise it before...)

Yeah, it works now. I really appreciate your upgrades.

By the way, you are still overflowing with cuteness like before hhhhhhhh....

zmwangx commented 5 years ago

PSA: KVM48 can now be used on Windows natively (no WSL). caterpillar is natively integrated too so it can be automatically invoked (currently requires a beta version of caterpillar which will be promoted to stable in due course). Details are in the release notes [1].

[1] https://github.com/SNH48Live/KVM48/releases/tag/v0.3