Closed: xyleey closed this issue 5 years ago
Okay, apparently my decision to add chardet to caterpillar turned out to be a dumpster fire. I've used uchardet to great success in the past, but I guess for short strings it's almost as good as a PRNG. I realized this as I released the feature, to which end I quote:
Add encoding detection for batch manifest (with varying degrees of success due to sample size, better just use UTF-8).
In other words, it fucking, doesn't, work.
So I'm following my own advice now: just use UTF-8. v0.1.5 does just that. Hopefully it addresses the situation.
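For illustration, "just use UTF-8" could look roughly like the sketch below. This is not caterpillar's actual code: `read_manifest` is a hypothetical helper name, and the only assumption about the manifest is that it is a text file with one entry per line.

```python
from pathlib import Path


def read_manifest(path):
    """Read a batch manifest as UTF-8, with no encoding detection.

    Hypothetical helper for illustration; not caterpillar's real API.
    """
    raw = Path(path).read_bytes()
    try:
        # "utf-8-sig" also accepts a leading BOM, which some Windows
        # editors add when saving a file as UTF-8.
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError as exc:
        # Fail loudly instead of guessing another encoding.
        raise ValueError(
            f"{path} is not valid UTF-8; please re-save it as UTF-8"
        ) from exc
    # Keep only non-blank lines.
    return [line for line in text.splitlines() if line.strip()]
```

The design choice here is to reject anything that is not valid UTF-8 rather than guess, since a wrong guess on a short manifest is harder to diagnose than an explicit error.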
It would still help to see a sample though. I used to know Python 3's default encoding rules on Windows (yes, I wrestled with them long before this), but in practice they seem to be all over the place.
If you know how to upload m3u8.txt to a gist without butchering the encoding (e.g., with defunkt/gist), please share it as a gist; otherwise, an ephemeral file sharing site like file.io works for me too.
P.S. I've been very busy and off QQ for at least a month. I'll be back though... at some point.
https://file.io/w66xtZ I feel so grateful and reassured that you are still here! Since I don't really understand what you need for the gist, I just uploaded the original m3u8.txt to file.io; the link is above. For more information:
Recently I've started trying to understand more of the descriptions and code in your repositories, and to work through my confusion on my own with loony175's suggestions... and I'm updating files in the SNH48Live repo as well... manually, of course.. 0v0 Thanks again for being here, and looking forward to seeing you back.
Okay, chardet is detecting perfect UTF-8 content as Turkish just because of an emoji. Behold (a screencast):
Facepalm.
Also, TIL chardet/chardet is not a binding for uchardet, which detects this perfectly, as demonstrated above.
Therefore, caterpillar v0.8 didn't solve the encoding problem; it caused the encoding problem. v0.9 fixes it.
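To illustrate why a misdetection like this breaks things: the same bytes decode to entirely different text depending on the codec chosen. The sketch below uses only Python's built-in codecs (neither chardet nor uchardet is involved), and the sample string is made up for the example rather than taken from the actual m3u8.txt.

```python
# Sample text is invented for illustration; it is not from the real manifest.
original = "张语格 😂"
data = original.encode("utf-8")

# Decoding the UTF-8 bytes as windows-1254 (Python codec name: cp1254)
# does not fail for this sample; it silently yields mojibake.
garbled = data.decode("cp1254")

# Decoding with the correct codec round-trips cleanly.
restored = data.decode("utf-8")

print(repr(garbled))
print(garbled == original)   # False
print(restored == original)  # True
```

Because the wrong codec does not raise an error for this sample, the failure only shows up later, when the garbled names are used.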
Alright, it was negligence on my part: in my experiments I never tried an m3u8.txt containing only 张语格's URL. That works really nicely. So yes, this must be the emoji case. I've just searched all the VOD information since the start of the year and found that only 陈思 used that emoji, in several m3u8 VODs. Those VODs don't work well in batch mode either. For instance: (jesus, why did I never realise it before...)
Yeah, it works now. I really appreciate your upgrades.
By the way, you're still overflowing with cuteness like before hhhhhhhh....
PSA: KVM48 can now be used on Windows natively (no WSL). caterpillar is natively integrated too so it can be automatically invoked (currently requires a beta version of caterpillar which will be promoted to stable in due course). Details are in the release notes [1].
When kvm48 merges m3u8 URLs into the m3u8.txt file, the Chinese characters in the file names end up in code page windows-1254 instead of UTF-8. This causes an error when I consume the m3u8.txt with caterpillar. Debug information is attached as a screenshot. Besides, when I create a new .txt file and type in those Chinese characters manually as UTF-8, the file works really well in caterpillar batch mode. Therefore, I guess the problem is in kvm48's output. I would very much appreciate some professional help with this issue. Thanks!