Weird-ass unicode characters in title alt links crapping out json

NLZ commented 3 years ago

I'm getting this error when I try to grab this title: https://mangadex.org/title/50185/the-most-notorious-talker-runs-the-world-s-greatest-clan

Traceback (most recent call last):
  File "C:\Users\nlz\Code\mdownloader\mdownloader.py", line 120, in <module>
    before_main(args.id, args.language, args.directory, args.type, args.folder, args.save_format, args.covers)
  File "C:\Users\nlz\Code\mdownloader\mdownloader.py", line 102, in before_main
    main(id, language, directory, type, folder, save_format, covers)
  File "C:\Users\nlz\Code\mdownloader\components\main.py", line 103, in main
    typeChecker(id, language, route, type, make_folder, save_format, covers)
  File "C:\Users\nlz\Code\mdownloader\components\main.py", line 18, in typeChecker
    downloadBatch(id, language, route, type, make_folder, save_format, covers)
  File "C:\Users\nlz\Code\mdownloader\components\downloader.py", line 222, in downloadBatch
    downloadChapter(chapter_id, route, type, title, make_folder, save_format, json_file)
  File "C:\Users\nlz\Code\mdownloader\components\downloader.py", line 153, in downloadChapter
    json_file.core(0)
  File "C:\Users\nlz\Code\mdownloader\components\jsonmaker.py", line 212, in core
    self.saveJson(json_data)
  File "C:\Users\nlz\Code\mdownloader\components\jsonmaker.py", line 193, in saveJson
    json.dump(json_data, json_file, indent=4, ensure_ascii=False)
  File "C:\Python39\lib\json\__init__.py", line 180, in dump
    fp.write(chunk)
  File "C:\Python39\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u300c' in position 127: character maps to <undefined>

\u300c is 「, and a bit of digging revealed that the NovelUpdates link contains it: the-strongest-clans-master-is-the-weakest-and-most-evil-support-class-even-with-a-fail-job「talker」-with-my-brains-and-dependable-allies-abilities-im-the-worlds-strongest-see
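
For context, the traceback ends in cp1252.py, which suggests the JSON file was opened with the Windows default codec while json.dump() was called with ensure_ascii=False. The sketch below reproduces that mismatch in isolation and shows one possible way around it, opening the file with an explicit encoding="utf-8". The sample data and the file name example.json are made up for illustration; this is not the project's actual saveJson code.

```python
import json
import os
import tempfile

# Hypothetical sample data containing the character from the traceback.
data = {"links": {"nu": "fail-job「talker」"}}

# cp1252 has no mapping for U+300C, which is what the traceback reports.
try:
    "「".encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Writing with an explicit UTF-8 encoding avoids the platform default codec.
path = os.path.join(tempfile.mkdtemp(), "example.json")
with open(path, "w", encoding="utf-8") as json_file:
    json.dump(data, json_file, indent=4, ensure_ascii=False)

# Read the file back to confirm the character survived.
with open(path, encoding="utf-8") as json_file:
    round_tripped = json.load(json_file)
```

With this approach the character is stored as-is in the file, so the links would not need to be percent-encoded just to make the write succeed.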

I think the solution would be to URL-encode the links, so that 「 becomes %E3%80%8C, here: https://github.com/Rudoal/mdownloader/blob/9114ed091215d9da60ee2f2526c78d6c9a214065/components/jsonmaker.py#L62

So here I would first import urllib.parse, then wrap the links with urllib.parse.quote() (probably all of them, just to be sure).
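
A minimal sketch of what urllib.parse.quote() does to the problematic part of the link. The shortened slug is only an illustration taken from the link above, not code from jsonmaker.py.

```python
from urllib.parse import quote, unquote

# Shortened piece of the NovelUpdates slug that contains the brackets.
slug = "fail-job「talker」-with-my-brains"

# quote() percent-encodes the UTF-8 bytes of every non-ASCII character.
encoded = quote(slug)

# unquote() reverses the encoding, so no information is lost.
decoded = unquote(encoded)

print(encoded)
```

The encoded string is pure ASCII, so it can be written through any codec, including cp1252.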

ArdaxHz commented 3 years ago

Huh, I haven't come across this error before. Fixed in 2.7.2.

ArdaxHz commented 3 years ago

It now encodes the colon in the URLs too. 🤔

NLZ commented 3 years ago

> It now encodes the colon in the URLs too. 🤔

Oh, I see the issue.

https://github.com/Rudoal/mdownloader/blob/8164e67d793838810734871cf6e19298fd094b99/components/jsonmaker.py#L148

Why not quote just the data instead of the whole URL?

json_links["manga_updates"] = f'https://www.mangaupdates.com/series.html?id={quote(self.manga_data["links"]["mu"])}'
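
To illustrate the colon problem: by default quote() treats only "/" as safe, so passing a complete URL through it also encodes ":", "?" and "=". The sketch below compares quoting the whole URL with quoting only the inserted value. The id 12345 and the helper name build_mu_link are hypothetical and used only for this example.

```python
from urllib.parse import quote

# Hypothetical id value; the real one comes from the manga's link data.
mu_id = "12345"
full_url = f"https://www.mangaupdates.com/series.html?id={mu_id}"

# Quoting the whole URL mangles the scheme and the query separators.
whole_quoted = quote(full_url)

# Widening the safe set keeps the URL structure intact.
safe_quoted = quote(full_url, safe=":/?=&")


def build_mu_link(value: str) -> str:
    """Quote only the inserted value and leave the fixed URL prefix alone."""
    return f"https://www.mangaupdates.com/series.html?id={quote(value)}"


print(whole_quoted)
print(build_mu_link("「talker」"))
```

Quoting only the variable part keeps the fixed prefix readable and avoids having to maintain a list of safe characters.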

ArdaxHz / mdownloader

Weird-ass unicode characters in title alt links crapping out json #6