Serene-Arc / bulk-downloader-for-reddit

Downloads and archives content from reddit
https://pypi.org/project/bdfr
GNU General Public License v3.0

What about languages with non-latin characters? #775

Open thomas694 opened 1 year ago

thomas694 commented 1 year ago

Description

I ran into a problem with emojis on a Windows system using the default file name scheme {REDDITOR}_{TITLE}_{POSTID}. The screenshot in #221 shows a problem the logger has with Unicode characters. As a fix (#222), all non-ASCII characters are removed. Are emojis a problem in Windows file systems? What about characters from other languages, e.g. Japanese, Korean, or Chinese?

A simple solution is to add a line to set the encoding to UTF-8 in create_file_logger in connector.py#L224:

        file_handler = logging.handlers.RotatingFileHandler(
            log_path,
            mode="a",
            backupCount=backup_count,
            encoding="utf-8"
        )
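To make the suggestion above easier to try outside the BDFR, here is a minimal, self-contained sketch. The helper names `make_utf8_file_logger` and `log_and_read_back` are hypothetical and not part of the BDFR code base; only `logging.handlers.RotatingFileHandler` and its `encoding` parameter come from the standard library.

```python
import logging
import logging.handlers
import tempfile
from pathlib import Path


def make_utf8_file_logger(log_path: Path, backup_count: int = 3) -> logging.Logger:
    """Hypothetical helper: build a logger whose file handler always writes UTF-8."""
    logger = logging.getLogger(f"utf8_demo.{log_path.name}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.handlers.RotatingFileHandler(
        log_path,
        mode="a",
        backupCount=backup_count,
        encoding="utf-8",  # explicit encoding instead of the locale default
    )
    handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
    logger.addHandler(handler)
    return logger


def log_and_read_back(message: str) -> str:
    """Write one message to a temporary log file and return the file contents."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = Path(tmp) / "demo.log"
        logger = make_utf8_file_logger(log_path)
        logger.info(message)
        # Close the handler so the file can be read and deleted (required on Windows).
        for handler in list(logger.handlers):
            handler.close()
            logger.removeHandler(handler)
        return log_path.read_text(encoding="utf-8")


if __name__ == "__main__":
    print(log_and_read_back("日本語 한국어 中文 😀"))
```

With the explicit `encoding="utf-8"`, the message should round-trip unchanged regardless of the active code page; without it, the handler falls back to the platform's default encoding.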

Out of curiosity, does the logger behave differently on Linux systems, or are Unicode characters just missing from the log files?

Can we get the wider range of characters back by skipping the _strip_emojis method? This would probably need a new option to remain backward compatible.
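As a rough illustration of what such an option could look like, here is a sketch. It assumes the current behaviour is "drop every non-ASCII character", as described above. The names `strip_non_ascii`, `format_title`, and `keep_unicode` are hypothetical and do not exist in the BDFR.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character that cannot be represented in ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def format_title(title: str, keep_unicode: bool = False) -> str:
    """Hypothetical formatter step: strip non-ASCII characters unless told otherwise.

    keep_unicode defaults to False, so existing behaviour stays unchanged.
    """
    return title if keep_unicode else strip_non_ascii(title)


if __name__ == "__main__":
    sample = "猫 cat 😀"
    print(repr(format_title(sample)))                     # ASCII-only result
    print(repr(format_title(sample, keep_unicode=True)))  # original title kept
```

Because the flag defaults to the existing behaviour, file names produced by earlier runs would stay the same unless the user opts in.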

Serene-Arc commented 1 year ago

No, this is not possible. Unicode characters are included in Linux log files, but Windows uses a severely restricted character set known as Windows-1252. If you want the full character set, then the only solution is to run the BDFR on Linux, Unix, or a derivative system, such as macOS.

thomas694 commented 1 year ago

The changed version works pretty well here.

Second, I'm sure Windows uses a code page that fits the region it is used in, but that is not always 1252 (it is 437 here). If you tell your program to write files in UTF-8 instead of letting the library choose a code page automatically, the files reliably contain the Unicode characters that were written, and they depend far less on regional settings and the like.
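The difference between code pages can be checked directly in Python. The sketch below uses only the standard `str.encode` method. The helper `can_encode` is a hypothetical name introduced for illustration.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text is representable in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    sample = "日本語 😀"
    for encoding in ("cp437", "cp1252", "utf-8"):
        print(encoding, can_encode(sample, encoding))

    # The same character maps to different bytes depending on the code page.
    for encoding in ("cp437", "cp1252", "utf-8"):
        print(encoding, "é".encode(encoding))
```

Neither code page 437 nor 1252 can represent the Japanese text or the emoji, while UTF-8 can. Even a character both code pages share, such as "é", is stored as different bytes in each, which is why a file written under one regional setting may be misread under another.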

Serene-Arc commented 1 year ago

If Windows doesn't already write UTF-8 to log files, that can be done with an enhancement.