Page UTF-8 encoding issue?

evmer / perlego-downloader

Download books from Perlego.com in PDF format

MIT License

106 stars 52 forks

Page UTF-8 encoding issue? #13

Closed reallyallnamestaken closed 1 year ago

reallyallnamestaken commented 1 year ago

I come across some books with what look like character encoding issues. They affect seemingly random pages (though always the same pages if I redo the download), and only in certain books.

Characters such as â€¢Â or ÂÂÂ appear across these pages.

reallyallnamestaken commented 1 year ago

I found the cause: the web pages are UTF-8 encoded, but nothing in them declares that encoding to the browser. As a workaround, I modified the code to write the needed charset declaration to the cache file.

# Prepend a charset declaration so the browser decodes the cached page as UTF-8
with open(f'{cache_dir}/{chapter_no}.html', 'w', encoding='utf-8') as f:
    # the line below does the magic
    f.write('<meta charset="utf-8" />\n')
    f.write(content)
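
For anyone wondering where the garbled characters come from, here is a minimal sketch that reproduces them. It assumes the browser falls back to Windows-1252 when no charset is declared, which is a common default but not something confirmed for every setup; the sample string is illustrative only and not taken from any actual book.

```python
# Sketch: UTF-8 bytes decoded with the wrong codec produce the garbled text.
# Assumption: the browser falls back to Windows-1252 (cp1252) without a charset hint.
original = "\u2022\u00a0"  # a bullet followed by a non-breaking space

# How the page is actually stored on disk
utf8_bytes = original.encode("utf-8")

# What gets displayed if those bytes are interpreted as Windows-1252
garbled = utf8_bytes.decode("cp1252")
print(garbled)

# Decoding with the declared (correct) codec restores the text
restored = utf8_bytes.decode("utf-8")
print(restored == original)
```

The meta tag in the workaround does not change the stored bytes; it only tells the browser which codec to use when reading them.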

evmer commented 1 year ago

Hello, thank you for reporting this to me. I modified the script with your proposed fix.