de3sw2aq1 / wattpad-ebook-scraper

UNMAINTAINED, use https://github.com/JimmXinu/FanFicFare instead
MIT License
24 stars 8 forks

special characters are not parsed properly #6

Closed PescheHelfer closed 8 years ago

PescheHelfer commented 9 years ago

Example story:

python scrape.py https://www.wattpad.com/story/9105367-terminator

--> Special characters such as the apostrophe (') and long dashes (–) are not parsed correctly in the epub. This results in question marks when you open the epub in Calibre, and in text that ends abruptly when you open it in Adobe Digital Editions.

Strangely, when I open the html in notepad, the characters are there. But apparently, they are not proper html.

PescheHelfer commented 9 years ago

I looked a bit more into this. Using Sigil, I created a version that works properly. Then, I adjusted the version created by the scraper to the point that it was identical to the Sigil version. But still, the characters wouldn't show properly. The only thing I did not adjust was the encoding. Hence, I used the File Encoding Checker to compare the encoding of the files. You can find this tool here: https://encodingchecker.codeplex.com/downloads/get/clickOnce/EncodingChecker.application

Indeed, the html files created by the scraper use the standard Windows text encoding (windows-1252), whereas the file created by Sigil uses UTF-8.

Changing the encoding of the output files to UTF-8 should probably fix the problem. Unfortunately, I don't know Python, so I can't fix this myself.
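The difference reported by the encoding checker can be reproduced in Python. The following is a minimal sketch, not code from the scraper; `is_valid_utf8` is a hypothetical helper introduced only for illustration. It shows that the same character becomes different bytes under windows-1252 and UTF-8, and that the windows-1252 bytes are not valid UTF-8, which is consistent with a reader that expects UTF-8 showing question marks.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


apostrophe = "\u2019"  # right single quotation mark, a typical "special" apostrophe

as_cp1252 = apostrophe.encode("windows-1252")  # a single byte
as_utf8 = apostrophe.encode("utf-8")           # a three-byte sequence

print(as_cp1252, is_valid_utf8(as_cp1252))
print(as_utf8, is_valid_utf8(as_utf8))
```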

PescheHelfer commented 9 years ago

Ok, I fixed this one on my machine by modifying ez_epub.py: a method was added which converts all html-files to UTF-8 prior to packing them into the archive.

That's probably not the way to go as ez_epub is a project by someone else, but it's a temporary fix for me. That's also why I haven't opened a pull request.

If you're interested in the fix, let me know and I will try to upload it.

The following lines were added:

import codecs
import glob
import os

def _convert_to_utf8(self, outputDir):
    # Re-encode every generated HTML file from windows-1252 to UTF-8.
    path = os.path.join(outputDir, "OEBPS", "*.html")
    blocksize = 1048576  # read in blocks rather than loading whole files
    for fname in glob.glob(path):
        sourceFilePath = fname
        targetFilePath = fname.replace(".html", "_utf8.html")
        with codecs.open(sourceFilePath, "r", "windows-1252") as sourceFile:
            with codecs.open(targetFilePath, "w", "utf-8") as targetFile:
                while True:
                    contents = sourceFile.read(blocksize)
                    if not contents:
                        break
                    targetFile.write(contents)
        # Replace the original file with its UTF-8 copy.
        os.remove(sourceFilePath)
        os.rename(targetFilePath, sourceFilePath)

P.S. Depending on the OS on which you run the tool, "windows-1252" may not be the encoding used by default. In that case, the conversion would not work because of the following line: with codecs.open(sourceFilePath, "r", "windows-1252") as sourceFile:

Some kind of automatic detection of the codec is required, here.
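A simple approximation of such detection is to try candidate encodings in order. This is only a sketch under stated assumptions: `decode_with_fallback` is a hypothetical helper, the candidate list is a guess, and this is a heuristic rather than real detection. It relies on the fact that bytes written as windows-1252 with non-ASCII characters are rarely valid UTF-8 by accident, so UTF-8 is tried first.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "windows-1252")) -> str:
    """Decode bytes with the first candidate encoding that accepts them."""
    for name in encodings:
        try:
            return data.decode(name)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this last resort never fails,
    # although the result may not be the text the author intended.
    return data.decode("latin-1")


print(decode_with_fallback("\u2013".encode("windows-1252")))  # en dash from a cp1252 file
print(decode_with_fallback("T\u014dhohu".encode("utf-8")))    # text from a UTF-8 file
```

A dedicated detection library such as chardet is another option when the candidate encodings are not known in advance.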

PescheHelfer commented 9 years ago

Ok, forget about the above, it worked ok for the terminator story, but failed with https://www.wattpad.com/story/24305532-the-kabul-incident-a-weir-codex-novella

Reason: this story uses more exotic characters such as "Tōhohu Earthquake". The "ō" caused the epub module to crash with a Unicode encoding error when the following method was called:

def __writeItems(self):
    items = self.getAllItems()
    for item in items:
        outname = os.path.join(self.rootDir, 'OEBPS', item.destPath)
        if item.html:
            fout = open(outname, 'wt')                         # --> UNICODE ENCODING ERROR
            fout.write(item.html)
            fout.close()
            ...

Not really knowing what I was doing I changed it to:

def __writeItems(self):
    items = self.getAllItems()
    for item in items:
        outname = os.path.join(self.rootDir, 'OEBPS', item.destPath)
        if item.html:
            fout = open(outname, 'wt', encoding='utf-8')        # <-- specified encoding
            fout.write(item.html)
            fout.close()
            ...

This prevents the crash, all characters are now displayed properly, no more need for the modification of ez_epub :)
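The reason the change helps: in text mode without an `encoding` argument, `open()` uses the platform's locale encoding, which on the Windows machine described in this thread was windows-1252, and that code page has no mapping for "ō". The sketch below illustrates this outside the epub module; `write_html` is a hypothetical stand-in, not the module's actual method.

```python
import os
import tempfile


def write_html(path: str, html: str) -> None:
    """Write text with an explicit encoding instead of the locale default."""
    with open(path, "wt", encoding="utf-8") as fout:
        fout.write(html)


text = "<p>T\u014dhohu Earthquake</p>"

# windows-1252 cannot represent "ō", which is what made the original write fail.
try:
    text.encode("windows-1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "chapter.html")
    write_html(out, text)
    with open(out, "rb") as fin:
        raw = fin.read()

print(cp1252_ok, raw)
```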

However, I don't know if it's ok to just modify a class provided by someone else. And I also don't know how to fix it without modifying this class - maybe you can do the encoding somewhere further upstream in the scrape class.
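One way to handle this upstream without touching the epub class would be to turn every non-ASCII character into an HTML numeric character reference before handing the markup over. The output is then pure ASCII, which any ASCII-compatible default encoding can write. This is a sketch only: `to_ascii_html` is a hypothetical helper, and whether the scraper has a convenient place to call it is an assumption.

```python
def to_ascii_html(html: str) -> str:
    """Replace non-ASCII characters with numeric character references."""
    return html.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_ascii_html("T\u014dhohu Earthquake"))
print(to_ascii_html("don\u2019t"))
```

The trade-off is that the generated files are harder to read in a text editor, since the characters appear as references rather than as themselves.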

de3sw2aq1 commented 9 years ago

Just to make sure, are you using an up-to-date version of wattpad-ebook-scraper? #3 was recently merged in and was supposed to fix issues like this. ...However, it might have broken something. I'll look into this too.

PescheHelfer commented 9 years ago

I am not entirely sure how to check the version. I downloaded it a few days ago. scrape.py starts like this:

#!/usr/bin/env python3

import sys
import time
import json
import io

import requests
import dateutil.parser
from genshi.input import HTML

import ez_epub

# Setup session to not hit Android download app page
# TODO: Cookies probably aren't needed if only API requests are made
session = requests.session()
session.cookies['android-noprompt'] = '1'
session.cookies['skip-download-page'] = '1'

That at least indicates python 3.

Also, the included readme.md points to Python 3:

Usage

List one or more story URLs as command line arguments

$ python3 scrape.py http://www.wattpad.com/story/9876543-example-story http://www.wattpad.com/story/9999999-example-story-2

Could this again be a Windows thing? Maybe encoding on Windows behaves slightly differently. When I inspected the resulting html files before modifying epub.py, the encoding was windows-1252.

duncane commented 8 years ago

Hi guys, I'm getting errors related to this when trying to fetch books written in French.

The patch you provided earlier fixed part of the problem, but I'm still getting errors with, for example, this book.

https://www.wattpad.com/story/36029572-la-lotus-chronicles-1

Saving epub
Traceback (most recent call last):
  File "scrape.py", line 100, in <module>
    download_story(story_url)
  File "scrape.py", line 50, in download_story
    print('Story "{story_title}": {story_id}'.format(story_title=story_title, story_id=story_id))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 14: ordinal not in range(128)

PescheHelfer commented 8 years ago

Hi duncane

It does work with my modified version (the changes I mentioned on Aug 6).

You can download the epub "Lotus Chronicles" from here: http://we.tl/hESUsQ4erF

@de3sw2aq1: I tried to open a pull request with my changes, but got the following error: "Sync failed to push local changes It seems you do not have permission to push your changes to this repository. Request failed. Validation failed."

Would it be possible to grant me this permission? I am quite new to GitHub. If I open a pull request, you will be able to review the changes before they get integrated, right?

duncane commented 8 years ago

@spitfireCH I'm more interested in your functional version of the script than in the book itself ^^

I just redownloaded the whole script from git and applied your patch to the epub.py file, and it's still failing ...

duncane commented 8 years ago

Got it: the locales were not correctly configured on my system. I reconfigured it to a UTF-8 one, and it worked ^^

de3sw2aq1 commented 8 years ago

@spitfireCH Can you click the "fork" button at the top right of the page? This will create a second copy of this repository on your GitHub account. You can then push to that version of the repository. After you push to your fork, open it in your web browser, you should see an option to make a pull request.

de3sw2aq1 commented 8 years ago

@spitfireCH I might have fixed it. @duncane's comment that changing the encoding fixed a related issue made me realize that we weren't setting any encodings in epub.py when we write out the files. I've done that now and I think it might fix your issues. Can you test the current version of the scraper now?

@duncane I think you are experiencing a different issue. It seems we're assuming that sys.stdout.encoding is utf-8, or at least something capable enough that we can output anything to it. If it's US-ASCII, for example, we can't write non-ASCII text to it at all. Setting PYTHONIOENCODING, LC_ALL, or similar to utf-8 fixes this, but I'm not sure if there's a way I can fix it from within the code?
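One possible approach from within the code, sketched under the assumption of Python 3.7 or later (where text streams gained `reconfigure`): keep the stream's encoding but change its error handler so unencodable characters are escaped instead of raising. `make_tolerant` is a hypothetical helper, and the ASCII-only stream below merely stands in for a misconfigured `sys.stdout`.

```python
import io
import sys


def make_tolerant(stream):
    """Escape unencodable characters on output instead of raising."""
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is not None:  # missing on Python < 3.7 and on some replaced streams
        reconfigure(errors="backslashreplace")
    return stream


# Stand-in for a terminal whose encoding is US-ASCII.
buffer = io.BytesIO()
ascii_stream = make_tolerant(io.TextIOWrapper(buffer, encoding="ascii"))
ascii_stream.write("caf\u00e9")
ascii_stream.flush()
written = buffer.getvalue()

# In a script, the same call would be applied once at startup:
make_tolerant(sys.stdout)
print(written)
```

This trades a crash for escaped output in progress messages; it does not affect how the epub files themselves are written.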

duncane commented 8 years ago

@de3sw2aq1 I'm not sure you can fix it from the code, but you could raise an error saying that the current system locale doesn't allow the book to be saved. That would help the user get it changed, and it would also spare them a Python error message they might not understand ^^
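A check along these lines could look like the sketch below. `encoding_problem` is a hypothetical helper and the message wording is only a suggestion; the idea is to test whether the text can be encoded for the output stream before printing it, and to explain the fix if it cannot.

```python
import sys


def encoding_problem(text, encoding):
    """Return a readable message if text cannot be encoded, else None."""
    encoding = encoding or "ascii"  # sys.stdout.encoding can be None for some streams
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return (
            "Cannot print this title using the output encoding {!r}. "
            "Try a UTF-8 locale (for example via LC_ALL) or set "
            "PYTHONIOENCODING=utf-8.".format(encoding)
        )
    return None


problem = encoding_problem("Caf\u00e9", "ascii")
if problem:
    print(problem, file=sys.stderr)
```

In the scraper this would be called with the story title and `sys.stdout.encoding`, and the script could exit with the message instead of a traceback.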

PescheHelfer commented 8 years ago

@de3sw2aq1 I tested the new version. It now works fine with my first example:

python scrape.py https://www.wattpad.com/story/9105367-terminator

but it still fails with the second example: python scrape.py https://www.wattpad.com/story/24305532-the-kabul-incident-a-weir-codex-novella

edit: that of course is due to the colon in the title, as mentioned in issue #5

I'll try to branch a fix for this ...

de3sw2aq1 commented 8 years ago

Closing this as I think this specific issue is solved now.

I'll open a new issue regarding sys.stdout.encoding.