JimmXinu / FanFicFare

FanFicFare is a tool for making eBooks from stories on fanfiction and other web sites.

[proposal with PoC] Another way to deal with fanfiction.net #677

Closed nsapa closed 3 years ago

nsapa commented 3 years ago

Hi,

My ISP's subnets are tagged as bots by Cloudflare, so I always get the captcha on FanFiction.net. I have seen the use_browser_cache feature, but it doesn't work for my use case (too many stories to update).

I hacked together some Python that drives Firefox through Selenium and acts as a pseudo-proxy. It detects the Cloudflare captcha and lets you complete it manually before retrying the request (and answering the client). It is very buggy because this is my first time working with Python and Selenium and I don't understand everything yet. But it was able to download a whole story from fanfiction.net, so the concept works.
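A minimal sketch of the detect-and-wait idea described above, not the actual proxy code. It assumes a Selenium-style driver object exposing get(), title and page_source, and it recognizes the challenge by its page title ("Attention Required! | Cloudflare" is the title shown in the session log later in this thread; "Just a moment" is an additional assumed marker). All function names here are hypothetical.

```python
# Hypothetical helpers for illustration; none of these names come from the proxy.
CLOUDFLARE_TITLE_MARKERS = ("Attention Required!", "Just a moment")


def looks_like_cloudflare_challenge(page_title):
    """Return True if the page title suggests a Cloudflare interstitial."""
    title = page_title.lower()
    return any(marker.lower() in title for marker in CLOUDFLARE_TITLE_MARKERS)


def fetch_with_manual_captcha(driver, url, wait_for_user=input):
    """Load url, pausing for the user while a challenge page is displayed.

    driver is assumed to behave like a Selenium WebDriver: get(url) loads the
    page, and title / page_source reflect whatever the browser shows now.
    """
    driver.get(url)
    while looks_like_cloudflare_challenge(driver.title):
        # Block until the user says the captcha is solved, then check again.
        wait_for_user("Complete the captcha then press enter")
    return driver.page_source
```

Keeping the title check in its own function makes it possible to test the detection without starting a browser.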

The reason I opened this issue is to get guidance on how to hook this into FanFicFare. Currently, I replace the _fetchUrl function in fanficfare/adapters/adapter_fanfictionnet.py. I know that is the wrong spot. Where should I hook this in?

Best regards, Nicolas SAPA

JimmXinu commented 3 years ago

You appear to be starting from an old version of FFF. _fetchUrl() has been removed entirely and the network layer of FFF has been refactored. You probably want to start looking in fanficfare/fetcher.py.

I did look at using selenium to get around Cloudflare, but I dropped it. My notes indicate that was because it didn't work with the Calibre plugin version of FFF, and there weren't easy ways through selenium to download images or make POST requests. POST doesn't matter for ffnet, but people do care about their cover images.

I'm honestly not interested in supporting a Selenium-based CLI version myself, but if you can get it working reasonably cleanly and are willing to stick around and support it, I will consider including it.

nsapa commented 3 years ago

I did a git fetch on an already cloned repository from 2020, but that doesn't follow branch renaming. I am going to study fetcher.py and see what I can do.

EDIT: You can run JavaScript in the browser context and get data back, so I can download images.

Current sample session:

$ ./ff_content.py 
2021-03-26 18:33:38.929 CET INFO root selenium-firefox-proxy version 0.1 by Nicolas SAPA <nico@byme.at>
2021-03-26 18:33:38.930 CET INFO root This Alpha software is licensed under CECILL-2.1
2021-03-26 18:33:41.932 CET INFO root Will listen on port 8888
2021-03-26 18:33:58.234 CET INFO root Current URL = https://www.fanfiction.net/s/10273521/1/Songbird, page title = Attention Required! | Cloudflare, mimetype = text/html
Complete the captcha then press enter
2021-03-26 18:34:31.423 CET INFO cloudfare_clickcaptcha Found the storytext!
2021-03-26 18:34:33.052 CET INFO root Current URL = https://www.fanfiction.net/s/10273521/1/Songbird, page title = Songbird Chapter 1, a sleeping beauty fanfic | FanFiction, mimetype = text/html
2021-03-26 18:34:51.153 CET INFO root Current URL = https://www.fanfiction.net/s/10273521/4/Songbird, page title = Songbird Chapter 4, a sleeping beauty fanfic | FanFiction, mimetype = text/html
2021-03-26 18:35:09.732 CET INFO root Current URL = https://www.fanfiction.net/image/4807542/75/, page title = (WEBP Image, 75 × 100 pixels), mimetype = image/webp
^C2021-03-26 18:35:18.226 CET INFO signal_handler Got SIGINT, breaking the main loop...
^C2021-03-26 18:35:21.109 CET INFO signal_handler Got SIGINT a second time, exiting

The first time, I ran echo 'https://www.fanfiction.net/s/10273521/1/Songbird' | nc -v localhost 8888 and Cloudflare triggered. I did the captcha, the page reloaded in Firefox and I pressed enter. Then I received the HTML of the story.

The second time, I ran echo 'https://www.fanfiction.net/s/10273521/4/Songbird' | nc -v localhost 8888. No issue with Cloudflare this time: I got the HTML without delay.

The third time, I ran echo 'https://www.fanfiction.net/image/4807542/75/' | nc -v localhost 8888. I got the binary content and, after removing my header, it was the correct cover image.
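For readers without nc, the three requests above can be approximated in Python. This is only a sketch under assumptions: it sends the URL followed by a newline (as echo does) and assumes the proxy closes the connection once it has answered. The reply is returned as raw bytes with the proxy's header still attached, because the header format is not described in this thread. The function name is hypothetical.

```python
import socket


def fetch_raw_via_proxy(url, host="127.0.0.1", port=8888, timeout=300.0):
    """Send one URL to the proxy and return every byte it answers with.

    Assumption: the proxy closes the connection after replying, so reading
    until EOF yields the complete answer (header included).
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # echo appends a newline, so this mirrors `echo URL | nc host port`.
        sock.sendall(url.encode("utf-8") + b"\n")
        chunks = []
        while True:
            chunk = sock.recv(65536)
            if not chunk:  # empty read: the peer closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

The long default timeout leaves time to solve a captcha by hand before the proxy answers.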

So now, let's see fetcher.py :)

nsapa commented 3 years ago

Here is my new PoC patch:

$ python fanficfare/cli.py -c personnal.ini -d 'https://www.fanfiction.net/s/10273521/1/Songbird' 
FFF: DEBUG: 2021-03-26 20:30:17,724: cli.py(193):     OS Version:Linux-5.10.6-200.fc33.x86_64-x86_64-with-glibc2.32
FFF: DEBUG: 2021-03-26 20:30:17,724: cli.py(194): Python Version:3.9.2 (default, Feb 20 2021, 00:00:00) 
[GCC 10.2.1 20201125 (Red Hat 10.2.1-9)]
FFF: DEBUG: 2021-03-26 20:30:17,724: cli.py(195):    FFF Version:4.0.13
FFF: DEBUG: 2021-03-26 20:30:17,736: configurable.py(981): use_cloudscraper:False
FFF: DEBUG: 2021-03-26 20:30:17,736: configurable.py(982): use_fanfictionnet_ff_proxy:true
FFF: DEBUG: 2021-03-26 20:30:17,737: configurable.py(1003): use_browser_cache:
FFF: DEBUG: 2021-03-26 20:30:17,737: configurable.py(1017): use_basic_cache:true
<...>
FFF: INFO: 2021-03-26 20:34:36,840: writer_epub.py(360): Saving EPUB Version 2.0
FFF: DEBUG: 2021-03-26 20:34:37,001: cli.py(72): Successfully wrote 'Songbird-ffnet_10273521.epub'

The generated epub contains the right cover image.

Are you happy with this approach (FanFicFare talks to another program that blocks until the user manually validates the captcha)?

JimmXinu commented 3 years ago

Seems reasonable. I have some minor quibbles, but I'm prepared to try it as long as you realize I'm going to send anyone asking questions about this to you. :-)

Have you tried with Windows or macOS? Or the Calibre plugin? It may work, since it isn't importing selenium directly. I also predict issues with firewall/anti-virus packages.

nsapa commented 3 years ago

I tested with the Calibre plugin. It is slow: it downloaded metadata for 364 stories in nearly 2 hours.

Something is wrong with my socket usage: it didn't get the full HTML 34 times. I will update the code to retry when size_expected != bytes_recd, and then raise a FailedToDownload exception if it still fails.
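A sketch of what such a fix could look like, not the code that was actually committed. A single recv() call may return fewer bytes than were sent, so the read has to loop until size_expected bytes have arrived. The names recv_exactly, download_with_retry and connect are hypothetical, and the FailedToDownload class below is a local stand-in for FanFicFare's exception.

```python
import socket


class FailedToDownload(Exception):
    """Local stand-in for FanFicFare's exception of the same name."""


def recv_exactly(sock, size_expected):
    """Read until size_expected bytes have arrived or the peer closes early."""
    chunks = []
    bytes_recd = 0
    while bytes_recd < size_expected:
        # recv() may return less than requested, so keep asking for the rest.
        chunk = sock.recv(min(size_expected - bytes_recd, 65536))
        if not chunk:  # connection closed before everything arrived
            break
        chunks.append(chunk)
        bytes_recd += len(chunk)
    return b"".join(chunks)


def download_with_retry(connect, size_expected, attempts=3):
    """Try up to `attempts` fresh connections; raise if all come up short.

    connect is a hypothetical callable returning a connected socket whose
    payload is expected to be size_expected bytes long.
    """
    for _ in range(attempts):
        with connect() as sock:
            data = recv_exactly(sock, size_expected)
        if len(data) == size_expected:
            return data
    raise FailedToDownload(
        "expected %d bytes, got %d after %d attempts"
        % (size_expected, len(data), attempts)
    )
```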

I didn't test on Windows; I will try today. I don't have any macOS system.

nsapa commented 3 years ago

Selenium on Windows is not easy to install. Writing user-friendly documentation will be a pain.

I started a download of the 34 failed stories with Calibre x64 on Windows 10 LTSC 2019 (with the proxy running under Python 3.9): https://pix.milkywan.fr/uGUERAFz.png I will report back with the result.

JimmXinu commented 3 years ago

"I tested with the Calibre plugin. It is slow: it downloaded metadata for 364 stories in nearly 2 hours."

I wouldn't expect to be able to download metadata for 364 stories from ffnet that quickly. I'm surprised you didn't get blocked entirely.

nsapa commented 3 years ago

The download of the 34 stories on Windows was a success: https://pix.milkywan.fr/JVKZbSc4.png The VM went to sleep during the chapter download, but nothing crashed or raised an exception. And it looks like the retry logic was not triggered.

That's different from my last tries under Linux, where it triggered on every big story. Maybe setting the socket to blocking helped. I don't know...

Anyone with a Mac here willing to try?

JimmXinu commented 3 years ago

I've pushed a branch named proxy for you. I split the code out into a separate file and changed the setting names.

use_nsapa_proxy:true
nsapa_proxy_address:127.0.0.1
nsapa_proxy_port:8888

It worked for me on Windows with cursory testing, but I had to kill the proxy in task manager to end it because ctrl-c didn't work. Windows Defender popped up and wanted a confirmation before it would allow 'Python' to listen to the port.

nsapa commented 3 years ago

Thank you.

I don't think I can escape the firewall alert: using a named pipe would be an autodiscovery/configuration nightmare.

Anyway, I still need to work on the proxy: refactoring the main loop, better handling on Windows (using a notification when the user needs to interact with the browser, handling Ctrl+C), writing documentation, etc.

JimmXinu commented 3 years ago

Okay. I'm going to leave the proxy code in a branch for now, but I'll try to remember to update it.

Let me know when you're ready for other people to test it and I'll merge it into FFF main.

nsapa commented 3 years ago

I worked on the proxy this morning: https://pix.milkywan.fr/IRNe2RYu.png It behaves a lot better: Ctrl+C works on Windows, it uses cross-platform notifications to tell the user that they need to solve the captcha, it has a requirements.txt, ...

I am writing documentation for Windows & Linux.

JimmXinu commented 3 years ago

Tested with the Calibre plugin version on Win10 and Linux successfully using branch proxy. There have been a couple of minor fixes to the FFF side.

Did not work on macOS for me. I had to jump through some hoops to get macOS to allow geckodriver to run and then got this error:

$ python3 ff_content.py                       
2021-04-14 13:48:52.764 CDT INFO root fanfictionnet_ff_proxy version 0.2 by Nicolas SAPA <nico@byme.at>
2021-04-14 13:48:52.765 CDT INFO root This Alpha software is licensed under CECILL-2.1
2021-04-14 13:48:52.774 CDT INFO root Running on Darwin-20.3.0-x86_64-i386-64bit
2021-04-14 13:48:59.751 CDT INFO prepare_firefox Firefox 84.0.2 on mac have started (pid = 13954)
2021-04-14 13:48:59.884 CDT INFO prepare_firefox Trying to load existing cookie...
2021-04-14 13:48:59.884 CDT INFO root Firefox is initialized & ready to works
2021-04-14 13:48:59.885 CDT ERROR root Cannot create a TCP server: module 'socket' has no attribute 'create_server'
$ python3 --version
Python 3.7.3

It's not my Mac, so I'm a bit limited in what I can do on it.

Are you ready for me to put the FFF side out in the public test versions? Or do you want to package this up to install with pip first?

nsapa commented 3 years ago

socket.create_server was added in Python 3.8... Do you know where this Python 3.7 comes from (brew, macOS, the official python.org dmg)? I reverted to socket.socket + bind() + setsockopt() + listen() and added a warning for people running under Darwin.
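A sketch of a listener that works on both sides of the Python 3.8 boundary, not the proxy's actual code. It uses socket.create_server when available and otherwise falls back to the manual sequence. Note that setsockopt() has to be called before bind() for SO_REUSEADDR to take effect. The function name and defaults are hypothetical; port 8888 matches the one used in this thread.

```python
import os
import socket


def open_listening_socket(host="127.0.0.1", port=8888, backlog=5):
    """Return a listening TCP socket on Python 3.7 as well as 3.8+."""
    if hasattr(socket, "create_server"):  # added in Python 3.8
        return socket.create_server((host, port), backlog=backlog)

    # Manual equivalent for older interpreters.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if os.name != "nt":
            # Allow quick restarts on POSIX; skipped on Windows, where the
            # option has different semantics.
            server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(backlog)
    except OSError:
        server.close()
        raise
    return server
```

Passing port=0 asks the operating system for any free port, which is handy for quick experiments.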

Calibre has its own Python 3.8 binary (calibre.app/Contents/Frameworks/Python.framework/Versions/3.8/Python), but I am not sure if I should write a script to use it. Anyway, this confirms I need access to a macOS VM to test this.

I am ready to have it in the public test version. The documentation should be available this weekend.

JimmXinu commented 3 years ago

From my brief research, it looks like the Python 2.7 on that computer came with macOS and the Python 3.7 was installed later. It's entirely probable that was me doing FFF CLI testing on a previous occasion.

Regardless, I got permission from the owner to update it and installed Python 3.9 from python.org.

ff_content.py started, then initially crapped out on the notification not finding pyobjus. But after manually installing pyobjus with pip, it worked.

JimmXinu commented 3 years ago

FYI, I've posted test versions containing the FFF side code. If you aren't already on the Mobile Read forums, I suggest joining. The FFF plugin thread is the most active FFF user group.

nsapa commented 3 years ago

Let's close this issue since the code is in a released version :)