JimmXinu / FanFicFare

FanFicFare is a tool for making eBooks from stories on fanfiction and other web sites.

Fanfiction.net stories not downloading #622

Closed Katylar closed 3 years ago

Katylar commented 3 years ago

In the logs, I get the error: Detected a Cloudflare version 2 challenge, This feature is not available in the opensource (free) version.

Edocsil commented 3 years ago

What does that require to work? I don't think it will work for example on a phone, which is where I run fanficfare.

Very concerning 🤔

chocolatechipcats commented 3 years ago

I googled it and it seems that

It sounds like an overkill, but maybe we can use selenium?

I might be extrapolating based on information I don't have (I'm not really a programmer, though I do have a bit of background in IT), but if my hypothesis is correct (that this would work in the CLI but not the Calibre plugin), could the plugin pass the URL on to the CLI version? Then the CLI downloads and sends the output back to the plugin to update Calibre.

kido5217 commented 3 years ago

Selenium requires working firefox or chrome that it would run in headless mode. Can't say anything about mobile devices or calibre plugin though.

sidney commented 3 years ago

I was testing selenium too and came here to post about it working to find that you had already done it.

chocolatechipcats, selenium is a python library that starts up a web browser such as Chrome or Firefox and interacts with it. It is designed for automated testing of web browsers. It should work in the Calibre plugin as well as the CLI. It does seem to require that you install an executable which is called a "webdriver" which is called by selenium and in turn runs your browser, Chrome, Firefox, or Safari, depending on which webdriver you have installed. In the case of FanFicFare the browser would be run in headless mode, with no visible window. Since it is a real browser, it can do all the Javascript and local storage tricks that Cloudflare asks it to, and then selenium gets the results and passes them back to the calling program.

I don't know how that can work on Android, but a little googling found people talking about some kind of Selenium support on Android. What I'm not sure about from what I've seen is if the available support would be compatible with the way the FanFicFare CLI was made to work on Android.

JimmXinu commented 3 years ago

I will be investigating selenium; thanks @kido5217 for the suggestion.

But there are limits to how far I'm willing to pursue this. On reflection, I am not comfortable with supporting paid captcha solvers--which doesn't matter, because it doesn't solve the existing problem.

And for those thinking about it, the paid version of cloudscraper is not an option for inclusion in FanFicFare for obvious reasons. If any of you choose to pursue that, I ask that you take discussion of it somewhere else.

JimmXinu commented 3 years ago

FYI, selenium does not reliably get past CloudFlare for me.

A single request frequently works, but a second request gets a full 'click here to prove you are human' and 'click all the X' captchas.

chocolatechipcats commented 3 years ago

From what I've noticed (both personal experience and various twitter posts and now your report), ffnet's current CF behaviour seems consistent with the "I'm under attack" mode, e.g. re-running browser checks damn near every time I visit.

The under attack mode isn't supposed to be used on a permanent basis, but considering it's ffnet...

chocolatechipcats commented 3 years ago

Could FlareSolverr work? Unlike selenium (which seems more general purpose), it's specifically for getting past CloudFlare so maybe it'll be more successful. It also appears to be in active development.

sidney commented 3 years ago

FlareSolverr says it doesn't handle captchas. I think that requiring installation of docker and running a proxy with a browser on a docker image is a bit too heavy a requirement for a calibre plugin. If ffnet really doesn't want automated downloaders, they have the advantage that anything that gets pages faster than a person reading them online can potentially be flagged for more stringent checking.

Mikaela commented 3 years ago

Would it be possible to somehow show the browser or connect https://privacypass.github.io/ passes to it and thus decrease the amount of captchas a bit after solving one, or would that require Fanficfare to be a browser extension?

roon0 commented 3 years ago

My apologies @JimmXinu I was confused by the heading of this stream. I will do as you suggest. Thank you for help.

I updated FFF and restarted Calibre twice. This is the error message I got

calibre, version 5.9.0 ERROR: Unhandled exception: UnicodeDecodeError:'utf-8' codec can't decode byte 0xa0 in position 5004: invalid start byte

calibre 5.9 embedded-python: True is64bit: False Windows-10-10.0.19041 Windows ('32bit', 'WindowsPE') 32bit process running on 64bit windows ('Windows', '10', '10.0.19041') Python 3.8.5 Windows: ('10', '10.0.19041', '', 'Multiprocessor Free') Interface language: en_GB Successfully initialized third party plugins: Clean Metadata (0, 0, 6) && DOC Input (1, 0, 1) && EpubMerge (2, 11, 0) && FanFicFare (3, 27, 0) && Find Duplicates (1, 8, 3) && SmartEject (2, 3, 0) Traceback (most recent call last): File "calibre_plugins.fanficfare_plugin.fff_plugin", line 572, in get_urls_from_imap_menu File "calibre_plugins.fanficfare_plugin.fanficfare.geturls", line 241, in get_urls_from_imap File "calibre_plugins.fanficfare_plugin.fanficfare.six", line 917, in ensure_str UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 5004: invalid start byte

JimmXinu commented 3 years ago

I looked at FlareSolverr; I didn't think it was likely to work any better than Selenium and it looked more complicated. Plus it suffers from the same problem open source cloudscraper has--an open source CloudFlare by-pass is inherently visible to CloudFlare whenever they want to take the time to block it.

My reading of Privacy Pass is that it still requires captchas, just not as often. And some 40% of the reviews are people saying it doesn't work / make any difference.

@roon0 - your report is unrelated to this issue and off-topic. However, I address it briefly: Something in an email isn't encoded correctly. It's not my problem and right now I'm not interested. Copy the URL(s) from email(s) manually and move on.
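For anyone curious what that traceback means: byte 0xa0 on its own is not valid UTF-8, but it is a non-breaking space in Windows-1252, which fits an email that was not encoded as UTF-8. Below is a minimal sketch of a tolerant decode, assuming a fallback to cp1252 is acceptable. The helper name `decode_email_bytes` and the sample bytes are hypothetical, and this is not how FanFicFare handles email text.

```python
def decode_email_bytes(raw: bytes) -> str:
    """Decode bytes as UTF-8, falling back to Windows-1252 if that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # A lone 0xa0 is invalid UTF-8 but maps to U+00A0 (non-breaking
        # space) in cp1252. Bytes undefined in cp1252 become U+FFFD.
        return raw.decode("cp1252", errors="replace")


# Hypothetical email fragment containing a stray 0xa0 byte.
sample = b"New chapter:\xa0https://www.fanfiction.net/s/13586946/1/"
text = decode_email_bytes(sample)
print(repr(text))
```

The fallback encoding is a guess about the sender, so the result can still be wrong for mail written in some other legacy encoding.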

atroly commented 3 years ago

Despite these issues, FanFicFare continues to be an invaluable tool as I migrate as much of my reading as possible to ao3. Due to slightly paranoid browser security settings I was quite unaware that Fanfiction.net had any advertising on their site, so they have achieved nothing but adding me to the ever-growing number of users and contributors moving off their platform.

My appreciation and thanks to Jim for producing a tool which has and continues to help so many of us.

chocolatechipcats commented 3 years ago

so they have achieved nothing but adding me to the ever-growing number of users and contributors moving off their platform.

I have actually been PMing a few authors, mentioning that I've not seen their fic crossposted on AO3 and offering them an invite if they want one. I explain my situation about ffnet making it harder and harder to use downloaders (I don't mention FFF by name though) and then also mention the benefits of AO3, such as tagging and series and that the site isn't falling apart at the seams. I started doing this back when ffnet broke the site updates and again when they added Cloudflare, and I've actually had a few authors accept invitations. Just be nice about it, and definitely don't harass them about it.

That's the best long-term solution I can think of.

JimmXinu commented 3 years ago

And if 20 readers each sent the same author only 1 message, it's still going to be a nuisance to the author.

If you want to do that, fine. Please stop using FFF discussion forums as a platform for advocating it for others.

chocolatechipcats commented 3 years ago

Fair point. I'll stop.

mcepl commented 3 years ago

Sorry, the last message on this issue. It seems to me as a viable alternative there is Firefox addon https://addons.mozilla.org/en-US/firefox/addon/epub-read-the-web-offline/ which can help to somebody in desperate situation.

sidney commented 3 years ago

It seems to me as a viable alternative there is

There are quite a few viable alternative ways to get a story that is on ffnet into an epub, including browser extensions for Chrome and Firefox, standalone programs, convincing authors to switch to ao3, etc., but those do not pertain to this open issue on GitHub, which is about FanFicFare no longer being able to access the ffnet site. As JimmXinu pointed out CloudFlare can take action to block any open source solution. If nobody comes up with a reasonable workaround for FanFicFare that can work for now, this issue will probably have to be left unsolved. Even if it is solved for the short term, without ffnet making the decision to allow such access, the effort of keeping it working is likely to ultimately fail.

chocolatechipcats commented 3 years ago

without ffnet making the decision to allow such access, the effort of keeping it working is likely to ultimately fail.

and considering that the terms of service kind of disallow downloaders (not explicitly, but it says something along the lines of 'you can only access site content through the site itself and not with a scraper or whatever')... I don't see a good way around it if ffnet decides it's an arms race.

chocolatechipcats commented 3 years ago

On the topic of browser extensions though: I wonder if FFF could work as one? Since the user's already gotten past the browser check, FFF could scrape it from the already-open browser and then pass it onto the CLI/calibre plugin. Sort of like Selenium but a bit more user involvement especially since ffnet can only view one chapter at a time. Might be overkill though.

sidney commented 3 years ago

Perhaps using selenium with this method for Chrome, and probably an equivalent debug browser mode in Firefox https://cosmocode.io/how-to-connect-selenium-to-an-existing-browser-that-was-opened-manually/

sidney commented 3 years ago

And here is question and answer about an approach to using selenium when user interaction such as a captcha might be required. It is still automated, but the browser isn't headless and the user could see and respond to a request for interaction. It would require that FanFicFare run a browser with a visible window and wait for something in the browser page that indicates that it has successfully got past the CloudFlare tests and is in the real ffnet page. Unlike the link in my previous comment, this starts out automated, with the user taking control if necessary to get past an interactive step. https://sqa.stackexchange.com/questions/30812/handing-over-manually-logged-in-browser-session-to-webdriver-selenium

mavi0 commented 3 years ago

@sidney I think the problem with this is how to integrate it into Calibre? An hour or so ago when I saw this issue I made a quick Selenium implementation which waits for the user to solve the captcha and then press enter on the CLI (because I wanted some stuff to read), and it works fine. Most of the time Cloudflare seems to do the "Just a moment" page, for which it just waits 20 secs for it to pass, and every 20ish chapters or so Cloudflare does the "One more step" captcha page. This works as a bodge in the CLI, but there's no tidy way to integrate this into Calibre, I think. Here's the bodge 👉 https://github.com/mavi0/FanFicFare/commit/f7dbb991cc10a4995aef2216b2b9bbfe80bdfb8a

mozgwar commented 3 years ago

@mavi0 Thanks

sidney commented 3 years ago

@mavi0 I wonder if CloudFlare issues challenges to that bodge so much because it creates and opens a new browser instance with every call to _fetchUrl, instead of what happens when cloudscraper is being used, where it re-uses the cloudscraper object's open connection every time. Perhaps the code would be better off being done the same as the call to cloudscraper, i.e., where configurable.py checks if the cloudscraper option is selected, have it make a Selenium object instead of a cloudscraper object that has the same methods. What if, instead of looking for Cloudflare in driver.title (BTW, is the "Coludflare" in the bodge a typo? And is it Cloudflare or CloudFlare?) and then waiting for user input, you have it wait for an element that will be on the page after it gets through any delay and challenges that CloudFlare presents? (Don't use a headless browser.) For details on how to do that, see the sections on ImplicitlyWait and FluentWait at https://www.toolsqa.com/selenium-webdriver/wait-commands/
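The "wait for a condition instead of sleeping a fixed time" idea can be sketched without Selenium at all. The helper below is a hypothetical stand-in written with the standard library only; Selenium ships its own `WebDriverWait(...).until(...)` for the real thing. The page titles are made-up sample data that simulate a challenge page followed by the real site.

```python
import time


def wait_until(predicate, timeout=30.0, poll_interval=0.5):
    """Poll predicate() until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulated sequence of page titles: two challenge pages, then the site.
titles = iter(["Just a moment...", "Just a moment...", "Story | FanFiction"])
current = {"title": ""}


def page_is_ffn():
    # Stand-in for inspecting the live browser page.
    current["title"] = next(titles)
    return "FanFiction" in current["title"]


wait_until(page_is_ffn, timeout=5, poll_interval=0.01)
print("Reached:", current["title"])
```

Polling for a positive signal (something only the real page has) avoids guessing how long the challenge takes, and the timeout keeps it from hanging forever if a captcha appears.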

chocolatechipcats commented 3 years ago

Officially, it's "Cloudflare," not capitalizing the F.

mavi0 commented 3 years ago

Mmhm, honestly this was just a low effort 10 min bodge just so I could download a couple of fics, it'd be nicer to do it properly but again I don't see selenium + Firefox/chrome being implemented nicely with calibre? I figured some other people may want a quick workaround to download some fics from ffn while they've hopefully temporarily set Cloudflare to "under attack" mode. And yep, I just copy pasted Cloudflare from the title tag on the captcha page

dlehman83 commented 3 years ago

@mavi0 The bodge is returning element not found for me. Personally I don't bulk update stories. Usually only one, maybe two or three if it's a series. I'd like to see selenium use an existing browser session with cookies etc. I think that would be the best.
Your code must be specifically looking for the gecko driver, I tried with chrome first and it complained about the path.
It also opened FF and closed it about 3 times. If it could keep the browser open until done I think that would help with Cloudflare.

I'll dig into the code tomorrow.
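Keeping one browser open for the whole download, rather than opening and closing it per request, could look roughly like the sketch below. `ReusableFetcher` and `FakeBrowser` are hypothetical names; the fake only mimics the `get`, `page_source` and `quit` members that the Selenium scripts in this thread rely on, so the example runs without a real browser. The example.invalid URLs are placeholders.

```python
class ReusableFetcher:
    """Create the browser lazily on first use and reuse it for every fetch."""

    def __init__(self, make_browser):
        self._make_browser = make_browser  # factory, e.g. a webdriver constructor
        self._browser = None

    def fetch(self, url):
        if self._browser is None:
            self._browser = self._make_browser()
        self._browser.get(url)
        return self._browser.page_source

    def close(self):
        if self._browser is not None:
            self._browser.quit()
            self._browser = None


class FakeBrowser:
    """Test double that counts how many browsers were started."""

    instances = 0

    def __init__(self):
        FakeBrowser.instances += 1
        self.page_source = ""

    def get(self, url):
        self.page_source = f"<html>{url}</html>"

    def quit(self):
        pass


fetcher = ReusableFetcher(FakeBrowser)
pages = [fetcher.fetch(f"https://example.invalid/s/1/{n}/") for n in (1, 2, 3)]
fetcher.close()
print(FakeBrowser.instances, "browser started for", len(pages), "pages")
```

Whether a long-lived session actually reduces the Cloudflare checks is untested here; the sketch only shows the object-lifetime change.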

LoisGNS commented 3 years ago

Just started getting download errors tonight, though I've been getting Cloudflare screens when going to FFnet in the browser for at least a few days.

kido5217 commented 3 years ago

I've poked selenium some more. It works for me with multiple chapters and stories if I create new "browser" object with random UserAgent for each chapter. Here's my code:

#!/usr/bin/env python3

from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from random import randint
from selenium import webdriver
from time import sleep
from urllib.parse import urlparse

FFULRS = [
    'https://www.fanfiction.net/s/13586946/1/Sons-and-Daughters-of-Sineya',
    'https://www.fanfiction.net/s/13644134/1/The-Black-Family-s-PR-Nightmare',
    'https://www.fanfiction.net/s/8560965/1/God-Slaying-Blade-Works'
]

def gen_chapter_url(url, chapter):
    parsed = urlparse(url)
    split_path = parsed.path.split('/')
    split_path[3] = str(chapter)
    parsed = parsed._replace(path='/'.join(split_path))
    return parsed.geturl()

def on_ffn(soup_obj):
    # Check that the FanFiction site loaded rather than a Cloudflare page
    ffn = soup_obj.find_all('a', string='FanFiction')
    return len(ffn) > 1

def get_browser():
    ua = UserAgent()
    userAgent = ua.random

    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    options.add_argument("--headless")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    options.add_argument(f'user-agent={userAgent}')

    browser = webdriver.Chrome(options=options)

    return browser

def get_chapter(chapter_url):
    browser = get_browser()

    browser.get(chapter_url)
    # Wait for CF
    sleep(randint(4, 7))
    html_raw = browser.page_source
    browser.quit()

    soup = BeautifulSoup(html_raw, 'html.parser')
    # Throttle
    sleep(randint(4, 7))
    return on_ffn(soup)

def get_story(url):
    print(url + ':')

    browser = get_browser()
    browser.get(url)
    # Wait for CF
    sleep(randint(4, 7))

    html_raw = browser.page_source
    browser.quit()

    soup = BeautifulSoup(html_raw, 'html.parser')
    # Are we on ffn or cf?
    if on_ffn(soup):
        print(' FFN opened: OK')
    else:
        print(' FFN opened: FAIL')
        return

    # Get chapter count
    chapters_selector = soup.find_all('select', id='chap_select')
    chapters_number = int(list(chapters_selector[0].children)[-1]['value'])

    # Throttle
    sleep(randint(4, 7))

    # Get chapters
    for chapter in range(1, chapters_number + 1):
        chapter_url = gen_chapter_url(url, chapter)
        chapter_status = get_chapter(chapter_url)
        if chapter_status:
            print(f' Chapter {chapter} of {chapters_number}: OK')
        else:
            print(f' Chapter {chapter} of {chapters_number}: FAIL')
            return

for url in FFULRS:
    get_story(url)

dlehman83 commented 3 years ago

I was able to get a story downloaded. I needed to increase the sleep time. Although this made it take quite a while to download. It opened and closed the browser for every chapter.
I then started working with the Chrome version. Gecko always gives a Cloudflare "checking your browser" page; Chrome did not.
I did find how to use an existing profile, but it is hard coded at the moment. There is also a web service. I'm hoping the combination of these two will allow a reusable profile to keep the Cloudflare checks down and speed up the process.

mcepl commented 3 years ago

@kido5217 I have put your code on https://sr.ht/~mcepl/get_ffn_story_selenium/ .

JimmXinu commented 3 years ago

FYI, at this point I am not seeing any immediately viable solutions. I have a couple off-the-wall ideas I'm thinking about, but nothing ready to share. I've basically given up on finding any quick fixes for the existing code.

As for ffnet dropping their blocking level down, I'm not holding my breath. I've always assumed FFF was a small enough community to disappear in the noise, but it's entirely possible that we are the reason they raised it in the first place.

@mcepl said:

It seems to me as a viable alternative there is Firefox addon https://addons.mozilla.org/en-US/firefox/addon/epub-read-the-web-offline/ which can help to somebody in desperate situation.

This addon is named EpubPress and is also available for Chrome. It worked when I tried it, but you have to have each chapter you want to include open in a tab.

AndyScull commented 3 years ago

Just a thought, I know it may be too much effort for a single site, but maybe you could add a 'import chapter' functionality where instead of checking and downloading the chapters, FFF instead would get data from user input? I mean, we copy the page source from browser manually and paste it into FFF for parsing or save the page in browser and point FFF to the file.
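As a rough illustration of the "paste or save the page and let the tool parse it" idea, here is a standard-library sketch that pulls paragraphs out of HTML text the user already has. The class and function names are hypothetical, the element id `storytext` is an assumption about the page markup, and the sample HTML is invented. It assumes well-formed markup with closed `<p>` tags.

```python
from html.parser import HTMLParser


class StoryTextExtractor(HTMLParser):
    """Collect paragraphs inside the first element with a given id."""

    VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0        # > 0 while inside the target element
        self.done = False     # True once the target element has closed
        self.current = []     # text pieces of the paragraph being read
        self.paragraphs = []

    def _flush(self):
        text = "".join(self.current).strip()
        if text:
            self.paragraphs.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return  # void tags never get a matching end tag
        if self.depth:
            self.depth += 1
        elif not self.done and dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS or not self.depth:
            return
        if tag == "p":
            self._flush()
        self.depth -= 1
        if self.depth == 0:
            self.done = True
            self._flush()

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)


def paragraphs_from_html(html, element_id="storytext"):
    parser = StoryTextExtractor(element_id)
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Invented sample standing in for a page saved from the browser.
saved_page = """<html><body><div id="nav">Menu</div>
<div id="storytext"><p>First paragraph.</p><p>Second <i>paragraph</i>.</p></div>
<div id="footer">Footer</div></body></html>"""
paragraphs = paragraphs_from_html(saved_page)
print(paragraphs)
```

For a saved file, the same function could be fed the file's contents, for example via `pathlib.Path(...).read_text()`.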

themaster567 commented 3 years ago

FYI, at this point I am not seeing any immediately viable solutions. I have a couple off-the-wall ideas I'm thinking about, but nothing ready to share. I've basically given up on finding any quick fixes for the existing code.

As for ffnet dropping their blocking level down, I'm not holding my breath. I've always assumed FFF was a small enough community to disappear in the noise, but it's entirely possible that we are the reason they raised it in the first place.

@mcepl said:

It seems to me as a viable alternative there is Firefox addon https://addons.mozilla.org/en-US/firefox/addon/epub-read-the-web-offline/ which can help to somebody in desperate situation.

This addon is named EpubPress and is also available for Chrome. It worked when I tried it, but you have to have each chapter you want to include open in a tab.

I have a far better alternative. Use Web2Epub. It's on Firefox and Chrome and is much more geared towards our use case. You have to fill in some boxes to point it in the right direction, but it will assemble an ebook very similar to how we do it, and it will work on just about any site if you put the effort in and are willing to possibly need to clean up the files a bit.

JimmXinu commented 3 years ago

That's very interesting--it's similar to one of the ideas I was considering, I think.

And Web2Epub is also open source at https://github.com/dteviot/WebToEpub

I've only tried one story with it so far, but I will definitely be looking into that.

MarqFJA87 commented 3 years ago

That's very interesting--it's similar to one of the ideas I was considering, I think.

And Web2Epub is also open source at https://github.com/dteviot/WebToEpub

I've only tried one story with it so far, but I will definitely be looking into that.

By "looking into that", do you mean "see if FFF can be reworked by implementing Web2Epub's code"?

chocolatechipcats commented 3 years ago

Just a thought, I know it may be too much effort for a single site, but maybe you could add a 'import chapter' functionality where instead of checking and downloading the chapters, FFF instead would get data from user input? I mean, we copy the page source from browser manually and paste it into FFF for parsing or save the page in browser and point FFF to the file.

I thought something like that would be useful but wasn't sure of how to phrase it.

dastrdly6585 commented 3 years ago

One thing to add about Web2Epub is that the download will stall due to Cloudflare wanting a captcha solve if you download a story with 50+ chapters (in the two long stories with 70+ chapters I tried, I triggered the captcha at Chapter 54 and 57). If the captcha is solved in a separate tab by opening a page of the story, after about 30 seconds Web2Epub will continue the download without throwing an error and the ebook is assembled without issue.

I haven't tried downloading a story with more than 100 chapters so the captcha request may be repeated every 50-ish chapters. The likely solution would be to add a slow down sleep time, since Web2Epub essentially has none and the repeated fast requests is probably what's triggering the captcha.

chocolatechipcats commented 3 years ago

I've not got a source for this other than my own shoddy memory and Google isn't helping, but I seem to recall reading somewhere that a randomized sleep time may be more effective for avoiding triggering Cloudflare's "this may be a bot!" alarms than a long one. This was several years ago, so perhaps things have changed since.
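For what a randomized delay between requests might look like, here is a small sketch. The function names and the 4 to 7 second range are assumptions (the range mirrors the `randint(4, 7)` pauses in the script posted earlier in this thread), and nothing here verifies that jitter helps with Cloudflare. The `sleep` parameter is injectable so the demo records delays instead of actually waiting.

```python
import random
import time


def jittered_delay(base=4.0, jitter=3.0, rng=random):
    """Return a delay in seconds drawn uniformly from [base, base + jitter]."""
    return base + rng.uniform(0.0, jitter)


def fetch_all(urls, fetch, sleep=time.sleep):
    """Fetch each URL in order, pausing a random interval between requests."""
    results = []
    for index, url in enumerate(urls):
        if index:  # no pause before the first request
            sleep(jittered_delay())
        results.append(fetch(url))
    return results


# Demo with stand-ins: str.upper plays the fetcher, a list records the pauses.
recorded = []
pages = fetch_all(["ch1", "ch2", "ch3"], fetch=str.upper, sleep=recorded.append)
print(pages, [round(d, 2) for d in recorded])
```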

JimmXinu commented 3 years ago

By "looking into that", do you mean "see if FFF can be reworked by implementing Web2Epub's code"?

In a word, no. FFF is a python tool that runs on your computer (or in Calibre) and fetches web pages itself. Web2Epub is a javascript add-on for your browser that runs within the browser. Different languages, in different environments, with different limitations.

Right now, I am mostly considering whether it's a good solution for my own reading. I'm fairly impressed on initial review.

I have some ideas I'm considering, but I'm not planning to rush anything out for a while.

I'm contemplating releasing a new version with ffnet removed or disabled just to remove the traffic of an unknown number of users trying over and over again.

themaster567 commented 3 years ago

I'm contemplating releasing a new version with ffnet removed or disabled just to remove the traffic of an unknown number of users trying over and over again.

I'd say that's for the best. You'll likely get a sizable influx of people here wondering what's going on, but at least they'll have a real answer.

chocolatechipcats commented 3 years ago

I'm contemplating releasing a new version with ffnet removed or disabled just to remove the traffic of an unknown number of users trying over and over again.

At the very least it'll clear up the confusion about what the "not available in the free version" error refers to.

tartpvule commented 3 years ago

I'm contemplating releasing a new version with ffnet removed or disabled just to remove the traffic of an unknown number of users trying over and over again.

Instead of removing the adapter outright, maybe a switch or a personal.ini setting to enable it if the user desires to? Kind of sad to see the removal of a good piece of work. :cry: :sob:

On another note, the arms race is on. Saw new things: __CF$cv$params in page's source and mentions of __selenium_unwrapped. Seems to be related to detection of Selenium. Any public/open source solutions would be countered quickly for sure. :angry: Disabling the adapter with a switch to enable would allow I and probably others to experiment further.

chocolatechipcats commented 3 years ago

What would be the point of leaving it in? It doesn't work. If the Cloudflare restrictions get relaxed or Jim figures out another method, he can put it back in.

JimmXinu commented 3 years ago

No, I don't think a switch would be a good idea--the whole idea is to get people to stop trying. Frankly, if you can't hack the code, you shouldn't be trying at this point.

Right now I'm thinking of throwing a new exception from adapter_fanfictionnet.py before any requests are made. I'm just waffling about putting it out right away, or waiting until Monday in case a miracle happens.
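A guard of that kind, raising before any network call happens, might be sketched as follows. Every name here (`SiteDisabledError`, `DisabledAdapterSketch`, `disabled_reason`, `download`) is hypothetical and not taken from FanFicFare's code; the point is only that the check comes first, so no request is ever sent.

```python
class SiteDisabledError(Exception):
    """Raised when an adapter has been deliberately switched off."""


class DisabledAdapterSketch:
    # Hypothetical message; a non-empty value means the adapter is disabled.
    disabled_reason = (
        "fanfiction.net downloads are disabled: "
        "the site's Cloudflare checks currently block automated access."
    )

    def __init__(self, fetch):
        self._fetch = fetch  # callable that would perform the real request

    def download(self, url):
        # Check first, so a disabled adapter never touches the network.
        if self.disabled_reason:
            raise SiteDisabledError(self.disabled_reason)
        return self._fetch(url)


# Demo: the fake fetcher records calls so we can see none were made.
calls = []
adapter = DisabledAdapterSketch(fetch=calls.append)
try:
    adapter.download("https://www.fanfiction.net/s/13586946/1/")
except SiteDisabledError as exc:
    message = str(exc)
print(message, "| requests made:", len(calls))
```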

On another note, the arms race is on.

Not for me it isn't. I have little interest in competing head to head with a company in their own field.

chocolatechipcats commented 3 years ago

On another note, the arms race is on. Saw new things: __CF$cv$params in page's source and mentions of __selenium_unwrapped. Seems to be related to detection of Selenium. Any public/open source solutions would be countered quickly for sure. 😠 Disabling the adapter with a switch to enable would allow I and probably others to experiment further.

Also, where are you seeing the __selenium_unwrapped? I checked ffnet's page source myself (both the main page and a random story) and CTRL+F did not find that.

chocolatechipcats commented 3 years ago

or waiting until Monday in case a miracle happens.

Personally, I would wait. FictionPress made some tweets as of 15-19 hours ago that indicate that they may still be working on...something (the "investigating Opera mini problems" one was never resolved): https://twitter.com/FictionPress

kov9413tam commented 3 years ago

I just now remembered that before I discovered FanFicFare I used the FanFiction Downloader. I tested it about 10 minutes ago and it's working. I set it to wait 20 seconds between chapters and it downloaded a 108-chapter story in one go.

https://www.fanfictiondownloader.net/#/download

chocolatechipcats commented 3 years ago

Yes - FFDL's probably going to be my method for updating ffnet stories that haven't been crossposted. I grab the last chapter, use HTML output, then import it into the ePub file. Then try to remember to update the Calibre metadata.