JimmXinu / FanFicFare

FanFicFare is a tool for making eBooks from stories on fanfiction and other web sites.
Other
750 stars 161 forks source link

Story Ids/URLs for StoriesOnline and similar websites #882

Closed cookie-crumb closed 1 year ago

cookie-crumb commented 2 years ago

For SOL, the Ids saved to calibre is url:https://domain/s/number

However, the story number is susceptible to be reused by a new story if the old story is deleted. That old form url is still working to keep the deep link alive.

The real urls to keep however, is https://domain/s/number/story-slug Should the story be deleted, and the number reused, You will see a deletion warning instead of the new story. It will also make a best effort to relocate the story based on the story-slug.

JimmXinu commented 2 years ago

There's already a solution for this from couple years ago. Have you tried setting append_datepublished_to_storyurl:true?

[storiesonline.net]
# ...
## append_datepublished_to_storyurl literally appends
## datePublished(-%Y-%m-%d) to storyUrl.  This is an ugly kludge to
## (hopefully) help address the site's unfortunately habit of
## *reusing* storyId numbers.  Off by default to *not* cause weirdness
## for those not expecting it.
#append_datepublished_to_storyurl:false

Discussion of append_datepublished_to_storyurl feature: https://www.mobileread.com/forums/showthread.php?p=4072550&postcount=5010 https://www.mobileread.com/forums/showthread.php?p=4072572&postcount=5015 https://www.mobileread.com/forums/showthread.php?p=4072789&postcount=5026 https://www.mobileread.com/forums/showthread.php?p=4073270&postcount=5049 https://www.mobileread.com/forums/showthread.php?p=4073297&postcount=5052

cookie-crumb commented 1 year ago

I tried append_datepublished_to_storyurl:true I see the change in the Ids in the calibre meta. It solve one problem: when pasting the reused url, the story is not overwritten anymore and instead added as a new story.

Unfortunately, it doesn't prevent a wrong update.

Reuse of id is the new normal for SOL, it has been weeks since a new id hasn't be reused. Renaming a story slug is much less frequent.

So if you are in the habit of forcing an update (to get the last typo corrections) before starting reading, there is a good chance you will lose the story if the author as removed it for any reason.

To simulate the problem: Change the story number in the id meta for a story with more chapter and do an update. The old story is overwritten by another unrelated despite the wrong slug.

It looks like FanFicFare remove the slug and normalize the download link to https://domain/s/number So any string added as no effect.

If you look at https://domain/s/number/story-slug and https://domain/s/number both page are identical and are either the story or the story index page or the last read chapter.

However, https://domain/s/number/some-random-string Is a different page containing either a proposed link to https://domain/s/number/story-slug or a warning of deletion.

Only sensible way around that is to save to the ids the urls pasted for download: https://domain/s/number/story-slug

Then use that as update link.

and fail the download of page displayed as
https://domain/s/number/some-random-string with a warning of story renamed, deleted, or replaced by another.

At that stage, only a manual inspection of the epub and the website could determine is the ids need to be adapted or the update should be permanently abandoned.

There is however a problem: https://domain/s/number:i/story-slug and https://domain/s/number:i/bad-slug

Both display the story index page, and https://domain/s/number/story-slug may show the current reading progress instead of the index.

So a call to https://domain/s/number/story-slug is the first step, There abort if it is not a story page Then determine if a call to https://domain/s/number:i is needed

Granted, my proposed changes would prevent update to link saved with the option append_datepublished_to_storyurl:true unless you decide to scrap the added date for updates. There is no way to confuse both format, as one use number-YYYY-MM-DD and the other use number/story-slug

JimmXinu commented 1 year ago

I'm not disagreeing this is an issue, but it's not a new one. It's been happening for almost two years.

You haven't said if you are a CLI or plugin user; as a general rule, I assume plugin.

Calibre Library Matching

Changing the story URL schemes for storiesonline.net / finestories.com / scifistories.com like this, they will not match existing stories in plugin users' libraries by identifier:url.

With the Search by Title/Author(s) for If Story Already Exists? feature enabled, this should (hopefully) not be hugely painful, but it's worth noting for the users who don't keep that feature on.

Chapter URLs

Another aspect is chapter URLs. For most sites, including SOL/etc, FFF accepts chapter URLs when it can.

SOL chapter URLs have a chapter title embedded, but from what I've seen, what you're calling the story-slug cannot be reliably extracted from a chapter URL. So while before, a chapter URL would yield the canonical story URL, with this change, FFF will initially be looking up a 'no-slug' story URL.

This is a bit of an odd corner case, really. Hopefully nobody will be trying to download with a chapter URL from a deleted story.

cookie-crumb commented 1 year ago

Another way to attempt correcting the problem could be to first visit the storyInfo page /library/storyInfo.php?id=number Formerly, it was reserved to premium member, it seems to be available with no login. With the information available on that page, there is no more need to access the author page or to change the listing to a classic view. A mismatch in the author name on the storyInfo should be a reason sufficient to stop the update.

JimmXinu commented 1 year ago

I'm not interested in rewriting this adapter to that extent. Especially for something they might put behind the pay wall again.

If you want to submit a PR for it, I will consider it.

I've put up a plugin test version with a change for this.

cookie-crumb commented 1 year ago

One major problem with the commit: it errors on download of stories not yet in the library. Otherwise, it solve the id reuse issue better than I expected. see my comment on the commit itself.

JimmXinu commented 1 year ago

I'm not seeing that.

What's the debug output look like? Run Calibre in debug mode from the Calibre Preferences drop down.

Also, FYI, append_datepublished_to_storyurl is removed.

cookie-crumb commented 1 year ago

The error can be tracked to reading in progress. (You need a premium account and also have the story you want to download in your website library to replicate that.) Downloading the url: https://domain/s/number/story-slug, it redirected to https://domain/s/number:chapter. FanFicFare then fetched https://domain/s/number:chapter:i note that the :i has no effect on that url. So the site adapter tried to find an h1 on the wrong page.

Instead of https://domain/s/number:chapter:i it should have fetched https://domain/s/number:i

Also note that SOL and the others current way to go back to the index page is: https://domain/s/number/story-slug?ind=1 but https://domain/s/number:i is still working.

Traceback (most recent call last):
  File "calibre_plugins.fanficfare_plugin.dialogs", line 704, in do_loop
  File "calibre_plugins.fanficfare_plugin.fff_plugin", line 1343, in prep_download_loop
  File "calibre_plugins.fanficfare_plugin.fff_plugin", line 1215, in get_story_metadata_only
  File "C:\Users\gvvwa\AppData\Roaming\calibre\plugins\FanFicFare.zip\fanficfare\adapters\base_adapter.py", line 320, in getStoryMetadataOnly
  File "C:\Users\gvvwa\AppData\Roaming\calibre\plugins\FanFicFare.zip\fanficfare\adapters\adapter_storiesonlinenet.py", line 192, in doExtractChapterUrlsAndMetadata
  File "C:\Users\gvvwa\AppData\Roaming\calibre\plugins\FanFicFare.zip\fanficfare\htmlcleanup.py", line 71, in stripHTML
AttributeError: 'NoneType' object has no attribute 'get_text'
JimmXinu commented 1 year ago

(I assume you are sure you're not bumping into the same problem as #871)

You're saying that for premium users, the site auto-redirects the story URL to a chapter URL, probably the latest / your current chapter?

Isn't that the same point that we are now depending on to give correct the story-slug if we only have a number URL? Fun.

All the chapters I see have <a href="/s/number/story-slug" rel="bookmark"> as a unique link back to the story URL. I've added code to look at the redirected URL and if it looks like a chapter (matches r".*/s/\d+:\d+/.*"), then look for and a tag with rel="bookmark" and use it's href as the story URL.

I've also changed the TOC fetch to https://domain/s/number/story-slug?ind=1 in the hopes it will prevent the redirect to chapter.

I don't have a premium membership for those sites and I'm not interested in getting one, so I'm working blind here.

I've put up a plugin test version with a change for this.

cookie-crumb commented 1 year ago

The login attempt in the code came too late in the process, it should be in the first try probably with redirect.

After the login, it attempt to load: https://domain/s/number/story-slug:i instead of https://domain/s/number/story-slug?ind=1 That still redirect to a chapter page and continue to fail.

I tried to modify the .py inside the zip to test, but could not create a valid plugin.

JimmXinu commented 1 year ago

As a general rule, FFF only tries to login when it has to. There are sites that use the additional setting always_login, but we usually use that when the site doesn't differentiate between 'you need to login' and 'story does not exist'.

Can you confirm that ?ind=1 instead of :i does not redirect to chapter?

Because if we have no way to reliably get the ToC/story index for premium accounts, this is never going to work.

JimmXinu commented 1 year ago

Here is a test version that also uses ?ind=1 after login. FanFicFare.zip

cookie-crumb commented 1 year ago

?ind=1 is what is found in the story url of the chapter once the user is logged on! <a href="/s/number/story-slug?ind=1" rel="bookmark">Story Title</a> The ?ind=1 is only present when it is needed to reach the index.

That last fanficfire.zip worked in the case of the redirected index! I'll test it more for the other cases However, the sequence of fetch seems bad:

========== MISS (GET) BasicCache
https://domain/s/number/story-slug
---------- REQ (GET) RequestsFetcher
https://domain/s/number/story-slug
**** LOGIN TEST SHOULD BE DONE HERE****
========== MISS (GET) BasicCache
https://domain/s/number/story-slug?ind=1
---------- REQ (GET) RequestsFetcher
https://domain/s/number/story-slug?ind=1
**** The above is fetched without login, so it's useless ****
========== MISS (GET) BasicCache
https://domain/sol-secure/login.php
---------- REQ (GET) RequestsFetcher
https://domain/sol-secure/login.php
========== MISS (POST) BasicCache
https://login.wlpc.com/index.php
---------- REQ (POST) RequestsFetcher
https://login.wlpc.com/index.php
========== MISS (GET) BasicCache
https://domain/s/number/story-slug?ind=1
---------- REQ (GET) RequestsFetcher
https://domain/s/number/story-slug?ind=1
========== MISS (GET) BasicCache
https://domain/a/author-slug/1
---------- REQ (GET) RequestsFetcher
https://domain/a/author-slug/1
========== MISS (GET) BasicCache
https://domain/series/series_number/series-slug
---------- REQ (GET) RequestsFetcher
https://domain/series/series_number/series-slug
...

As you can see, the login is too late in the process.

Also for reference, on some stories, the chapter urls have changed an optional chapter title is post-fixed to the url: https://domain/s/story_number:chapter_number/chapter-slug

So, I believe, the good process would be: load the given url given by the user (story index or chapter url), test if login is required Reload after login if login was needed. Get the bookmark href it will have the correct url in all case after the login:

JimmXinu commented 1 year ago

I've never seen a ?ind=1 URL on those sites. As you pointed out, your membership is different.

That suggests that the 'take story URL from chapter rel=bookmark' code would lead to URLs with ?ind=1?ind=1? Are you seeing that?

I'm not interested in attempting to optimize this adapter, especially when I can't test all the cases. Double so since it supports 3 different sites that are only mostly identical.

cookie-crumb commented 1 year ago

I finally managed to update the zip locally...

No, I don't see ?ind=1?ind=1 because there is still a bug in the code. 1 The login is too late, so the ?ind=1 isn't there. After correcting that, 2 the regex fail, it should be: ".*\/s\/\d+:\d+\/?.*" and not ".*/s/\d+:\d+/.*" (The redirect is done without the chapter slug and lack a trailling /)

here is a working code, I removed all trace of ?ind=1 and fetch it from bookmark.

    ## Getting the chapter list and the meta data, plus 'is adult' checking.
    def doExtractChapterUrlsAndMetadata(self, get_cover=True):
        url = self.url
        logger.debug("URL: "+url)

        try:
            ## Hit story URL to check for changed title part -- if the
            ## title has changed or (more likely?) the ID number has
            ## been reassigned to a different title, this will fail.
            (data,url) = self.get_request_redirected(url)
            if self.needToLoginCheck(data):
                # need to log in for this one.
                self.performLogin(url)
                (data,url) = self.get_request_redirected(url,usecache=False)
            logger.debug("redirect url:%s"%url)
            if re.match(r".*\/s\/\d+:\d+\/?.*",url): 
                # A chapter instead of index page.  Reported in #882,
                # premium users sometimes redirected to chapter?
                logger.debug("Looking for story URL after redirected to chapter?")
                soup = self.make_soup(data)
                a = soup.find('a',rel="bookmark")
                url = 'https://'+self.host+a['href']
            logger.info("use url: "+url)
        except exceptions.HTTPErrorFFF as e:
            raise exceptions.FailedToDownload("Page Not Found - Story ID Reused? (%s)" % url)

        try:
            data = self.get_request(url)
            self._setURL(url) ## To include /title-in-url
            # logger.debug(data)
        except exceptions.HTTPErrorFFF as e:
            if e.status_code in (401, 403, 410):
                data = 'Log In' # to trip needToLoginCheck
            else:
                raise e

        if "Access denied. This story has not been validated by the adminstrators of this site." in data:
            raise exceptions.AccessDenied(self.getSiteDomain() +" says: Access denied. This story has not been validated by the adminstrators of this site.")
        elif "Error! The story you're trying to access is being filtered by your choice of contents filtering." in data:
            raise exceptions.FailedToDownload(self.getSiteDomain() +" says: Error! The story you're trying to access is being filtered by your choice of contents filtering.")
        elif "Error! Daily Limit Reached" in data or "Sorry! You have reached your daily limit of" in data:
            raise exceptions.FailedToDownload(self.getSiteDomain() +" says: Error! Daily Limit Reached")
cookie-crumb commented 1 year ago

r".*/s/\d+:\d+/?.*" works too, apparently no need to escape / in Python but the ? was missing and is needed

JimmXinu commented 1 year ago

I admit I did not understand that you meant the redirect to chapter URL for premium didn't happen before login. Again, programming blind.

Your code ignores the 401, 403 & 410 codes that currently can trigger the need for login. I don't think we can guarantee those aren't still needed.

I've also just found that the site does not redirect from https://domain/s/number to https://domain/s/number/story-slug when already logged in (with a non-premium account).

Since FFF remembers your cookies & cache between downloads in the same session, we can't trust the redirect to give us the story-slug version of the story URL because it won't past the first story.

Change to taking from rel="bookmark" for all stories.

Here's a test version FanFicFare.zip

cookie-crumb commented 1 year ago

It seems to work for me, But I can't test with a regular account: There is a protection in place against the use of multiple accounts.

         if "?ind=1" not in url:
            url = url+"?ind=1"

Unless you use this to cache bust the non logged in original url request, It should be commented out. If it is needed it is already present from the the rel="bookmark". It make no sense to add it for stories with no index or free accounts.

Also, should the webmaster change his method to return to the index, there is a good chance simply using the rel="bookmark" will still work.

Removing ?ind=1 from the url saved to the ids afterwards as you did is probably a good idea.

A bit of optimization can be done to limit the soup calls, the number of requests, and handle future nav change I suggested to the webmaster, I'll follow up after testing it.

In the score keeping on the current layout '-' has been replaced by '*', but that isn't used by Calibre. Should I file another bug?