FicHub / fichub.net

web frontend for generating ebooks from fanfic
https://fichub.net
GNU Affero General Public License v3.0

Wrong Author Association #7

Closed andreas-kupries closed 3 years ago

andreas-kupries commented 3 years ago

The SB story https://forums.spacebattles.com/threads/harry-and-the-shipgirls-goblet-of-feels-a-hp-kancolle-snippet-collection.772633 belongs to and is written by Harry Leferts.

fichub believes that the author is CV12Hornet instead.

I suspect this misidentification comes from the staff post by CV12Hornet, which is stickied to the thread and shown at the top of each page.

iridescent-beacon commented 3 years ago

Thank you for the report; that is definitely the case. Was not aware that mods could sticky posts above the actual author, fun.

I've made a backend change that should fix this for SB, SV, and QQ going forward, but it's possible other XenForo instances will need tweaking. It now tries to find the author by looking at any a.username links in the thread title section first.
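For illustration, a minimal sketch of that kind of lookup, assuming BeautifulSoup; the selector names are guesses for a generic XenForo 2 theme, not fichub's actual code:

```python
from typing import Optional
from bs4 import BeautifulSoup

def find_thread_author(html: str) -> Optional[str]:
    soup = BeautifulSoup(html, "html.parser")
    # Prefer a.username links inside the thread title section, so a
    # moderator's post stickied above the first post can't shadow
    # the real author.
    title_block = soup.select_one(".p-title")  # selector name is a guess
    if title_block is not None:
        link = title_block.select_one("a.username")
        if link is not None:
            return link.get_text(strip=True)
    # Otherwise fall back to the author of the first post.
    first_post = soup.select_one("article a.username")
    return first_post.get_text(strip=True) if first_post else None
```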

I've forced a refresh of that fic and it is now showing the correct author. The background updater should eventually fix any other stale metadata with issues, but I've kicked off a scrub of the existing fics to speed it along (though it will still take a while to hit everything). If you see any others with incorrect metadata please let me know so I can verify that it's simply still pending and not broken in a different way.

The XenForo parser desperately needs cleanup, still hoping to get the free time to tackle that before the heat death of the universe :)

andreas-kupries commented 3 years ago

> Thank you for the report; that is definitely the case. Was not aware that mods could sticky posts above the actual author, fun.

Yeah. It is the only thread I am aware of (i.e. am reading) where the mods went that far to make sure that no poster could claim they did not know the rules governing the thread.

> I've made a backend change that should fix this for SB, SV, and QQ going forward, but it's possible other XenForo instances will need tweaking. It now tries to find the author by looking at any a.username links in the thread title section first.

> I've forced a refresh of that fic and it is now showing the correct author.

I can confirm this from my script's output:

```
NOTE [312]: Query https://forums.spacebattles.com/threads/harry-and-the-shipgirls-goblet-of-feels-a-hp-kancolle-snippet-collection.772633 (CV12Hornet: Harry And The Shipgirls: Goblet of Feels (A HP/Kancolle Snippet Collection)) ...
WARN [312]: Changed
WARN [312]:   * SB CV12Hornet: Harry And The Shipgirls: Goblet of Feels (A HP/Kancolle Snippet Collection)
WARN [312]:   - Author  : Harry Leferts (was: CV12Hornet)
NOTE [312]: Saving to ~/.fichub/epub/Harry_Leferts/312.Harry_And_The_Shipgirls:_Goblet_of_Feels_(A_HP_Kancolle_Snippet_Collection).epub
NOTE [312]: Getting.. /cache/epub/p7Dz7aox/Harry_And_The_Shipgirls_Goblet_of_Feels_A_HP_Kancolle_Snippet_Collection_by_Harry_Leferts-p7Dz7aox.epub?h=e12f837902fc00eaa808a0118f980c1e
```

> The background updater should eventually fix any other stale metadata with issues, but I've kicked off a scrub of the existing fics to speed it along (though it will still take a while to hit everything).

Out of curiosity, how many stories are in your database/cache by now, and what is the length of a full update cycle?

> If you see any others with incorrect metadata please let me know so I can verify that it's simply still pending and not broken in a different way.

Sure. So far this was the only one, and I only saw it by accident because I grepped by author and one of the expected stories did not show up in the result.

> The XenForo parser desperately needs cleanup, still hoping to get the free time to tackle that before the heat death of the universe :)

iridescent-beacon commented 3 years ago

Great! Thank you for confirming.

> Out of curiosity, how many stories are in your database/cache by now, and what is the length of a full update cycle?

The live "production" database has 168,882 fics in it currently, of which only 8,383 are from SB/SV/QQ. That backs into an archival database that aims to cover entire sites but isn't completely integrated, but does have ~8.2M fics from ffn for example.

At the outside, the background updater should check any ongoing fic once every two months, but it checks much more frequently based on when the fic last updated.

I'm not sure how long the scrub will take; I'll report back when it's finished. I expect it to take a day or so. It's just a quick and dirty script that reprocesses the fics serially -- both metadata and content, which is slow. It could certainly be made much faster before hitting hardware limitations.

It'd be much better to have a stale marker that forces a refresh before an API response, so no users see stale data while the scrub is running, but the current codebase was largely written without users in mind -- figured it'd just be me :p It's very much a work in progress :)
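As a rough sketch of that stale-marker idea (the store and refresh function below are stand-ins, not fichub internals):

```python
from typing import Dict

FICS: Dict[str, dict] = {}  # fic_id -> cached metadata

def refresh_metadata(fic_id: str) -> dict:
    # Stand-in for a real refetch from the source site.
    return {"id": fic_id, "stale": False}

def get_fic(fic_id: str) -> dict:
    fic = FICS.get(fic_id)
    if fic is None or fic.get("stale"):
        # Refresh before responding, so API callers never see
        # metadata the background scrub hasn't reached yet.
        fic = refresh_metadata(fic_id)
        FICS[fic_id] = fic
    return fic
```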

andreas-kupries commented 3 years ago

> Great! Thank you for confirming.

It is no trouble at all. With my own old script for checking FFn broken by their changes, it really is best to support those still working on such tools. I know I could look into your code to see how you convince FFn to divulge the data and use that to fix my own, but right now (and since January) the motivation is simply not there (**). And then there is the fact that my script was FFn-only; I never got around to handling SV, SB, etc. when I started following stories there as well. So your work is IMHO strictly better than mine anyway. :+1:

> Out of curiosity, how many stories are in your database/cache by now, and what is the length of a full update cycle?

The live "production" database has 168,882 fics in it currently, of which only 8,383 are from SB/SV/QQ. That backs into an archival database that aims to cover entire sites but isn't completely integrated, but does have ~8.2M fics from ffn for example.

Wow. My new database (sqlite) currently has only 368 entries, with 270 marked as (possibly) active. I should really import the pseudo-database of the old script (a structured text file, essentially runnable as a Tcl script) into the new one. That would add around 2700 and change as active, plus around 3400 and change completed ones that I should make a second copy of.

And it reminds me of another thing, which I will open a new ticket for; something to consider for the long term.

> At the outside, the background updater should check any ongoing fic once every two months, but it checks much more frequently based on when the fic last updated.

Sensible to have a backoff to de-prioritize stories with slow updates.
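One way such a backoff could look, scaling the recheck interval with how long a fic has been idle and capping it at the two-month outside (purely illustrative, not fichub's actual scheduler):

```python
from datetime import datetime, timedelta

MIN_INTERVAL = timedelta(hours=6)
MAX_INTERVAL = timedelta(days=60)  # the "once every two months" outside

def next_check(last_updated: datetime, now: datetime) -> datetime:
    # Recently updated fics get checked often; quiet ones back off
    # until they hit the two-month ceiling.
    idle = now - last_updated
    interval = min(max(idle / 4, MIN_INTERVAL), MAX_INTERVAL)
    return now + interval
```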

> I'm not sure how long the scrub will take; I'll report back when it's finished. I expect it to take a day or so. It's just a quick and dirty script that reprocesses the fics serially -- both metadata and content, which is slow. It could certainly be made much faster before hitting hardware limitations.

Ok, and no worries.

> It'd be much better to have a stale marker that forces a refresh before an API response, so no users see stale data while the scrub is running, but the current codebase was largely written without users in mind -- figured it'd just be me :p It's very much a work in progress :)

Same as for my old script. Quick and dirty Tcl to run through the stories and check for each whether "last chapter + 1" exists, and if yes, retrieve it. That kind of check detected deleted stories as well. It did not detect stories becoming shorter, i.e. chapters removed and re-uploaded with fewer. No detection of changed chapters in the middle either (*). No epub (not sure the format even existed when the script was first written), just the raw HTML stored in a directory, split by author and story ids.

(*) I wanted it to run fast, and the load on FFn light. ... Time was roughly 10 to 20 minutes per run. Of course, see above, I did not have as many stories to check as you do. Thinking about it again now, it likely would have been better to just pull the author pages and compare the current data about their stories against the data from the last pull, to spot any story with a changed chapter count or update date. Then keep a history of the last 10 update dates (or more) to predict the update rate and when to expect the next ... Ok, I am getting carried away here, I believe.

(**) The last good run was Dec 11 last year. Then I emerged from my move on Dec 20, and it was broken.
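For illustration, the "last chapter + 1" probe described above might look like this in Python; the URL layout matches ffn's /s/&lt;story_id&gt;/&lt;chapter&gt; scheme, but the error-page marker text below is approximate:

```python
import requests

def has_new_chapter(story_id: int, known_chapters: int) -> bool:
    url = f"https://www.fanfiction.net/s/{story_id}/{known_chapters + 1}"
    resp = requests.get(url, timeout=30)
    # ffn serves an error page rather than a 404 for a missing chapter,
    # so a plain status check is not enough; the marker text here is
    # approximate. A failed probe of chapter 1 would also flag deletion.
    return resp.ok and "Chapter not found" not in resp.text
```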

iridescent-beacon commented 3 years ago

> It is no trouble at all. With my own old script for checking FFn broken by their changes, it really is best to support those still working on such tools. I know I could look into your code to see how you convince FFn to divulge the data and use that to fix my own, but right now (and since January) the motivation is simply not there (**). And then there is the fact that my script was FFn-only; I never got around to handling SV, SB, etc. when I started following stories there as well. So your work is IMHO strictly better than mine anyway.

I appreciate it :) I completely understand about the motivation. I've been in a low-motivation rut for a while, but even when there is time and motivation there are often more pressing things than something ugly that does work. When I first set up fichub.net (then fic.pw) it was a temporary stand-in for the now-defunct omnibuser.com -- there was enough interest at the time that I really thought someone else would take over, but nothing ever came of it. While FanFicFare is still the 800 lb gorilla, there are no hosted alternatives I'm aware of. There are vague ideas about falling back on FanFicFare for unsupported sites, but I'm not sure how easy it will be to extract metadata from it, and I'm queasy about allowing arbitrary urls.

The live "production" database has 168,882 fics in it currently, of which only 8,383 are from SB/SV/QQ. That backs into an archival database that aims to cover entire sites but isn't completely integrated, but does have ~8.2M fics from ffn for example.

> Wow. My new database (sqlite) currently has only 368 entries, with 270 marked as (possibly) active. I should really import the pseudo-database of the old script (a structured text file, essentially runnable as a Tcl script) into the new one. That would add around 2700 and change as active, plus around 3400 and change completed ones that I should make a second copy of.

It grows fast when it's more than one person :) I've only read maybe 3k of them myself, and IIRC there were only 10k fics in the live db before it was opened up to other users -- mostly from ingesting recommendation lists.

> > It'd be much better to have a stale marker that forces a refresh before an API response, so no users see stale data while the scrub is running, but the current codebase was largely written without users in mind -- figured it'd just be me :p It's very much a work in progress :)

> Same as for my old script. Quick and dirty Tcl to run through the stories and check for each whether "last chapter + 1" exists, and if yes, retrieve it. That kind of check detected deleted stories as well. It did not detect stories becoming shorter, i.e. chapters removed and re-uploaded with fewer. No detection of changed chapters in the middle either (*). No epub (not sure the format even existed when the script was first written), just the raw HTML stored in a directory, split by author and story ids.

Aye, those same issues affect fichub. I've forced a refresh on a few instances users have reported. I'm not sure I want to jump straight to refetching all chapter content anytime any metadata changes. Maybe a constant-rate refetch task, so I can roughly plan resource usage, with something to bump fics with metadata changes to the top of the queue.
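A sketch of what that could look like: a queue drained by a fixed-rate worker, where a metadata change bumps a fic ahead of the steady background scan (names are illustrative, not fichub internals):

```python
import heapq
import itertools
from typing import List, Optional, Tuple

_counter = itertools.count()
_queue: List[Tuple[int, int, str]] = []  # (priority, tiebreak, fic_id)

def enqueue(fic_id: str, bumped: bool = False) -> None:
    # Priority 0 (metadata changed) sorts ahead of the steady scan at 1.
    heapq.heappush(_queue, (0 if bumped else 1, next(_counter), fic_id))

def next_to_refetch() -> Optional[str]:
    # Called from a fixed-rate worker loop, so resource usage stays
    # roughly constant and plannable.
    return heapq.heappop(_queue)[2] if _queue else None
```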

> (*) I wanted it to run fast, and the load on FFn light. ... Time was roughly 10 to 20 minutes per run. Of course, see above, I did not have as many stories to check as you do. Thinking about it again now, it likely would have been better to just pull the author pages and compare the current data about their stories against the data from the last pull, to spot any story with a changed chapter count or update date. Then keep a history of the last 10 update dates (or more) to predict the update rate and when to expect the next ... Ok, I am getting carried away here, I believe.

Agreed; I'm planning to transition to that myself. The archival updater works off fandom search pages, with a healthy slop factor to account for ffn's eventual consistency there.
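In rough form, that author-page comparison could be as simple as diffing the stored (chapter count, update date) pair per story against a fresh pull; a hypothetical sketch, not existing code:

```python
from typing import Dict, List, Tuple

# Both maps go from story_id to (chapter_count, updated_date).
def changed_stories(stored: Dict[str, Tuple[int, str]],
                    fetched: Dict[str, Tuple[int, str]]) -> List[str]:
    # A story is flagged if it is new or if either field changed.
    return [sid for sid, meta in fetched.items() if stored.get(sid) != meta]
```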

I already have code to read metadata from author pages for other workflows, but I still need to use it in the main background updater. I thought it'd be interesting to do some greedy minimal-covering-set selection, since author pages can carry story metadata from other authors too, but in reality it's probably not going to be much of an improvement over something more naive. It's easy to get carried away ;)
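The greedy covering idea in miniature, assuming each author page is known to carry metadata for some set of fics (illustrative only):

```python
from typing import Dict, List, Set

def greedy_cover(pages: Dict[str, Set[str]], wanted: Set[str]) -> List[str]:
    chosen: List[str] = []
    remaining = set(wanted)
    while remaining and pages:
        # Pick the author page covering the most still-uncovered fics.
        best = max(pages, key=lambda p: len(pages[p] & remaining))
        if not pages[best] & remaining:
            break  # the remaining fics appear on no known page
        chosen.append(best)
        remaining -= pages[best]
    return chosen
```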

iridescent-beacon commented 3 years ago

> I'm not sure how long the scrub will take; I'll report back when it's finished. I expect it to take a day or so.

This finished ages ago, on the 1st. It took almost four days, which is a sign that there's plenty of room for improvement, both in scrub speed and in having a stale marker, but those are separate tasks.

Closing this since the original issue should be fixed. Feel free to reopen if you see the same issue, or create additional issues.