FicHub / fichub.net

web frontend for generating ebooks from fanfic
https://fichub.net
GNU Affero General Public License v3.0

Data submission for stories FH has trouble with, or cannot find (anymore) #10

Open andreas-kupries opened 3 years ago

andreas-kupries commented 3 years ago

After adding all my old ff.net references to my new database and tool, quite a number of stories came up with problems.

The easier kind can be seen in the session below. After that I will talk about thornier things (thorny because the issues around them are not technical).

hephaistos:(541) ~/.fichub > fh fetch 3333 3334 3335
Sun Jun 13 12:42:33 CEST 2021 NOTE [3333]: Query https://www.fanfiction.net/s/85035 (Libby Thomas: A Duet of Pigtails) ...
Sun Jun 13 12:42:35 CEST 2021 NOTE [3334]: Query https://www.fanfiction.net/s/85084 (Libby Thomas: A Duet of Pigtails - The Age of the Black Blade) ...
Sun Jun 13 12:42:38 CEST 2021 NOTE [3335]: Query https://www.fanfiction.net/s/130609 (Libby Thomas: A Duet of Pigtails - Prologue) ...
Sun Jun 13 12:42:41 CEST 2021 ERR  [3333]: -3 - err: missing chapter: 2/12 - fanfiction.net is fragile at the moment; please try again later or check the discord
Sun Jun 13 12:42:41 CEST 2021 ERR  [3334]: -3 - err: missing chapter: 2/2 - fanfiction.net is fragile at the moment; please try again later or check the discord
Sun Jun 13 12:42:41 CEST 2021 ERR  [3335]: -3 - err: missing chapter: 1/2 - fanfiction.net is fragile at the moment; please try again later or check the discord

So, fichub recognizes the stories as existing, has trouble getting all the chapters from FF, and thus is not serving anything.

Thing is, I do have the raw HTML for all the chapters of these stories, from their last download (the files are dated May 7, 2011).

Would it make sense to zip them up and make them available to you? Would your system be able to ingest such HTML just from the chapters of a story? (I could provide the author page HTML also.) Of note, AFAIK FF has changed the detailed format of its chapter HTML a few times over the course of its existence. I cannot guarantee that the HTML files I have are in a format your current system could process.

hephaistos:(542) ~/.fichub > ls tmp/*
tmp/130609:
001.html  002.html

tmp/85035:
001.html  002.html  003.html  004.html  005.html  006.html  007.html  008.html  009.html  010.html  011.html  012.html

tmp/85084:
001.html  002.html
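
Before handing over a set of chapter files like the ones listed above, it can help to confirm that no chapter is missing in the middle of a story. The following is a minimal sketch, assuming the layout shown in the listing (one directory per story id, chapters named `NNN.html`); the function names are hypothetical and not part of the `fh` tool or FicHub.

```python
import re
from pathlib import Path

# Chapter files are assumed to be named 001.html, 002.html, ...
CHAPTER_RE = re.compile(r"^(\d{3})\.html$")


def missing_chapters(story_dir: Path) -> list[int]:
    """Return the chapter numbers absent from 1..highest chapter found."""
    found = set()
    for entry in story_dir.iterdir():
        match = CHAPTER_RE.match(entry.name)
        if match and entry.is_file():
            found.add(int(match.group(1)))
    if not found:
        return []
    return [n for n in range(1, max(found) + 1) if n not in found]


def report(root: Path) -> dict[str, list[int]]:
    """Map each story directory name under root to its missing chapters."""
    return {
        story.name: missing_chapters(story)
        for story in sorted(root.iterdir())
        if story.is_dir()
    }
```

For the listing above, `report(Path("tmp"))` should return an empty list for each of the three stories. Note that this can only detect gaps below the highest chapter present; it cannot tell whether trailing chapters were never downloaded.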

Now the thorny issue, a question of policy rather than technology:

Quite a number of my stories have the states deleted or deleted-complete. A lot of the complete stories also seem to have silently become deleted-complete over time, i.e. FH tells me that it cannot find them at FF anymore.

However, I again have raw html chapter files for pretty much all of them.

Which means, even if you can ingest raw chapter files, there is a natural question of policy in need of an answer:

Should FH (be able to) serve FF stories the authors have deleted from FF? (This of course applies generally to all sites FH pulls information from.)

Right now this is only about all these old stories I have the data for. However, it will also become a question for the future, as stories FH already has in its database get removed from FF by their authors for whatever reason.

In their case the question becomes: will/should FH then stop serving them?

iridescent-beacon commented 3 years ago

> So, fichub recognizes the stories as existing, has trouble getting all the chapters from FF, and thus is not serving anything.

The root cause for those three, and likely for many other very old stories failing to export through fichub, is fragile HTML handling on fichub's side that's already slated to be replaced. There are reams of malformed and odd HTML in the early fics, ranging from MS Word tags, to literal < and > being used as dialogue quotes instead of &lt; and &gt;, to several different phases where FFN tried to standardize its old stories onto a subset of HTML by, among other things, replacing the first letter of every tag it no longer allows with x.

In this case, fichub caches all the chapters but then fails to serve them because its sanitization process falls over.
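
To illustrate the kind of tolerance this calls for, here is a minimal sketch that keeps tags from a small allow-list and escapes every other angle bracket, so that dialogue written as `<Hello>` or a mangled tag such as `<xont>` survives as visible text instead of breaking the parse. The allow-list and the function name are assumptions made for this example; this is not FicHub's actual sanitizer.

```python
import re

# Assumed allow-list for illustration only; not FicHub's real configuration.
ALLOWED_TAGS = {"p", "br", "b", "i", "em", "strong", "u", "hr", "center"}

# Anything shaped like a tag: optional slash, a name, then text up to '>'.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\b([^<>]*)>")


def _escape_angles(text: str) -> str:
    """Escape only '<' and '>' so existing entities such as &amp; stay intact."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def escape_stray_angles(raw: str) -> str:
    """Keep allow-listed tags verbatim and escape all other angle brackets."""
    out = []
    pos = 0
    for match in TAG_RE.finditer(raw):
        # Text between tag-like matches can only contain stray brackets.
        out.append(_escape_angles(raw[pos:match.start()]))
        if match.group(2).lower() in ALLOWED_TAGS:
            out.append(match.group(0))
        else:
            # Unknown "tags" (dialogue, x-prefixed leftovers) become text.
            out.append(_escape_angles(match.group(0)))
        pos = match.end()
    out.append(_escape_angles(raw[pos:]))
    return "".join(out)
```

Escaping rather than dropping unknown constructs is a deliberate choice in this sketch: it never loses story text, at the cost of showing leftover pseudo-tags to the reader.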

Those three specific ids should be working now.

> Would it make sense to zip them up and make them available to you? Would your system be able to ingest such HTML just from the chapters of a story? (I could provide the author page HTML also.) Of note, AFAIK FF has changed the detailed format of its chapter HTML a few times over the course of its existence. I cannot guarantee that the HTML files I have are in a format your current system could process.

I think it would make sense to zip up what you have to be ingested at some point. That process is not fully in place, but there have been one-offs done already. Importantly, there's a large archive.org dataset sitting on my home computer waiting to be processed, all of it in a previous FFN layout that still needs support.

> In their case the question becomes: will/should FH then stop serving them?

That's a thorny issue indeed. I think it may depend on the reason it's no longer available upstream. Complying with laws is probably going to be the default stance in the case of egregiously illegal content. There's a much wider swath of fics that have been deleted not by their author but by FFN for some TOC violation, which may not matter to fichub. In either case, fichub is hardly the only other copy of many of these fics (and of everything else on the internet), so I'm not sure there's much purpose in someone nagging various repositories to expunge them. Will have to see how it plays out in practice though.

andreas-kupries commented 3 years ago

> > So, fichub recognizes the stories as existing, has trouble getting all the chapters from FF, and thus is not serving anything.

> The root cause for those three, and likely for many other very old stories failing to export through fichub, is fragile HTML handling on fichub's side that's already slated to be replaced. There are reams of malformed and odd HTML in the early fics, ranging from MS Word tags, to literal < and > being used as dialogue quotes instead of &lt; and &gt;, to several different phases where FFN tried to standardize its old stories onto a subset of HTML by, among other things, replacing the first letter of every tag it no longer allows with x.

Ouch. While I knew that FF had changed things around on the reader side, I had no idea it was this bad even on the writer's side.

> In this case, fichub caches all the chapters but then fails to serve them because its sanitization process falls over.

Sensible.

> Those three specific ids should be working now.

I am sorry to tell you that they do not. The error messages have changed, however; they now say that the fic cannot be found. Not too bothered, though.

> > Would it make sense to zip them up and make them available to you? Would your system be able to ingest such HTML just from the chapters of a story? (I could provide the author page HTML also.) Of note, AFAIK FF has changed the detailed format of its chapter HTML a few times over the course of its existence. I cannot guarantee that the HTML files I have are in a format your current system could process.

> I think it would make sense to zip up what you have to be ingested at some point. That process is not fully in place, but there have been one-offs done already. Importantly, there's a large archive.org dataset sitting on my home computer waiting to be processed, all of it in a previous FFN layout that still needs support.

Ok. I have packaged my story data up into an xz-compressed tarball (650M in size; unpacked it is 4.5G).
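
For anyone wanting to prepare a similar submission, the following is a minimal sketch of producing an xz-compressed tarball with Python's standard `tarfile` module. The directory layout and the function name are assumptions for this example; the thread does not say how the archive above was actually built.

```python
import tarfile
from pathlib import Path


def pack_stories(root: Path, archive: Path) -> int:
    """Add every *.html file below root to an xz-compressed tarball.

    Paths inside the archive are stored relative to root, so the story id
    directories end up at the top level. Returns the number of files added.
    """
    count = 0
    with tarfile.open(archive, "w:xz") as tar:
        for path in sorted(root.rglob("*.html")):
            if path.is_file():
                tar.add(path, arcname=path.relative_to(root).as_posix())
                count += 1
    return count
```

A call such as `pack_stories(Path("tmp"), Path("stories.tar.xz"))` would pack the chapter directories; xz is slow to compress but usually does well on highly repetitive HTML.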

What is your preferred way of getting this archive over to you? While I could put it up somewhere under https://akupries.tclers.tk/tmp/... until you have grabbed it, I am not sure that exposing the link here, to the public, would be the best idea.

> > In their case the question becomes: will/should FH then stop serving them?

> That's a thorny issue indeed. I think it may depend on the reason it's no longer available upstream. Complying with laws is probably going to be the default stance in the case of egregiously illegal content. There's a much wider swath of fics that have been deleted not by their author but by FFN for some TOC violation, which may not matter to fichub. In either case, fichub is hardly the only other copy of many of these fics (and of everything else on the internet), so I'm not sure there's much purpose in someone nagging various repositories to expunge them. Will have to see how it plays out in practice though.

Part of the trouble here is that I certainly do not know the reasons for story removal. While one could ask FF for a general list of stories they deleted for TOC violations, I have no idea whether they maintain such a list or would be willing to hand it out to interested people. This is further muddled by the fact that I have heard authors complain about coordinated campaigns to get specific stories removed even without an actual TOC violation. And of course, I have no idea how credible that is in general, or for specific authors.

In the end I have no good solution.

iridescent-beacon commented 3 years ago

> > Those three specific ids should be working now.

> I am sorry to tell you that they do not. The error messages have changed, however; they now say that the fic cannot be found. Not too bothered, though.

There might be a separate issue then since they work fine from where I'm sitting:

for url in https://www.fanfiction.net/s/85035 https://www.fanfiction.net/s/85084 https://www.fanfiction.net/s/130609 ; do
    echo "$url"
    curl -s "https://fichub.net/api/v0/epub?q=${url}" | jq '{err, msg, hash: .hashes | .epub}'
done
https://www.fanfiction.net/s/85035
{
  "err": 0,
  "msg": null,
  "hash": "b3193a652a8760633af8558057baefef"
}
https://www.fanfiction.net/s/85084
{
  "err": 0,
  "msg": null,
  "hash": "69024f1f39f17bd3c1a4da638be2b47d"
}
https://www.fanfiction.net/s/130609
{
  "err": 0,
  "msg": null,
  "hash": "376e4ad5d2b30195eb7023c52ae60db4"
}
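
The same check can be sketched in Python, which also makes it easy to URL-encode the `q` parameter. The endpoint and the `err`, `msg`, and `hashes.epub` fields are taken from the shell session above; everything else (function names, the timeout, the handling of a missing `hashes` object) is an assumption for illustration and not a documented client.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint as used in the curl command above.
API = "https://fichub.net/api/v0/epub"


def build_query_url(story_url: str) -> str:
    """Build the API URL with the story URL percent-encoded in `q`."""
    return f"{API}?{urlencode({'q': story_url})}"


def summarize(payload: dict) -> dict:
    """Reduce a response to the three fields the jq filter above selects."""
    hashes = payload.get("hashes") or {}  # assumed to be absent on errors
    return {
        "err": payload.get("err"),
        "msg": payload.get("msg"),
        "hash": hashes.get("epub"),
    }


def check(story_url: str, timeout: float = 30.0) -> dict:
    """Query the API for one story URL and return the summarized result."""
    with urlopen(build_query_url(story_url), timeout=timeout) as resp:
        return summarize(json.load(resp))
```

Keeping `build_query_url` and `summarize` separate from the network call means they can be exercised offline, which is handy when the upstream site is flaky.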

Maybe something to debug in a separate issue?

> What is your preferred way of getting this archive over to you?

You can get my email from this repo's git log and send me a link there, or on libera I'm irides and sitting in ##fichub (tangent: I was iris for years on freenode, but got nick sniped by a few hours in the shuffle to libera; oh well, wasn't actually using it much lately)

> In the end I have no good solution.

Me either. Will play it by ear for now.

andreas-kupries commented 3 years ago

> > > Those three specific ids should be working now.

> > I am sorry to tell you that they do not. The error messages have changed, however; they now say that the fic cannot be found. Not too bothered, though.

> There might be a separate issue then since they work fine from where I'm sitting:

Thank you for checking. Rechecking here, the issue has disappeared for me as well. It seems it was some transient hiccup.

> Maybe something to debug in a separate issue?

Yes, if it had not disappeared when I rechecked.

> > What is your preferred way of getting this archive over to you?

> You can get my email from this repo's git log and send me a link there,

Will do.

andreas-kupries commented 3 years ago

Link sent

iridescent-beacon commented 3 years ago

> Link sent

Thanks! I was able to grab a copy over ipv4, and sent you a reply letting you know that ipv6 isn't working -- but that may be on my end. I don't send much email from that domain so I'm not sure if my reply will actually reach your inbox at this time, hence the comment here too.

I'm going to leave this issue open until I actually ingest it.

andreas-kupries commented 3 years ago

I got your mail.