FicHub / fichub.net

web frontend for generating ebooks from fanfic
https://fichub.net
GNU Affero General Public License v3.0

Bulk mirroring of databases #8

Open · andreas-kupries opened this issue 3 years ago

andreas-kupries commented 3 years ago

Something to consider, I believe, for the long term:

The ability to bulk mirror the entire production/archive databases, for replication and backup. Maybe rsyncable, or any other method for incremental transfer, to reduce the effort of keeping current after the initial full replication.
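
To make "rsyncable" concrete, here is a minimal sketch of the kind of dump I have in mind, in Python, assuming a purely hypothetical SQLite `fics` table. If the rows are written in a stable order, unchanged rows produce identical lines, so rsync's delta transfer only has to ship the parts of the file that actually changed:

```python
# Minimal sketch; the database path and the `fics` table are hypothetical.
import json
import sqlite3

def rsyncable_dump(db_path: str, out_path: str) -> None:
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # rows become name-addressable mappings
    with open(out_path, "w", encoding="utf-8") as out:
        # A stable ORDER BY means unchanged rows land on identical lines,
        # so rsync's rolling checksum skips them on incremental transfers.
        for row in conn.execute("SELECT * FROM fics ORDER BY id"):
            out.write(json.dumps(dict(row), sort_keys=True) + "\n")
    conn.close()
```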

iridescent-beacon commented 3 years ago

> The ability to bulk mirror the entire production/archive databases.

Are you thinking raw request content, or metadata and story content, or some combination?

> Maybe rsyncable, or any other method for incremental transfer, to reduce the effort of keeping current after the initial full replication.

The request table is log structured, so it's pretty easy to make incremental dumps, and in fact I do. I'm not sure how publicly I want to share those, but if you want to get in contact through Discord, or if you're on IRC somewhere, we can discuss it.

For metadata and content it may make sense to just dump the changed rows since the last run.
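
To sketch both of those, assuming a hypothetical SQLite layout with an append-only `requests` table keyed by a monotonically increasing `id` and a mutable `meta` table carrying an `updated_at` column:

```python
# Sketch only; table and column names are hypothetical.
import sqlite3

def dump_new_requests(conn: sqlite3.Connection, last_id: int) -> list:
    # Log-structured table: rows are only ever appended, so an
    # incremental dump is just everything past the last id we shipped.
    return conn.execute(
        "SELECT * FROM requests WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()

def dump_changed_meta(conn: sqlite3.Connection, since: str) -> list:
    # Mutable table: pick up only the rows touched since the last run.
    return conn.execute(
        "SELECT * FROM meta WHERE updated_at > ? ORDER BY updated_at",
        (since,),
    ).fetchall()
```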

Ideally the parser could reconstruct the history of the fic metadata and content given an ordered series of requests, which is partially supported and what I'm working towards. Then just running that on the newest batch of requests would get someone up to date from any point, and I could periodically make metadata and content snapshots for people to start from if they don't want to process 60M requests or whatever it might be :)
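
Very roughly, that replay is a fold over the ordered requests; `parse_fic` below is a stand-in for the real parser, not its actual API:

```python
# Rough shape of the replay idea; `parse_fic` is hypothetical and stands
# in for the real parser. Each request carries the raw content fetched
# from the origin site at some point in time.
from dataclasses import dataclass, field

@dataclass
class FicState:
    meta: dict = field(default_factory=dict)       # latest metadata
    chapters: dict = field(default_factory=dict)   # chapter id -> text
    history: list = field(default_factory=list)    # metadata over time

def replay(requests, parse_fic, snapshot=None) -> FicState:
    # `snapshot` lets someone start from a periodic dump instead of
    # replaying all 60M requests from the very beginning.
    state = snapshot or FicState()
    for req in requests:  # must be processed in request order
        meta, chapters = parse_fic(req)
        state.history.append(meta)   # keep the full metadata history
        state.meta = meta            # latest wins
        state.chapters.update(chapters)
    return state
```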

There are more code changes needed for that to truly work well though, so it'd probably be best to offer different options in the meantime -- and for users that don't care about raw request data those options would still be useful.

andreas-kupries commented 3 years ago

> > The ability to bulk mirror the entire production/archive databases.

> Are you thinking raw request content, or metadata and story content, or some combination?

I was thinking more of metadata and story content. I am not sure what you mean by raw request content (and later, request data). This seems to be something internal to your code architecture? If that is about the communication between FH and the origin sites, or from users to FH, then I do not see that as something which should be shared. Especially not if this is about FH users.

> > Maybe rsyncable, or any other method for incremental transfer, to reduce the effort of keeping current after the initial full replication.

> The request table is log structured, so it's pretty easy to make incremental dumps, and in fact I do. I'm not sure how publicly I want to share those, but if you want to get in contact through Discord, or if you're on IRC somewhere, we can discuss it.

See above wrt request data; I am likely opposed to sharing such.

> For metadata and content it may make sense to just dump the changed rows since the last run.

> Ideally the parser could reconstruct the history of the fic metadata and content given an ordered series of requests, which is partially supported and what I'm working towards. Then just running that on the newest batch of requests would get someone up to date from any point, and I could periodically make metadata and content snapshots for people to start from if they don't want to process 60M requests or whatever it might be :)

> There are more code changes needed for that to truly work well though, so it'd probably be best to offer different options in the meantime -- and for users that don't care about raw request data those options would still be useful.

Please do not feel pressured to implement any of this now, or even soon. I really meant it when I used the long-term qualifier.

iridescent-beacon commented 3 years ago

> > Are you thinking raw request content, or metadata and story content, or some combination?

> I was thinking more of metadata and story content. I am not sure what you mean by raw request content (and later, request data). This seems to be something internal to your code architecture? If that is about the communication between FH and the origin sites, or from users to FH, then I do not see that as something which should be shared.

By raw request content (and request data) I meant from FH to the origin sites. Not something specific to FH; more a poor man's archive.org. Most of the devs in this space I've talked to have had zero interest in preparsed content -- you and the creator of fichub-cli are the only ones that have expressed interest in that, which is why I wanted to clarify. Likely because they have already sunk time into their own parsers. Having a shared cache hopefully reduces the load on origin sites, but there are certainly questions about even that.

> Especially not if this is about FH users.

Agreed; I don't think sharing anything about users of FH, beyond heavily aggregated data like "fic X was requested N times" (such as on the /popular page), is a good idea. There are far-off plans for opt-in features that make sense to be public -- such as recommendation lists -- but those would all be restricted to actually registered accounts that know what they're getting into.
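
For the aggregated data, I mean nothing finer grained than something along these lines (the `requests(fic_id, ...)` schema here is hypothetical):

```python
# The only request-derived data exposed would be aggregates like this;
# no user columns are involved at all. Schema is hypothetical.
import sqlite3

def popular_fics(conn: sqlite3.Connection, limit: int = 100) -> list:
    # "fic X was requested N times", as shown on the /popular page.
    return conn.execute(
        "SELECT fic_id, COUNT(*) AS n FROM requests "
        "GROUP BY fic_id ORDER BY n DESC LIMIT ?",
        (limit,),
    ).fetchall()
```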

> Please do not feel pressured to implement any of this now, or even soon. I really meant it when I used the long-term qualifier.

Thanks; I'll keep it in mind and will leave this issue open at least.

andreas-kupries commented 3 years ago

> > Are you thinking raw request content, or metadata and story content, or some combination?

> > I was thinking more of metadata and story content. I am not sure what you mean by raw request content (and later, request data). This seems to be something internal to your code architecture? If that is about the communication between FH and the origin sites, or from users to FH, then I do not see that as something which should be shared.

> By raw request content (and request data) I meant from FH to the origin sites. Not something specific to FH; more a poor man's archive.org.

Ah.

> Most of the devs in this space I've talked to have had zero interest in preparsed content -- you and the creator of fichub-cli are the only ones that have expressed interest in that, which is why I wanted to clarify. Likely because they have already sunk time into their own parsers.

Heh. And I am coming at this mostly from the point of view of a user who has not done any real parsing himself, and who simply wants a nice backup of the stories read in the past, and currently being read.

And then my packrat tendencies kicked in and said WIBNI (wouldn't it be nice) to save everything FH has, even if not read by me, and likely never read by me? Replicate to guard against loss. For another example of me being a packrat, see https://akupries.tclers.tk/hoard/ - that is about project sources (total in both areas is around 100G by now). Especially https://akupries.tclers.tk/hoard/self/store_1087.html :wink:

> Having a shared cache hopefully reduces the load on origin sites,

Also a good point.

> but there are certainly questions about even that.

Yep. At least FF makes money through ads. I am pretty sure that all the latest changes were less about system load and more about preventing users from bypassing having to see those ads.