Feature Request: Xenforo threadmark/index flexibility

JimmXinu / FanFicFare

FanFicFare is a tool for making eBooks from stories on fanfiction and other web sites.


Feature Request: Xenforo threadmark/index flexibility #409

Closed hseg closed 5 years ago

hseg commented 5 years ago

Consider https://forums.sufficientvelocity.com/threads/28074. It only has threadmarks for the actual story, while the first post contains links to the omake and fanart. So if I invoke fanficfare with both the thread's and the first post's IDs, I can get all relevant posts. It would be nice to get these in one ebook, though. I propose a new config option (name to be bikeshedded), crawl_heur: auto|threadmarks|index|both, which controls this behaviour. Cf. iarna/fetch-fic's ff get --and-scrape option.

JimmXinu commented 5 years ago

I've posted a test CLI version (I believe that's what you use?) that adds a new setting always_include_first_post_chapters for base_xenforoforum sites.

When set to true for a story, FFF will add the first post and all the chapter URLs it recognizes in the first post to the threadmarks it's collected.

Note that this may not work quite as you'd like--FFF only recognizes chapter URLs on the same site. So in the example given, the first couple URLs aren't included because they link to SB, not SV. Also, the first post is included twice since it's also threadmarked.
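The behaviour described above can be pictured roughly as follows. This is a minimal sketch, not FFF's actual code: merge_first_post_links, its arguments, the example URLs, and the ordering of the merged list are all assumptions made for illustration. It shows why cross-site links are dropped and why a threadmarked first post can appear twice.

```python
from urllib.parse import urlparse


def merge_first_post_links(story_url, first_post_url, first_post_links, threadmarks):
    """Hypothetical helper: first post, then its same-site links, then threadmarks."""
    site = urlparse(story_url).netloc
    # Only links on the story's own site are treated as chapter URLs.
    same_site = [u for u in first_post_links if urlparse(u).netloc == site]
    # No deduplication here: a threadmarked first post ends up in the list twice.
    return [first_post_url] + same_site + threadmarks


# Made-up URLs standing in for a thread, its first post, and its links.
story = "https://forums.example.com/threads/1/"
first = "https://forums.example.com/posts/10/"
links = ["https://other.example.org/posts/5/", "https://forums.example.com/posts/42/"]
marks = ["https://forums.example.com/posts/10/", "https://forums.example.com/posts/20/"]
chapters = merge_first_post_links(story, first, links, marks)
```

Here the link to other.example.org is skipped, and the first post is listed once on its own and once as a threadmark.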

I'm not 100% convinced this is worth adding, but you can try it out and let me know what you think.

Test version of CLI for pip install: You'll have to get it from the testpypi repository. This works for me on Debian:

pip install --extra-index-url https://testpypi.python.org/pypi --upgrade FanFicFare

hseg commented 5 years ago


Downloaded the tarball using pip, couldn't find the string always_include_first_post_chapters anywhere -- are you sure you committed the patch?

Indeed, I'm using the CLI interface. The fact that cross-site links won't work is unfortunate, but understandable. Maybe a setting that would allow adding supplemental chapters? Or if there were a way to decouple the metadata scraping/downloading/binding steps, I would be able to do this manually?

Another example to take into consideration is this one. It is an unordered collection of oneshots, some of which get expanded into multi-chapter stories. If I were to want to save these separately, a way to specify the chapters to select would be desirable.

Though having written this, I'm starting to suspect I'm abusing FFF...

Gesh

JimmXinu commented 5 years ago

Okay, I apparently uploaded the wrong branch. Sorry about that. It's on testpypi now.

As for decoupling chapter list and chapter download:

How chapter text is parsed from chapter pages varies by origin site. As currently architected, FFF expects all chapters to be from the same site. Depending on site, such as base_xenforoforum sites with Reader mode, and some sites that further paginate chapters, it can be a bit complex. And it's enough of a corner case that I don't really want to support it for the very few cases it would be useful.

There are already three ways to limit which chapters are downloaded:

In CLI, you can use -b and -e options for which chapters to begin and end with.
You can also similarly specify a chapter range as part of the story URL: https://www.fanfiction.net/s/2565609[1-2]
You can, in personal.ini, use the ignore_chapter_url_list setting. See defaults.ini.
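To make the chapter-range idea concrete, here is a small sketch of how a trailing [begin-end] suffix on a story URL could be split off and applied. It is not FFF's own parsing code: RANGE_RE, split_chapter_range and select_chapters are hypothetical names, and accepting an open-ended range (a missing begin or end) is an assumption of this sketch.

```python
import re

# Hypothetical pattern for a trailing "[begin-end]" chapter range on a story URL.
RANGE_RE = re.compile(r"^(?P<url>.*?)\[(?P<begin>\d*)-(?P<end>\d*)\]$")


def split_chapter_range(url):
    """Return (url, begin, end); begin/end are None when not given."""
    m = RANGE_RE.match(url)
    if not m:
        return url, None, None
    begin = int(m.group("begin")) if m.group("begin") else None
    end = int(m.group("end")) if m.group("end") else None
    return m.group("url"), begin, end


def select_chapters(chapters, begin=None, end=None):
    """Keep chapters begin..end, 1-indexed and inclusive."""
    start = (begin or 1) - 1
    return chapters[start:end]


url, begin, end = split_chapter_range("https://www.fanfiction.net/s/2565609[1-2]")
picked = select_chapters(["ch1", "ch2", "ch3"], begin, end)
```

Note that a range like this can only express one consecutive run of chapters.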

hseg commented 5 years ago

On Tue, Jul 9, 2019 at 11:26 PM Jim Miller notifications@github.com wrote:

> Okay, I apparently uploaded the wrong branch. Sorry about that. It's on testpypi now.

Will be busy next week, will post back once I get the chance to test it.

> As for decoupling chapter list and chapter download:
>
> How chapter text is parsed from chapter pages varies by origin site. As currently architected, FFF expects all chapters to be from the same site. Depending on site, such as base_xenforoforum sites with Reader mode, and some sites that further paginate chapters, it can be a bit complex. And it's enough of a corner case that I don't really want to support it for the very few cases it would be useful.

Hm. Your point makes sense, and I'm enough of an edge case to have low hopes of this feature being implemented anyway. However,

> There are already three ways to limit which chapters are downloaded:
>
> In CLI, you can use -b and -e options for which chapters to begin and end with.
> You can also similarly specify a chapter range as part of the story URL: https://www.fanfiction.net/s/2565609[1-2]
> You can, in personal.ini, use the ignore_chapter_url_list setting. See defaults.ini https://github.com/JimmXinu/FanFicFare/blob/master/fanficfare/defaults.ini#L308 .

These settings are all well and good, but cf. the example (on ff.net) that I sent: the stories there are spread over nonconsecutive chapters. Maybe a feature to list the chapters to be downloaded might help?

At the very least, if FFF were built like a library, with api

canonicise : URL → (extractor, story_id)
getMeta : (extractor, story_id) → meta
chapter : (extractor, story_id, chapter) → text
hasChapter : (epub, chapter) → Bool
bindBook : (meta, [text]) → epub

I should be able to make progress. But that point is moot already as I propose it - it's too big an ask and would be better served in a fork.
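For illustration only, the proposed split could look something like this in Python. Nothing here exists in FFF: Extractor, canonicise and bind_book simply mirror the names in the signatures above, the registry lookup by host name is an assumption, a plain string stands in for an epub, and hasChapter is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple
from urllib.parse import urlparse


@dataclass
class Extractor:
    """Hypothetical per-site extractor; both callables are stand-ins."""
    get_meta: Callable[[str], Dict[str, str]]
    get_chapter: Callable[[str, int], str]


def canonicise(url: str, registry: Dict[str, Extractor]) -> Tuple[Extractor, str]:
    """URL -> (extractor, story_id): pick by host, use the last path part as the id."""
    parts = urlparse(url)
    story_id = parts.path.rstrip("/").split("/")[-1]
    return registry[parts.netloc], story_id


def bind_book(meta: Dict[str, str], texts: List[str]) -> str:
    """(meta, [text]) -> book; a plain string stands in for an epub here."""
    return "\n\n".join([meta["title"]] + texts)


# A fake extractor that needs no network access.
fake = Extractor(
    get_meta=lambda story_id: {"title": f"Story {story_id}"},
    get_chapter=lambda story_id, n: f"Chapter {n} of {story_id}",
)
extractor, story_id = canonicise(
    "https://forums.example.com/threads/28074", {"forums.example.com": fake}
)
# Nonconsecutive chapters can be picked freely because fetching is its own step.
texts = [extractor.get_chapter(story_id, n) for n in (1, 3)]
book = bind_book(extractor.get_meta(story_id), texts)
```

With the steps separated like this, choosing which chapters to fetch becomes the caller's decision rather than a setting.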

Thanks for humouring this thread of thought, though.


JimmXinu commented 5 years ago

You could probably do what you're talking about with FFF as it is now if you're willing to write code. Before, I was only talking about user level features.

If you look at the code for the basic CLI the steps are:

Create an adapter, which is site specific
Use the adapter to get metadata, including chapter list
Create a writer and write the ebook

...With the additional complications of needing a configuration and optionally looking inside existing epubs on update.

You could look inside the story object and manipulate the chapter list between getting metadata and writing the ebook. Merging chapters from different adapters (different sites) into one story would require separate adapters and might require fetching the relevant chapters first, but it should be doable.
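A rough sketch of that idea, using stand-in classes rather than FFF's real adapters and story objects: FakeAdapter, merge_stories and the page contents are all hypothetical, and the dictionaries replace real network fetches. Each chapter is fetched by the adapter for its own site before the merged list is handed to whatever writes the book.

```python
class FakeAdapter:
    """Stand-in for a site-specific adapter; not FFF's real class."""

    def __init__(self, site, pages):
        self.site = site
        self.pages = pages  # chapter URL -> text, replaces real network fetches

    def get_chapter_list(self):
        # Corresponds to the "get metadata, including chapter list" step.
        return list(self.pages)

    def fetch(self, url):
        return self.pages[url]


def merge_stories(adapters, keep=lambda url: True):
    """Fetch each kept chapter with its own site's adapter, then merge in order."""
    merged = []
    for adapter in adapters:
        for url in adapter.get_chapter_list():
            if keep(url):
                merged.append((url, adapter.fetch(url)))
    return merged


site_a = FakeAdapter("a.example.com", {"a/1": "story part 1", "a/2": "story part 2"})
site_b = FakeAdapter("b.example.org", {"b/7": "omake", "b/8": "unrelated rec"})
# The chapter list is edited between listing and writing: drop one unwanted link.
merged = merge_stories([site_a, site_b], keep=lambda url: url != "b/8")
```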

The CLI has a developer option (--save-cache) to save fetched pages (for most sites) that is very useful for running the same tests over and over without slamming the site servers.

While I will answer questions and would be open to making minor changes to make use as an API easier, I'm not terribly interested in re-architecting and writing a ton of documentation to make FFF a 'proper' API.

hseg commented 5 years ago

On 12 July 2019 18:28:32 GMT+03:00, Jim Miller notifications@github.com wrote:

> You could probably do what you're talking about with FFF as it is now if you're willing to write code. Before, I was only talking about user level features.

OK, will give these a look. As I mentioned, will be busy in the near future, though.

> If you look at the code for the basic CLI the steps are:
>
> Create an adapter, which is site specific
> Use the adapter to get metadata, including chapter list
> Create a writer and write the ebook
>
> ...With the additional complications of needing a configuration and optionally looking inside existing epubs on update.
>
> You could look inside the story object and manipulate the chapter list between getting metadata and writing the ebook. Merging chapters from different adapters (different sites) into one story would require separate adapters and might require fetching the relevant chapters first, but it should be doable.
>
> The CLI has a developer option (--save-cache) to save fetched pages (for most sites) that is very useful for running the same tests over and over without slamming the site servers.

Thanks for the pointers.

> While I will answer questions and would be open to making minor changes to make use as an API easier, I'm not terribly interested in re-architecting and writing a ton of documentation to make FFF a 'proper' API.

Right, as I surmised by the end of my last message. I'll give the codebase a look, see if I can't hack something up. Worst case scenario, I'll have enough notes to start my own project - wouldn't want to saddle you with my technical burden.

hseg commented 5 years ago

OK, have time to test the fix... Again, can't find the string always_include_first_post_chapters in the tarball. Moreover, looking at the commit log (git log --all -Salways_include_first_post_chapters), I see no commit either introduced or removed that string. Are you sure you pushed it? Will try my own hand at implementing the index/download/bind split sometime later this week.

JimmXinu commented 5 years ago

cf6366dab411a1d43267fe6d42515c5768866a4f

I was trying to keep that in a separate branch in my dev environment, from which I uploaded a CLI test version for you. However, after several days with no response, I forgot about it and uploaded a newer test version from the master branch with other changes.

I've moved always_include_first_post_chapters into master and uploaded it with other changes.

hseg commented 5 years ago


Played around with the setting. It works, thanks! However, as you warned, the fact that cross-site link detection doesn't work is a bummer. It could be fixed by running FFF recursively on each supported link, but that A) would spuriously pick up fic recommendations and B) would probably be too expensive. I'm working on a different solution (basically adding debugging features to FFF that can be repurposed to serve as the index/scrape/bind separation I've floated several times), but that is a different ticket.

Also, the fact that the ordering follows the order of the links means in particular that omakes do not follow publication ordering as expected, which is a bit annoying.

Finally, as witnessed in https://forums.spacebattles.com/threads/455278, some fics move to threadmarks later in their life, but do not update the first post correspondingly. Some dedup logic would be nice.
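The kind of dedup logic meant here might look like the sketch below. It is only an illustration, not anything FFF does: chapter_key and dedup_chapters are hypothetical names, and treating two URLs as the same chapter when host and path match (ignoring a trailing slash and any #fragment) is an assumption. Real forum links can point at the same post in several different forms, so a real implementation would likely need to resolve them to post IDs first.

```python
from urllib.parse import urlparse


def chapter_key(url):
    """Hypothetical key: lowercased host plus path without a trailing slash."""
    parts = urlparse(url)
    return parts.netloc.lower(), parts.path.rstrip("/")


def dedup_chapters(urls):
    """Drop later duplicates while keeping each first occurrence in place."""
    seen = set()
    unique = []
    for url in urls:
        key = chapter_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


# Made-up URLs: the third entry is the first post again, linked with a fragment.
combined = [
    "https://forums.example.com/posts/10/",
    "https://forums.example.com/posts/42",
    "https://forums.example.com/posts/10#post-10",
]
unique = dedup_chapters(combined)
```

Because first occurrences keep their position, the order of the remaining chapters is unchanged.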

Still, you've invested more effort than I expected for such an edge case, and I thank you for it. Feel free to tell me to PoC||GTFO on this issue if it eats up too much time.

JimmXinu commented 5 years ago

I think we've reached the limit of what I'm interested in considering for this right now. I will include always_include_first_post_chapters in the next release, though.

hseg commented 5 years ago


As I've indicated, I agree that's reasonable. I'll try to build something myself that can deal with my edge case, then. Thanks for the patience so far.