Enhance QuestionableQuesting: Export all posts (including non-threadmark posts) to epub

xypha commented 2 months ago

Problem

Non-threadmark posts cannot be exported.

Steps to replicate:

Open 'Story Only' thread of With This Ring in browser tab.
Scenario 1: Run WebToEpub from toolbar icon → only 1 chapter is loaded. None of the threadmarks on the page are seen.
In the browser tab containing QQ post, click on "Threadmarks" button and select "View all 148 threadmarks" option.
Scenario 2: Run WebToEpub from toolbar icon → only 25 chapters are loaded.
In the browser tab containing QQ post, on the "Threadmarks" overlay, change the "Per page:" option to maximum (as of 2024.09.03, it is 400).
Scenario 3: Run WebToEpub from toolbar icon → all 148 chapters are loaded, but non-threadmark posts cannot be exported.


Describe the solution you'd like

Possible solution to Scenario 1 and 2:

In WebToEpub popup tab, add warning text (maybe above the Chapters Count), telling users to ensure all threadmarks are loaded before export.
Add a section to the Wiki on how to load all threadmarks into the chapter list on QQ and similar sites (and link to it from the warning text if the above solution is implemented).

Possible solution to Scenario 3:

Add advanced option to export all posts from all thread pages (in this case, using page navigation instead of threadmarks - pages 1 to 135, as of 2024.09.03).
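A minimal sketch of what walking a thread by page navigation could look like, assuming thread pages are addressed with a `page-N` suffix on the thread URL and that page 1 is the bare thread URL. The helper name `buildThreadPageUrls` is hypothetical and is not part of WebToEpub.

```javascript
// Hypothetical helper (not part of WebToEpub): list the URL of every page
// in a thread. Assumes pages 2..N use a "page-N" suffix and page 1 is the
// bare thread URL.
function buildThreadPageUrls(threadUrl, lastPage) {
    if (!Number.isInteger(lastPage) || lastPage < 1) {
        throw new RangeError("lastPage must be a positive integer");
    }
    // Normalize so the base ends with exactly one "/".
    const base = threadUrl.replace(/\/*$/, "/");
    const pageUrls = [base];
    for (let page = 2; page <= lastPage; ++page) {
        pageUrls.push(`${base}page-${page}`);
    }
    return pageUrls;
}

const threadPages = buildThreadPageUrls(
    "https://forum.questionablequesting.com/threads/with-this-ring-young-justice-si-story-only.8961/",
    135
);
console.log(threadPages.length); // 135
console.log(threadPages[52]);    // URL ending in "/page-53"
```

The last page number has to be supplied by the caller here; in practice it would have to be read from the thread's page navigation.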

Describe alternatives you've considered

Exporting through FicHub (also on GitHub) solves Scenario 1 and 2 - but fails to export images (which is a deal breaker). For Scenario 3, the problem persists. Non-threadmark posts cannot be exported.

Additional context

Current version: 0.0.0.167
Browser: Firefox 129.0.2 (64-bit)
OS: Windows 11 23H2

xypha commented 2 months ago

Malware alert - file was not opened 2024-09-03 @ 11:15:50

gamebeaker commented 2 months ago

@xypha lucky you...

Kiradien commented 2 months ago

@gamebeaker At this point we can be fairly confident that zawa999 is violating GitHub's Terms of Service - before deleting comments in the future, it's probably worth reporting their content for a chance at a full IP ban. I'd do it, but you seem to find & delete them before I see them xD

As for the mentioned issue, I've actually been thinking of something similar for all Xenforo forums, mostly from the perspective of bulk threadmark download through "Reader Mode"; however, it would work similarly for this case as well. One issue: it runs into a few faults, mostly with WebToEpub's indexing logic. For example, each chapter link is pre-defined before generation begins, which would be impossible under this paging structure. That can theoretically be worked around, but even if it can, it won't work exactly the same as it does for other sites.

I'll look at a potential solution on this but if one is possible, it will likely require configuration in [WebToEpub > Advanced Options > Manually Select Parser] to differentiate it from the standard, unless someone has a better idea for handling this case.

dteviot commented 2 months ago

@Kiradien Some notes

I don't understand Scenario 3

As regards Scenario 2, Am I missing something? WebToEpub could be made to detect there's multiple ToC pages, and fetch them. URL for each page seems to be like: https://forum.questionablequesting.com/threads/with-this-ring-young-justice-si-story-only.8961/threadmarks?per_page=25&page=4
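A minimal sketch of how the ToC page URLs could be enumerated, assuming the `threadmarks?per_page=…&page=…` pattern shown in the URL above and that the total threadmark count is already known (148 in this thread). The helper name `buildThreadmarkPageUrls` is hypothetical and is not part of WebToEpub.

```javascript
// Hypothetical helper (not part of WebToEpub): build the URL of every
// threadmarks index page from the total threadmark count and page size.
function buildThreadmarkPageUrls(threadUrl, totalThreadmarks, perPage = 25) {
    const pageCount = Math.max(1, Math.ceil(totalThreadmarks / perPage));
    // Normalize so "threadmarks" resolves below the thread, not beside it.
    const base = threadUrl.replace(/\/*$/, "/");
    const tocUrls = [];
    for (let page = 1; page <= pageCount; ++page) {
        const url = new URL("threadmarks", base);
        url.searchParams.set("per_page", String(perPage));
        url.searchParams.set("page", String(page));
        tocUrls.push(url.href);
    }
    return tocUrls;
}

const tocPages = buildThreadmarkPageUrls(
    "https://forum.questionablequesting.com/threads/with-this-ring-young-justice-si-story-only.8961/",
    148
);
console.log(tocPages.length); // 6, since ceil(148 / 25) = 6
console.log(tocPages[3]);     // the "per_page=25&page=4" URL
```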

Kiradien commented 2 months ago

> I don't understand Scenario 3
>
> As regards Scenario 2, Am I missing something? WebToEpub could be made to detect there's multiple ToC pages, and fetch them. URL for each page seems to be like: https://forum.questionablequesting.com/threads/with-this-ring-young-justice-si-story-only.8961/threadmarks?per_page=25&page=4

Yeah, this enhancement is entirely edge-cases; I understand why you're confused, it's also why I will not add these fixes to the main parser. A number of things are happening here, but it's mostly just that the author didn't threadmark his chapters. This is not a failing of WebToEpub's current design for Xenforo, but a general work-around that is actually useful in other cases.

The UI is also different for this archive page; normally paging isn't really needed for threadmarks... it's a really odd edge case.

Some notes of my own: I wouldn't normally consider this type of enhancement; it's only because "Reader Mode" allows retrieval of multiple chapters simultaneously that I'm working on it... It can be handy to download these books a bit quicker with less strain on the server side. It's also a fair bit of fun to dig into elements I don't usually touch.

xypha commented 2 months ago

@Kiradien To clarify further on Scenario 3: my intention was to suggest exporting non-chapter posts and comments... sometimes, reading non-threadmark posts (i.e., user comments, speculation/theory crafting and author's responses) is helpful or just plain fun. An option to export all posts in a thread to epub for easy reading would be nice.

Kiradien commented 2 months ago

@xypha No worries, that is actually what I'm working on. Just taking time since I'm poking around elements I don't usually touch in my free time. It might end up being a bit buggy on chapter titles (Since the title is usually pulled from the 'threadmark'), but the goal should be feasible... Just a bit slower to release than most patches I work on.

My comments about 'Reader Mode' are simply because that is what I will personally use it to export; there's no intent to make it exclusive to that.

Jemeni11 commented 2 months ago

> Exporting through FicHub (also on GitHub) solves Scenario 1 and 2 - but fails to export images (which is a deal breaker). For Scenario 3, the problem persists. Non-threadmark posts cannot be exported.

Hi. I made a CLI tool for adding images to FicHub here. You'll have to install Python to use it though.

Kiradien commented 2 months ago

Sorry for the delay on this; I was working on it on and off and was a little too intent on a 'perfect' solution. PR uploaded with a working solution. It's not the perfect solution I wanted (all posts on each QQ 'page' are correlated to a single chapter), but it does the job.

I'll push the PR through once the issues are resolved

Trying to make each post a chapter with the current setup of web2epub is a bit too much of a nightmare.

In order to use the new parser, you need to open up advanced options and select the "Xenforo Batch Post Parser" under manual parsers.
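To illustrate the one-chapter-per-page mapping described above, here is a rough sketch. It is not the code from the PR: the post shape (`page`, `author`, `html`) and the function name `groupPostsIntoPageChapters` are assumptions made up for this example.

```javascript
// Illustration only, not the PR's implementation. Each post is assumed to
// be an object like { page: 2, author: "someone", html: "<p>...</p>" }.
function groupPostsIntoPageChapters(posts) {
    const byPage = new Map();
    for (const post of posts) {
        if (!byPage.has(post.page)) {
            // No threadmark is available, so the title falls back to the page number.
            byPage.set(post.page, { title: `Page ${post.page}`, posts: [] });
        }
        byPage.get(post.page).posts.push(post);
    }
    // Return chapters in page order, whatever order the posts arrived in.
    return [...byPage.entries()]
        .sort(([pageA], [pageB]) => pageA - pageB)
        .map(([, chapter]) => chapter);
}

const sampleChapters = groupPostsIntoPageChapters([
    { page: 2, author: "reader", html: "<p>Speculation</p>" },
    { page: 1, author: "author", html: "<p>Chapter text</p>" },
    { page: 1, author: "reader", html: "<p>Comment</p>" },
]);
console.log(sampleChapters.map(c => `${c.title}: ${c.posts.length} post(s)`));
```

Grouping by page means the chapter list can be fixed from the page count alone, before any post has been fetched.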

gamebeaker commented 2 months ago

Test versions for Firefox and Chrome have been uploaded to https://github.com/dteviot/WebToEpub/releases/tag/developer-build. Pick the one suitable for you, follow the "How to install from Source (for people who are not developers)" instructions at https://github.com/dteviot/WebToEpub/tree/ExperimentalTabMode#user-content-how-to-install-from-source-for-people-who-are-not-developers and let me know how it goes.

xypha commented 2 months ago

@Kiradien This works for With This Ring.

Thank you!

Saw a bunch of errors - mostly about fetching images, but also others.

No complaints though. THANK YOU! this is what I wanted.

Just going to share the errors here in case they might be relevant.

Several other errors appeared once the epub was downloaded; see the attached text file (too long to post directly in the comment):

errors.txt

403 errors (3 in total, each for a different domain), where I had to click "skip" to complete the epub download.

Example :

WARNING: Site '1.bp.blogspot.com' has sent an Access Denied (403) error.
You may need to logon to site, or browse site normally
until you get a Cloudflare "Are you a human" page or satisfy some other CAPTCHA
before WebToEpub can continue.
Fetch of image 'http://1.bp.blogspot.com/_M7D1hE_0cz0/S9GqWbJ-0pI/AAAAAAAADLk/AUuEqBBzDCE/s1600/GL4602.jpg' for page 'https://forum.questionablequesting.com/threads/with-this-ring-young-justice-si-story-only.8961/page-53' failed with network error 403. This is an intermittent error. If you retry in a few minutes, it may succeed.
promptUserForRetry@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/HttpClient.js:57:19
onResponseError@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/HttpClient.js:48:25
checkResponseAndGetData@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/HttpClient.js:207:45
wrapFetchImpl@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/HttpClient.js:197:31
async*retryFetch@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/HttpClient.js:77:27
async*onResponseError@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/HttpClient.js:40:25
checkResponseAndGetData@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/HttpClient.js:207:45
wrapFetchImpl@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/HttpClient.js:197:31
async*wrapFetch@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/HttpClient.js:157:27
fetchImage@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/ImageCollector.js:335:40
fetchImages@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/ImageCollector.js:108:28
async*fetchImagesUsedInDocument/<@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/Parser.js:545:44
promise callback*fetchImagesUsedInDocument@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/Parser.js:543:14
fetchWebPageContent/<@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/Parser.js:528:31
promise callback*fetchWebPageContent@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/Parser.js:518:59
async*fetchWebPages/<@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/Parser.js:491:69
fetchWebPages@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/Parser.js:491:41
async*fetchContent@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/Parser.js:463:21
fetchContentAndPackEpub@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/main.js:153:16
EventHandlerNonNull*addEventHandlers@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/main.js:464:9
window.onload@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/main.js:584:13
EventHandlerNonNull*main<@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/main.js:579:5
@moz-extension://9c47d7d8-1255-4f03-beca-5faaf67f2e8b/js/main.js:598:3

dteviot commented 2 months ago

@xypha

I had a quick skim through them. All I saw was WebToEpub reporting that it was unable to retrieve an image. (So you know it won't be in the epub, and it's not WebToEpub's fault.)

e.g.

http://static.comicvine.com seems to be down/gone.
404 errors speak for themselves.
etc.
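For anyone wanting to skim a long error file the same way, here is a small sketch that tallies failed image fetches by host and status code. It assumes every failure line follows the "Fetch of image '…' for page '…' failed with network error NNN" wording seen in the example earlier in this thread; other message formats in errors.txt would be skipped. The function name `summarizeImageFailures` is hypothetical.

```javascript
// Assumed message format, taken from the single example in this thread.
const FETCH_FAILURE =
    /Fetch of image '([^']+)' for page '([^']+)' failed with network error (\d+)/;

// Hypothetical helper: count failures per "host status" pair.
function summarizeImageFailures(logText) {
    const counts = new Map();
    for (const line of logText.split(/\r?\n/)) {
        const match = FETCH_FAILURE.exec(line);
        if (match === null) {
            continue; // stack-trace lines and other messages are ignored
        }
        let host;
        try {
            host = new URL(match[1]).hostname;
        } catch (e) {
            host = "(unparseable URL)";
        }
        const key = `${host} ${match[3]}`;
        counts.set(key, (counts.get(key) || 0) + 1);
    }
    return counts;
}

const sampleLog = [
    "Fetch of image 'http://1.bp.blogspot.com/a/GL4602.jpg' for page 'https://forum.questionablequesting.com/threads/x.8961/page-53' failed with network error 403. This is an intermittent error.",
    "promptUserForRetry@moz-extension://example/js/HttpClient.js:57:19",
].join("\n");
console.log([...summarizeImageFailures(sampleLog)]);
```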

dteviot commented 2 weeks ago

@xypha

Updated version (1.0.1.0) has been submitted to Firefox and Chrome stores. Firefox version is available now. Chrome might be available in a few hours (typical) to 21 days.

My thanks again to @Kiradien for his hard work

xypha commented 2 weeks ago

Thank you!

dteviot / WebToEpub

Enhance QuestionableQuesting: Export all posts (including non-threadmark posts) to epub #1454