Capture from a forum thread

earl07 commented 3 years ago

Describe the bug: I want each post by the thread starter in a forum thread to become a page in the ebook. But when I use the new site adder and there are two or more posts by the author to capture, only the first post is saved.

To Reproduce: Steps to reproduce the behavior:

Go to 'https://www.semprot.com/threads/cheng-hoa-kiam-cerita-lepas-asmaraman-s-kho-ping-hoo.1122433/'
Click on the WebToEpub addon icon
Fill in the form:
hostname: semprot.com
URL of first chapter (required): https://www.semprot.com/threads/cheng-hoa-kiam-cerita-lepas-asmaraman-s-kho-ping-hoo.1122433/
CSS selector for element holding content to put into EPUB: .myLiTS .bbWrapper
CSS selector for element holding Title of Chapter: .p-title-value
CSS selector for element(s) to remove: (left empty)
Click test page
Only the first post is shown. The second, third, and fourth posts by the same thread author are not included.

Expected behavior: All of the posts by the same thread maker are captured.

Desktop (please complete the following information):

OS: Windows 10
Browser: Chrome
Version: 91.0.4472.124 (Official Build) (64-bit)

Additional context: Also, if possible, can we automatically have all of the thread's pages available as the list of links to be used in the next phase?

dteviot commented 3 years ago

If you set the "CSS selector for element holding content" to "div.block-container", you should get all posts, and setting "CSS selector for element(s) to remove:" to ".message-cell-user" should remove some of the crap. Note that slicing up a thread isn't really what WebToEpub is designed for.

I'm not quite sure what you mean by

"Also, if possible, can we automatically have all of the thread's pages available as the list of links to be used in the next phase"

earl07 commented 3 years ago

After some tweaking of the remove options, I got what I wanted. Thanks a lot! I hadn't thought of using the remove options before.

By the links for all of the thread pages, I mean the case where there are many pages in the thread. Right now we need to add them manually through "Edit Chapter URLs" after clicking Apply.

dteviot commented 3 years ago

@earl07

"if there are many pages in the thread"

OK, I'm going to assume you mean a thread that spans multiple web pages. In which case, how do you obtain the URLs of all the pages? Please note, I can't read the language this web site is in (Indonesian maybe?)

[edit] This seems to be a thread that spans multiple pages https://www.semprot.com/threads/holy-grail-war-update-chapter-31-galuh-vs-feni.1352983/

earl07 commented 3 years ago

That's right. When I try to capture from https://www.semprot.com/threads/holy-grail-war-update-chapter-31-galuh-vs-feni.1352983/ I need to tweak the options as follows:

Hostname: www.semprot.com
URL of first chapter: https://www.semprot.com/threads/holy-grail-war-update-chapter-31-galuh-vs-feni.1352983/
css content: .p-body-inner
css title: .p-title-value
css remove: .message-cell--user, .message-attribution--split, .is-active, .actionBar, .message-lastEdit, .semprot_wizard, .semprotnenenmontok_2, .semprotnenenmontok_sq, .block-outer, .pollResult, .block-header, .listInline--bullet, .p-breadcrumbs, .notice-content, .warning, .semprotnenenmontok_w50, .semprotnenenmontok_w100, .js-quickReply .block-container, #semprotnenenmontok_f img, .shareButtons-label, .shareButtons-buttons, .structItem-cell--latest, .structItem-cell--meta, .block-header, .structItem-cell--main, .structItem-cell--icon, .myLiNotTS .bbWrapper

I needed these because the earlier options only captured the poll section. I also added .myLiNotTS .bbWrapper to the remove list to drop posts by anyone other than the thread starter (non-TS posts), as they don't add to the story.

Then, after I click "Apply", I need to click "Edit Chapter URLs" and manually change the URLs to pages 1 to 44 of the thread. Finally, I click "Pack EPUB".

After that, I get all of the pages of the thread as ebook pages.

The problems I still face:

Bloated, unnecessary HTML that adds nothing to the text ends up in the EPUB, like <noscript><div class="blockMessage blockMessage--important blockMessage--iconic u-noJsOnly">JavaScript is disabled. For a better experience, please enable JavaScript in your browser before proceeding.</div></noscript> <div class="blockMessage blockMessage--important blockMessage--iconic js-browserWarning" style="display: none">You are using an out of date browser. It may not display this or other websites correctly.<br />You should upgrade or use an <a href="https://www.google.com/chrome/" target="_blank" rel="noopener">alternative browser</a>.</div>
Empty ebook pages when a page has no posts by the thread starter, like page 44 in this case. (This could be a huge problem if the thread starter posts only rarely and the thread grows very long (more than 100 pages) because it is popular, like this: link.)

Other than that, it's pretty good.

I'm just wondering if:

Adding the chapter URLs could be done automatically instead of manually, since the links from the first page to the last already exist on the first page we look at [or maybe add a feature to "Edit Chapter URLs" that takes a wildcard in the URL and replaces the wildcard with sequential numbers, with the start and end specified by the user].
Pages with no posts by the thread starter could be skipped, like page 44 in this case.
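
The wildcard idea above could look roughly like the sketch below. This is only an illustration, not WebToEpub code: the function name expandWildcardUrl and the "*" placeholder are made up for this example, and the "page-N" URL suffix is assumed from the way this forum numbers its thread pages.

```javascript
// Hypothetical helper (not part of WebToEpub): expand a URL pattern that
// contains a "*" wildcard into one URL per page number from first to last.
function expandWildcardUrl(pattern, first, last) {
    if (!pattern.includes("*")) {
        throw new Error("Pattern must contain a '*' wildcard");
    }
    if (!Number.isInteger(first) || !Number.isInteger(last) || last < first) {
        throw new Error("first and last must be integers with first <= last");
    }
    const urls = [];
    for (let page = first; page <= last; ++page) {
        // Only the first "*" is replaced; the rest of the URL is untouched.
        urls.push(pattern.replace("*", String(page)));
    }
    return urls;
}

// Assumed pattern: the thread URL followed by "page-*".
const pages = expandWildcardUrl(
    "https://www.semprot.com/threads/holy-grail-war-update-chapter-31-galuh-vs-feni.1352983/page-*",
    1, 44);
console.log(pages.length); // 44
console.log(pages[1]);
```

The resulting list could then be pasted into "Edit Chapter URLs" instead of typing each page by hand.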

dteviot commented 3 years ago

@earl07

"bloated unnecessary html"

Add "noscript" and "div.js-browserWarning" to css remove.

"Empty ebook page"

That's tricky. WebToEpub expects the URLs you supply to have content, so the fix is to not include the empty pages in the list of URLs to fetch. Of course, until you look at a page, you don't know whether it's empty.

But I'll see what I can do.
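
One possible way to decide which pages to leave out, once their HTML has been fetched, is sketched below. It is a rough illustration and not how WebToEpub works internally: the function names are hypothetical, the sample HTML is invented, and the only thing taken from this thread is that thread starter posts carry the "myLiTS" class. Inside the extension a DOM query such as querySelector(".myLiTS .bbWrapper") would be the more natural check; a regular expression is used here only so the sketch runs outside a browser.

```javascript
// Hypothetical helpers for illustration; not part of WebToEpub.

// Crude check: does any class attribute in the HTML list the "myLiTS" class?
function hasThreadStarterPost(html) {
    const classAttr = /class\s*=\s*"([^"]*)"/g;
    for (const match of html.matchAll(classAttr)) {
        if (match[1].split(/\s+/).includes("myLiTS")) {
            return true;
        }
    }
    return false;
}

// Split fetched pages into URLs worth keeping and URLs to skip.
function splitPagesByContent(pages) {
    const keep = [];
    const skipped = [];
    for (const page of pages) {
        (hasThreadStarterPost(page.html) ? keep : skipped).push(page.url);
    }
    return { keep, skipped };
}

// Invented sample data standing in for two fetched thread pages.
const result = splitPagesByContent([
    { url: "https://example.com/thread.1/page-1",
      html: '<article class="message myLiTS"><div class="bbWrapper">Story</div></article>' },
    { url: "https://example.com/thread.1/page-2",
      html: '<article class="message myLiNotTS"><div class="bbWrapper">Reply</div></article>' }
]);
console.log(result.keep);
console.log(result.skipped);
```

The catch described above still applies: every page has to be fetched once before it can be classified.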

dteviot commented 3 years ago

@earl07 Actually, this might be useful to you: https://github.com/dteviot/EpubEditor Or not; it's badly documented and assumes you've got some developer skills.

dteviot commented 3 years ago

@earl07 Test versions for Firefox and Chrome have been uploaded to https://drive.google.com/drive/folders/1B_X2WcsaI_eg9yA-5bHJb8VeTZGKExl8?usp=sharing. Follow the "How to install from Source" instructions at https://github.com/dteviot/WebToEpub#how-to-install-from-source and let me know how it goes.

For my notes, this was 90 minutes' work.

earl07 commented 3 years ago

Reporting:

The automatic link generation is working great.
An ad image is shown in the EPUB. I removed it by adding .semprotnenenmontok_sq to removeUnwantedElementsFromContentElement in semprotparser.js.
The title is not captured, I don't know why.

There is a problem when a page (in this case page 2) has no posts by the author: WebToEpub stops fetching and shows this error message:

Error: Could not find content element for web page 'https://www.semprot.com/threads/pesantren-series.1320255/page-2'.
at chrome-extension://oibncfaefaeddnebekohlkccbejblpeg/js/Parser.js:493:23
at async Promise.all (index 0)
at async SemprotParser.fetchWebPages (chrome-extension://oibncfaefaeddnebekohlkccbejblpeg/js/Parser.js:462:17)

the example link: link

dteviot commented 3 years ago

@earl07

"There is a problem when a page (in this case page 2) has no posts by the author: WebToEpub stops fetching"

Under the advanced options, check the "Skip chapters that return HTTP 404 error" box. Then WebToEpub will insert a placeholder page for those with no content, and you'll get a list of them. You can then use something like Calibre to remove the "no contents" pages.

"The title is not captured, I don't know why."

More details please. For the link you gave, WebToEpub shows "CERBUNG Pesantren Series" in the dialog's title field.

dteviot commented 3 years ago

@earl07 Updated version (0.0.0.134) has been submitted to Firefox and Chrome stores. Firefox version is available now. Chrome might be available in 1 to 3 weeks.

earl07 commented 3 years ago

Sorry for the late response.

After checking the "Skip chapters that return HTTP 404 error" box, it's working great now.

The missing title was just my misunderstanding: before, it was included on the first page, but now it is used as the EPUB file name.

I haven't checked the update yet though, because I use Chrome most of the time.

Anyway, I have two related questions:

Is it possible to automatically remove the unwanted pages in Calibre? I have only just started using it and don't really know much about it.
Is it possible to solve the problem of fetching the higher-resolution images from imagebam? For example, this error:

Attempt to fetch high resolution version of image from 'https://www.imagebam.com/view/ME16P4K' failed. Using lower resolution image instead.

Thanks. You rock man!

dteviot commented 3 years ago

@earl07

"Is it possible to automatically remove the unwanted pages in Calibre?"

You'd probably need a script/plug-in to do that. Unfortunately, I don't know Calibre well enough to do that.
You'd be better off asking around the Calibre forums.

"Is it possible to solve the fetching problem for the higher resolution images"

Probably not. When WebToEpub encounters a hyperlink tag with an image in it, it assumes the image is a thumbnail and the hyperlink points to the full-size image. The error message means that when it follows the hyperlink, what comes back isn't an image. So, in most cases when this message appears, there isn't a high-res image. (It's more of an informational statement than an error.)
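
The fallback described above can be pictured with the small sketch below. It is not WebToEpub's actual code: the function names are hypothetical, the thumbnail URL is invented, and it assumes the decision is made from the Content-Type that comes back when the hyperlink is followed.

```javascript
// Hypothetical helpers for illustration; these are not WebToEpub functions.

// True when an HTTP Content-Type header value describes an image.
function isImageContentType(contentType) {
    return typeof contentType === "string"
        && contentType.trim().toLowerCase().startsWith("image/");
}

// Choose which URL to pack for a thumbnail <img> wrapped in an <a>.
// linkContentType is the Content-Type returned when the link was followed.
function pickImageUrl(linkHref, thumbnailSrc, linkContentType) {
    if (isImageContentType(linkContentType)) {
        return { url: linkHref, usedHighRes: true };
    }
    // The link led to something else (e.g. an HTML viewer page),
    // so keep the lower resolution thumbnail.
    return { url: thumbnailSrc, usedHighRes: false };
}

const viewerPage = pickImageUrl(
    "https://www.imagebam.com/view/ME16P4K",   // link from the message in this thread
    "https://example.com/thumb.jpg",           // invented thumbnail URL
    "text/html; charset=utf-8");               // assumed response type
console.log(viewerPage.usedHighRes); // false
```

Under these assumptions, a link that returns an HTML viewer page rather than image data falls through to the thumbnail, which matches the "Using lower resolution image instead" message.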

dteviot / WebToEpub

Capture from a forum thread #556