Closed earl07 closed 3 years ago
If you set the "CSS selector for element holding content" to "div.block-container", you should get all posts, and setting "CSS selector for element(s) to remove:" to ".message-cell-user" should remove some of the crap. Note that slicing up a thread isn't really what WebToEpub is designed for.
I'm not quite sure what you mean by
And also if possible can we automatically have all of the thread page available to be the list of link to be used in the next phase
After some tweaking of the remove options, I got what I wanted. Thanks a lot! I had not thought of using the remove options before.
By the links for all of the thread pages, I mean the case where the thread has many pages. At the moment we have to add them manually via "Edit Chapter URLs" after clicking "Apply".
@earl07
if there are many pages in the thread
OK, I'm going to assume you mean a thread that spans multiple web pages. In which case, how do you obtain the URLs of all the pages? Please note, I can't read the language this web site is in (Indonesian maybe?)
[edit] This seems to be a thread that spans multiple pages https://www.semprot.com/threads/holy-grail-war-update-chapter-31-galuh-vs-feni.1352983/
That's right. When I try to capture from https://www.semprot.com/threads/holy-grail-war-update-chapter-31-galuh-vs-feni.1352983/ I need to tweak the options as follows:
Hostname: www.semprot.com
URL of first chapter: https://www.semprot.com/threads/holy-grail-war-update-chapter-31-galuh-vs-feni.1352983/
css content: .p-body-inner
css title: .p-title-value
css remove: .message-cell--user , .message-attribution--split , .is-active , .actionBar , .message-lastEdit , .semprot_wizard , .semprotnenenmontok_2 , .semprotnenenmontok_sq , .block-outer , .pollResult , .block-header , .listInline--bullet , .p-breadcrumbs , .notice-content , .warning , .semprotnenenmontok_w50 , .semprotnenenmontok_w100, .js-quickReply .block-container, #semprotnenenmontok_f img , .shareButtons-label , .shareButtons-buttons , .structItem-cell--latest , .structItem-cell--meta , .block-header , .structItem-cell--main , .structItem-cell--icon, .myLiNotTS .bbWrapper
The earlier options only captured the poll section.
I also added .myLiNotTS .bbWrapper to remove posts that are not by the Thread Starter (non-TS posts), as they don't add to the story.
Then, after I click "Apply", I need to click "Edit Chapter URLs" and manually change the URLs to pages 1 to 44 of the thread. Finally, I click "Pack EPUB".
After that, I get every page of the thread as a page in the ebook.
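The manual URL editing described above could be sketched as a small script that prints the list to paste into "Edit Chapter URLs". This is only a sketch: `buildThreadPageUrls` is a hypothetical helper, not part of WebToEpub, and it assumes that page 1 is the bare thread URL and that every later page is the thread URL followed by `page-N` (the pattern seen in URLs such as `.../page-2`).

```javascript
// Hypothetical helper, not part of WebToEpub.
// Assumption: page 1 is the bare thread URL, and page N (N >= 2)
// is the thread URL followed by "page-N".
function buildThreadPageUrls(threadUrl, pageCount) {
    if (!Number.isInteger(pageCount) || pageCount < 1) {
        throw new RangeError("pageCount must be a positive integer");
    }
    // Normalise the base so it ends with exactly one slash.
    const base = threadUrl.replace(/\/+$/, "") + "/";
    const urls = [base];
    for (let page = 2; page <= pageCount; ++page) {
        urls.push(`${base}page-${page}`);
    }
    return urls;
}

// The page count (44 here) still has to be read off the thread by hand.
const urls = buildThreadPageUrls(
    "https://www.semprot.com/threads/holy-grail-war-update-chapter-31-galuh-vs-feni.1352983/",
    44
);
console.log(urls.join("\n"));
```

The printed list has one URL per line, so it can be pasted over the existing entries in one go.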
The problems that I still face:
Bloated, unnecessary HTML that adds nothing to the text ends up in the EPUB, like
<noscript><div class="blockMessage blockMessage--important blockMessage--iconic u-noJsOnly">JavaScript is disabled. For a better experience, please enable JavaScript in your browser before proceeding.</div></noscript>
<div class="blockMessage blockMessage--important blockMessage--iconic js-browserWarning" style="display: none">You are using an out of date browser. It may not display this or other websites correctly.<br />You should upgrade or use an <a href="https://www.google.com/chrome/" target="_blank" rel="noopener">alternative browser</a>.</div>
Empty ebook pages where there is no post by the thread starter, like page 44 in this case. (This could be a huge problem if the Thread Starter only posts rarely and a popular thread grows very long (more than 100 pages), like this: link.)
Other than that, it's pretty good.
I was just wondering if:
@earl07
bloated unnecessary html
Add "noscript" and "div.js-browserWarning" to css remove
Empty ebook page
That's tricky. WebToEpub expects the URLs you supply to have content. So, don't include the empty pages in the list of URLs to fetch. Of course, until you look at the page, you don't know.
But I'll see what I can do.
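For the "css remove" additions suggested above, the effect of a remove list can be pictured roughly as follows. This is a sketch under assumptions, not WebToEpub's actual implementation: `removeUnwanted` is a hypothetical function, and `content` stands for whatever DOM element the content selector matched.

```javascript
// Hypothetical sketch, not WebToEpub's actual code.
// `content` is assumed to be a DOM element (for example, the element
// matched by the "css content" selector); `selectors` is an array of
// CSS selectors for elements to drop.
function removeUnwanted(content, selectors) {
    // A comma-separated selector list matches any of the selectors.
    const selectorList = selectors.join(", ");
    // querySelectorAll returns a static list, so removing elements
    // while iterating over it is safe.
    for (const element of content.querySelectorAll(selectorList)) {
        element.remove();
    }
    return content;
}
```

In a browser console this could be tried as `removeUnwanted(document.querySelector(".p-body-inner"), ["noscript", "div.js-browserWarning"])`.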
@earl07 Actually, this might be useful to you. https://github.com/dteviot/EpubEditor Or not, it's badly documented and assumes you've got some developer skills.
@earl07 Test versions for Firefox and Chrome have been uploaded to https://drive.google.com/drive/folders/1B_X2WcsaI_eg9yA-5bHJb8VeTZGKExl8?usp=sharing. Follow the "How to install from Source" instructions at https://github.com/dteviot/WebToEpub#how-to-install-from-source and let me know how it goes.
For my notes, this was 90 minutes work
Reporting:
.semprotnenenmontok_sq
to the removeUnwantedElementsFromContentElement in semprotparser.js.
Error: Could not find content element for web page 'https://www.semprot.com/threads/pesantren-series.1320255/page-2'.
at chrome-extension://oibncfaefaeddnebekohlkccbejblpeg/js/Parser.js:493:23
at async Promise.all (index 0)
at async SemprotParser.fetchWebPages (chrome-extension://oibncfaefaeddnebekohlkccbejblpeg/js/Parser.js:462:17)
the example link: link
@earl07
There is a problem: if a page (in this case page 2) has no posts by the author, WebToEpub stops fetching
Under the advanced options, check the "Skip chapters that return HTTP 404 error" box. Then WebToEpub will insert a placeholder page for those with no content, and you'll get a list of them. You can then use something like Calibre to remove the "no contents" pages.
The title is not captured, I don't know why.
More details please. For the link you gave, WebToEpub shows "CERBUNG Pesantren Series" in the dialog's title field.
@earl07 Updated version (0.0.0.134) has been submitted to Firefox and Chrome stores. Firefox version is available now. Chrome might be available in 1 to 3 weeks.
Sorry for the late response.
After checking the "Skip chapters that return HTTP 404 error" box, it's working great now.
The missing title was just a misunderstanding on my part: before, it was included in the first page, but now it is used as the EPUB file name.
I have not checked the update yet, though, because I use Chrome most of the time.
Anyway, I have two related questions:
Attempt to fetch high resolution version of image from 'https://www.imagebam.com/view/ME16P4K' failed. Using lower resolution image instead.
Thanks. You rock man!
@earl07
is it possible to automatically remove the unwanted pages in calibre?
You'd probably need a script/plug-in to do that. Unfortunately, I don't know Calibre well enough to do that.
You'd be better asking around the Calibre forums.
Is it possible to solve the fetching problem for the higher resolutions image
Probably not. When WebToEpub encounters a hyperlink tag with an image in it, it assumes the image is a thumbnail and the hyperlink points to the full-size image. The error message means that when it follows the hyperlink, what comes back isn't an image. So, in most cases when this message appears, it means there isn't a high-res image. (It's more of an informational statement than an error.)
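The fallback described above can be sketched like this. It is an illustration only, not WebToEpub's actual code: `chooseImageSource` is a hypothetical function, and it assumes the decision is made from the Content-Type header of the response to the hyperlink.

```javascript
// Hypothetical sketch of the fallback, not WebToEpub's actual code.
// linkHref:     target of the hyperlink wrapped around the image
// thumbnailSrc: src of the image inside the hyperlink
// contentType:  Content-Type header returned when linkHref was fetched,
//               or null if the fetch failed (an assumption of this sketch)
function chooseImageSource(linkHref, thumbnailSrc, contentType) {
    const isImage = typeof contentType === "string"
        && contentType.trim().toLowerCase().startsWith("image/");
    // Only use the link target when it really is an image;
    // otherwise keep the lower resolution thumbnail.
    return isImage
        ? { url: linkHref, highRes: true }
        : { url: thumbnailSrc, highRes: false };
}

// If the link returns an HTML viewer page rather than an image,
// the thumbnail is kept. The thumbnail URL here is a placeholder.
const choice = chooseImageSource(
    "https://www.imagebam.com/view/ME16P4K",
    "https://example.com/thumbnail.jpg",
    "text/html; charset=utf-8"
);
console.log(choice.url, choice.highRes);
```

Under these assumptions, a link that resolves to a viewer page will always produce the "Using lower resolution image instead" outcome unless the real image URL is extracted from that page separately.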
Describe the bug
I want each post by the thread maker in a forum thread to become a page in the ebook. But when I use the new site adder and there are two or more posts by the author to capture, only the first post is saved.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
All posts by the same thread maker are captured.
Desktop (please complete the following information):
Additional context
Also, if possible, could all of the thread's pages automatically be made available as the list of links used in the next phase?