Please add a parser for jjwxc.net

SinD3825 commented 4 months ago

Hi there! This is a continuation of this previous issue.

Provide URL for web page that contains Table of Contents (list of chapters) of a typical story on the site: https://www.jjwxc.net/onebook.php?novelid=5126430

Did you try using the Default Parser for the site? If not, why not?

Instructions for using the default parser can be found at https://dteviot.github.io/Projects/webToEpub_DefaultParser.html

I did try using the default parser, but what showed up for me in the test window was mostly characters from the Unicode Specials block, mixed with characters from a bunch of unrelated scripts (Armenian, Korean, Arabic, random fractions, etc.).

What settings did you use? What didn't work?

URL of first chapter: https://www.jjwxc.net/onebook.php?novelid=5126430&chapterid=1
CSS selector for element holding content to put into EPUB: div.novelbody
CSS selector for element holding Title of Chapter: N/A
CSS selector for element(s) to remove: N/A

You mentioned in the previous bug report that: "WebToEpub is assuming the content is UTF-8, which doesn't work because the site is (probably) encoded with gb18030. Handling this requires writing a custom parser. That said, it looks like writing a parser would not be hugely difficult. It does not appear to be using JSON for chapter content, and it looks like all chapters of a story might appear on a single Table of Contents page. (If you can confirm this, or provide link to story where Table of Contents spans multiple pages, that would be helpful.) Alternately, I COULD extend the default parser to allow you to tell WebToEpub the encoding, but I suspect that would be beyond most people to handle."
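For anyone following along, the encoding mismatch described in that quote can be reproduced outside the browser. This is only a sketch in Python (not WebToEpub's own code), and the sample string is a made-up stand-in for chapter text. It encodes the text as gb18030, then decodes the same bytes once as UTF-8 and once as gb18030.

```python
# Illustrative only: a made-up stand-in for chapter text, not content from the site.
sample = "中文"

# The bytes a gb18030-encoded page would send for that text.
raw = sample.encode("gb18030")

# Decoding with the wrong charset (UTF-8) cannot recover the text.
# Invalid byte sequences are replaced with U+FFFD, which sits in the
# Unicode Specials block.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the charset the bytes were written in round-trips cleanly.
right = raw.decode("gb18030")

print(repr(wrong))
print(right)
```

On longer text, some byte runs happen to form valid UTF-8 sequences, which would account for the stray characters from unrelated scripts mixed in with the replacement characters.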

I just wanted to confirm that all chapters of the story are on a single table of contents page (e.g. https://www.jjwxc.net/onebook.php?novelid=5126430). As for whether the site uses JSON for chapter content: I dug around in developer tools and found a JSON script, but the part that was highlighted looked like it only handled user logins and other site functions, not chapter content.
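As a rough illustration of what "all chapters on a single table of contents page" would mean for a parser, the chapter list could be gathered by collecting every link whose query string carries both `novelid` and `chapterid`, following the URL pattern of the first chapter given above. This is a Python sketch, not WebToEpub code; the `ChapterLinkCollector` class and the `TOC_HTML` markup are hypothetical and not copied from the site.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urljoin, urlparse


class ChapterLinkCollector(HTMLParser):
    """Collects links whose query string has both novelid and chapterid."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.chapters = []  # (chapter number, absolute URL) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)
        query = parse_qs(urlparse(url).query)
        if "novelid" in query and "chapterid" in query:
            chapter = query["chapterid"][0]
            if chapter.isdigit():
                self.chapters.append((int(chapter), url))


# Hypothetical markup for illustration; the real page layout is an assumption.
TOC_HTML = """
<table>
  <tr><td><a href="onebook.php?novelid=5126430&amp;chapterid=1">Chapter 1</a></td></tr>
  <tr><td><a href="onebook.php?novelid=5126430&amp;chapterid=2">Chapter 2</a></td></tr>
  <tr><td><a href="/login.php">Log in</a></td></tr>
</table>
"""

collector = ChapterLinkCollector("https://www.jjwxc.net/onebook.php?novelid=5126430")
collector.feed(TOC_HTML)
for number, url in sorted(collector.chapters):
    print(number, url)
```

If a story's table of contents did span several pages, a collector like this would only see the chapters on the page it was fed.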

I also have no preference on extending the default parser to put in the encoding; please do whatever you think is best. Again, I really appreciate your hard work, and thanks for doing all this!

If the Default Parser did not work, if you have developer skills, did you try writing a new parser?

Instructions https://dteviot.github.io/Projects/webToEpub_FAQ.html#write-parser

I tried looking through the FAQ, but it didn't mention anything about encoding, so I didn't really know how to cobble anything together. I don't have any coding experience in HTML/JavaScript, I'm sorry :(

If you don't have developer skills, can you ask a friend who does have them if they can do it for you?

N/A

If you tried writing a parser, and it doesn't work. Attach the parser here.

N/A

dteviot commented 3 months ago

@SinD3825

Test versions for Firefox and Chrome have been uploaded to https://drive.google.com/drive/folders/1B_X2WcsaI_eg9yA-5bHJb8VeTZGKExl8?usp=sharing. Pick the one suitable for you, follow the "How to install from Source (for people who are not developers)" instructions at https://github.com/dteviot/WebToEpub/tree/ExperimentalTabMode#user-content-how-to-install-from-source-for-people-who-are-not-developers and let me know how it goes.

Tested with:

https://www.jjwxc.net/onebook.php?novelid=5126430, chapters 1 to 6
https://www.jjwxc.net/onebook.php?novelid=2844055, chapter 1

Please note:

I don't know Chinese, so I'm not 100% sure I've got everything right.
I had to strip formatting, so if it was important anywhere, please let me know.
Stripping garbage (in header and footer) was also tricky, and as I don't know Chinese, I may have removed something important or missed some garbage. In either case, please tell me where it needs fixing.

For my notes: 59 minutes work (Fixing formatting and removing garbage proved more difficult than usual.)

SinD3825 commented 3 months ago

I tried out the Chrome version and tested it on a novel I bought chapters of. On jjwxc, the first ten or so chapters are free and the rest are "VIP" (there's a red [VIP] next to each chapter on the title page), which you have to pay to access. In the case of this novel, the paid chapters start at chapter 20 and go to the end.

It works really well until it hits the paid chapters. There it looks like a similar issue, with everything showing up as characters from the Unicode Specials block, but it also looks like the chapter itself isn't there, just the header and footer. This happens for all paid chapters (chapter 20 to the end); I've only included the result of the first paid chapter. Here's a portion of what it looks like as a screenshot, and I've also attached the complete chapter as a zipped XHTML file that can be viewed in a browser.

[Screenshot: Screen Shot 2024-05-26 at 6 36 52 PM]

I'm not sure if it's useful, but I've also attached the errors that popped up after I tried to epub the entire novel: webtoepub_errors.pdf

In terms of stripping the header and footer, the header is perfect and starts where it should, but it looks like the author notes get cut off along with the footer. Is there any way those could be included? Just a heads up that they aren't in every chapter, so if that's too much of a hassle, I don't mind having components in the footer that aren't supposed to be there. The author notes are the green text after the horizontal rule and before the white channel banner (for example, at the bottom of chapter one).

Everything else looks great! Thank you so much for your hard work! Please let me know if you need anything; I can send my jjwxc login details if you need to access the full VIP chapters. Also, would you like me to start this as a new issue?

SinD3825 commented 3 months ago

Sorry, I think I forgot to attach the file. Here's the chapter 20 screenshot and the actual zipped file.

0020_Chapter_20.xhtml.zip

dteviot commented 3 months ago

@SinD3825

The author notes should now be included in the test build. Please try and let me know. Note, I'm not sure I've got the title correct. (If not, please provide correct text.)

The zipped xhtml file is of no use to me. What I need is the raw HTML of the chapter that the site sends. In other words, I need a HAR file. Basic steps:

1. Open Chrome and browse to the table of contents page.
2. Do whatever you need to get permission to read the chapter.
3. Open Chrome's developer tools.
4. Browse to the paid chapter.
5. Go to the developer tools window, right-click on Network and choose "Save all as HAR file" or something like that.
6. Zip the HAR file and email it to dteviot@gmail.com. (Don't publish the file to this site, it may have private info in it.)

https://community.dynatrace.com/t5/Troubleshooting/How-to-generate-a-HAR-file-and-enable-dtHealthCheck/ta-p/223391
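For reference, a HAR file is plain JSON, so the raw chapter response can be pulled out of it with a few lines of script. The following is a minimal Python sketch, assuming the standard HAR layout (`log.entries[].request.url` and `response.content.text`, with `content.encoding` set to `base64` for bodies stored that way). The `chapter_bodies` helper and the sample entry, including its chapter 20 URL, are hypothetical.

```python
import base64


def chapter_bodies(har, url_fragment):
    """Yield (url, raw bytes) for HAR entries whose request URL contains url_fragment."""
    for entry in har["log"]["entries"]:
        url = entry["request"]["url"]
        if url_fragment not in url:
            continue
        content = entry["response"].get("content", {})
        text = content.get("text")
        if text is None:
            continue  # the body was not captured for this entry
        if content.get("encoding") == "base64":
            yield url, base64.b64decode(text)
        else:
            # Body was stored as already-decoded text; re-encode so the
            # helper always returns bytes.
            yield url, text.encode("utf-8")


# Hypothetical HAR content for illustration; a real capture has many more fields.
sample_har = {
    "log": {
        "entries": [
            {
                "request": {
                    "url": "https://www.jjwxc.net/onebook.php?novelid=5126430&chapterid=20"
                },
                "response": {
                    "content": {
                        "mimeType": "text/html",
                        "encoding": "base64",
                        "text": base64.b64encode("中文".encode("gb18030")).decode("ascii"),
                    }
                },
            },
            {
                "request": {"url": "https://www.jjwxc.net/favicon.ico"},
                "response": {"content": {}},
            },
        ]
    }
}

for url, body in chapter_bodies(sample_har, "chapterid=20"):
    print(url, len(body), body.decode("gb18030"))
```

With a real capture, the dictionary would come from `json.load` on the saved file instead of `sample_har`. Since the file can contain cookies and other private data, it is best inspected locally.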

Warning: it might be a couple of weeks before I have time to properly examine it.

Tested with

https://www.jjwxc.net/onebook.php?novelid=4733839, chapters 1 to 4

For my notes: 55 minutes (extra)
