Open ChenNdG opened 2 years ago
I think it's just because of the website update, but I'm reporting it here anyway as a heads-up. Thank you for your work and this excellent tool that helps me a lot ^^
@ChenNdG Looks like they've made changes that make it much more difficult for WebToEpub to work, if not impossible. Don't expect anything soon. Frankly, I'm tempted to just remove the site.
Because the DOM is now lazy loaded, right? That's the only issue?
In my work I use the Workona extension - it loads webpages in a minimized tab. I guess I could try hacking around in WebToEpub and see if I can port that solution - that way we could open the lazy-loaded pages in the background before scraping them.
It would most likely be slower than the current solution, but nothing else comes to mind or turns up on Google except driving the page with Puppeteer via devtools (I'm a developer, but I have never developed a Chrome extension before) -
https://github.com/puppeteer/puppeteer/tree/6522e4f524bdbc1f1b9d040772acf862517ed507/utils/browser
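The background-tab idea could look roughly like this. This is only a sketch: `waitForLoad`, `scrapeInBackgroundTab`, and the `.chapter-content` selector are invented for illustration, and `chrome.scripting.executeScript` assumes a Manifest V3 extension (MV2 would use `chrome.tabs.executeScript` instead).

```javascript
// Resolves once the tab reports "complete"; the extra delay gives the
// SPA's lazy-loaded content time to render before we read the DOM.
function waitForLoad(tabId, settleMs = 1500) {
  return new Promise((resolve) => {
    chrome.tabs.onUpdated.addListener(function listener(id, info) {
      if (id === tabId && info.status === "complete") {
        chrome.tabs.onUpdated.removeListener(listener);
        setTimeout(resolve, settleMs);
      }
    });
  });
}

async function scrapeInBackgroundTab(url) {
  // active: false keeps the tab out of the user's way while it loads
  const tab = await chrome.tabs.create({ url, active: false });
  await waitForLoad(tab.id);
  // Inject a function into the page to pull out the rendered chapter HTML.
  // ".chapter-content" is a guessed selector, not Wuxiaworld's real markup.
  const [result] = await chrome.scripting.executeScript({
    target: { tabId: tab.id },
    func: () => document.querySelector(".chapter-content")?.innerHTML ?? "",
  });
  await chrome.tabs.remove(tab.id);
  return result.result;
}
```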
@sztrzask From a quick look at the network activity, I suspect
There's also the issue that, because it looks like Wuxiaworld is doing this to prevent copying, getting WebToEpub to bypass the protection is a violation of Google's terms and conditions and will result in WebToEpub and me being permabanned by Google if the Wuxiaworld site owners were to complain.
@ChenNdG I'm pretty sure the site could be copied by one of the tools that drives a browser via Selenium. It looks like https://github.com/Flameish/Novel-Grabber/issues/326 is currently in the process of being updated to handle the Wuxiaworld site changes.
Ok, thanks for the answer @dteviot . I'll just pray that everything goes well.
@dteviot I tried hacking around and:
getting WebToEpub to bypass the protection is a violation of Google's terms and conditions and will result in WebToEpub and me being permabanned by Google if the Wuxiaworld site owners were to complain.
What if I were to add the Wuxiaworld parser as a plugin? I.e. as another extension that can communicate with the WebToEpub extension? That way it would be separate from WebToEpub, it would live in a separate repo, etc.
That would, however, require some work in the WebToEpub extension - adding background.js communication and some changes to the main flow to allow for it.
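For what it's worth, Chrome does support cross-extension messaging, so the plumbing could look something like the sketch below. The extension ID, message shape, and handler are all hypothetical - WebToEpub has no such protocol today.

```javascript
const WEBTOEPUB_ID = "abcdefghijklmnop"; // hypothetical extension ID

// In the plugin: hand a parsed chapter over to WebToEpub's background script.
function sendChapter(chapter) {
  chrome.runtime.sendMessage(WEBTOEPUB_ID, { type: "chapter", chapter });
}

// In WebToEpub's background.js: accept messages from other extensions.
// chrome.runtime.onMessageExternal fires for messages sent by other
// extensions (and can be restricted via "externally_connectable").
function handleExternalMessage(message, sender, sendResponse) {
  if (message.type === "chapter") {
    // hand message.chapter to the epub-building flow here
    sendResponse({ ok: true });
  }
}
// chrome.runtime.onMessageExternal.addListener(handleExternalMessage);
```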
I'd be happy to introduce all those changes to WebToEpub and then slowly start working on a new repo as long as you agree with the idea - I'd rather contribute than fork :)
I would have one request, though, if I were to introduce the changes - I'd like to migrate the codebase from JavaScript to TypeScript first.
@sztrzask
I'd like to migrate the codebase from JavaScript to TypeScript first.
Doing that adds complication to the Firefox approval process. So, I'm going to have to say no.
I then managed to open a new chapter in a new tab, and using the new ContentScripts command I managed to obtain the chapter text and pass it to the parser.
I've considered something like this in the past. (Having a parser that opens the page in a tab and then injects a content script into it to extract the content.) So, I'd have a base class, and then derive from it a parser for each site.
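The base-class idea could be sketched like this. All the class and method names are invented here, and the selector is a guess; it just shows the shape of "base class owns the tab lifecycle, subclass owns the extraction".

```javascript
class TabParserBase {
  // Subclasses override this; it is serialized and runs inside the chapter page.
  static extractContent() {
    return document.body.innerHTML;
  }

  async fetchChapter(url) {
    const tab = await chrome.tabs.create({ url, active: false });
    const [result] = await chrome.scripting.executeScript({
      target: { tabId: tab.id },
      // this.constructor resolves to the derived class, so the
      // site-specific extractor is the one that gets injected
      func: this.constructor.extractContent,
    });
    await chrome.tabs.remove(tab.id);
    return result.result;
  }
}

// A per-site parser only customizes how to find the wanted content.
class WuxiaworldTabParser extends TabParserBase {
  static extractContent() {
    // ".chapter-content" is illustrative, not the site's real markup
    return document.querySelector(".chapter-content")?.innerHTML ?? "";
  }
}
```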
My thoughts for how to work around Google's rules.
Doing that adds complication to the Firefox approval process. So, I'm going to have to say no.
Understood, no TypeScript migration. Out of curiosity, how does it complicate the process? I tried googling it, but found nothing.
Instead of
"Site is employing copy protection. So can't copy. See XXXX for more details.
maybe something more ambigous because "lazy loading" does not mean "copy protection".
"WebToEpub" does not support this site. It is, however, possible, that a 3rd party plugin does. See XXXX for more details on how to find 3rd party plugins
I'll try to prepare a draft of architecture and process diagrams before weekend so that we can agree upon a solution and figure out plugin-to-plugin communication interfaces before we start coding. Does that work for you?
@sztrzask
Out of curiosity, how does it complicate the process?
Basically, when submitting, you need to tell them that the code is transpiled, and they require the source and build process, and they check that the source matches the transpiled code submitted. And approval takes much, much longer.
maybe something more ambiguous, because "lazy loading" does not mean "copy protection".
Or just "Google's Terms and Conditions don't allow WebToEpub to work on this site. Refer XXXX for more details."
I wasn't thinking of a plugin process, just a different build. E.g. have two files for the Wuxiaworld parser: the "legal" one that's a "site not supported" stub, and a second one that is the working parser. Both versions of the file are in Git (with the "illegal" one in a subdirectory named, say, "advanced" or something). Then the build script creates the "legal" WebToEpub by leaving the advanced files out, and it also builds the "advanced" versions of the parser, which are not submitted to Google/Mozilla but just made available as pre-built "install from source" packages from a file share. But if you want a plug-in, feel free to sketch it out in more detail.
@sztrzask Also, it's not the lazy loading that's the copy protection, it's the use of CORS.
"legal" / "illegal"
I don't think those words apply here, as the legality of web scraping differs from country to country, and I'd prefer if you didn't use them :) Let's instead say "potentially breaking the Chrome/Mozilla extension ToS",
as that's what we're afraid of. I know that you meant it that way, but let's keep it clear for anyone else reading our conversation.
I wasn't thinking of a plugin process, just different build.
At first I was thinking it would be easier for developers, because if we were to create a fork or feature branch, then we would have to keep it updated - however, now that I've looked at the repo history, it seems that WebToEpub is feature complete and any development you do is just creating new parsers, right? If that's true, then yeah, it would be much easier to just fork or feature branch and create a version there that might not fulfill the Chrome/Mozilla ToS.
Also, it's not the lazy loading that's the copy protection, its the use of CORS.
I'm sorry for the miscommunication. I meant that due to the website being lazy loaded, it's now harder to scrape it.
CORS isn't copy protection either. CORS is just the browser mechanism for relaxing SOP.
To be honest, there's no copy protection, or rather scraping protection, mechanism in SOP either.
Wuxiaworld changed its website to an SPA. SPAs are harder to scrape because their content is dynamically generated (or rather lazy loaded).
Anywho, this is moot, as we both agree that their current website is just harder to scrape, due to it:
- being lazy loaded (so we cannot scrape from the page source)
- using SOP (CORS) (so we cannot execute fetch calls from the webextension)
- using gRPC (so we cannot execute fetch calls from within an injected ContentScript, as we have no idea what the proto request payloads are, since they are encrypted, and we do not want to break the encryption nor try to monkey-patch the website scripts)
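To illustrate the SOP/CORS point: a fetch issued from the extension's own origin (e.g. background.js) is subject to the server's CORS policy, whereas a content script's fetch is made with the page's origin. A minimal sketch of the extension side, assuming an illustrative URL - in the browser a CORS rejection surfaces as a thrown TypeError, which is what the catch handles:

```javascript
// In background.js: a cross-origin fetch from the extension's origin.
// Unless the server opts in with Access-Control-Allow-Origin, the
// browser rejects it; this helper turns that failure into null.
// (The same fetch from an injected content script would carry the
// page's origin instead.)
async function fetchFromExtension(url) {
  try {
    const resp = await fetch(url);
    if (!resp.ok) return null;
    return await resp.text();
  } catch (e) {
    return null; // network/CORS failure
  }
}
```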
@sztrzask Changed to compliant / non-compliant
Yes, I'm mostly updating/adding parsers. My thought was one code base, but have the build generate compliant and non-compliant versions. And by adding a base parser with the main logic for "create a tab for each chapter, then fetch all content from the tab", it will be easier to handle other sites like this. (Just need to derive the class and customize the logic to find the wanted content.) No need for a branch.
While CORS wasn't designed for copy protection, I've encountered a number of sites using it to do so, making the gRPC calls to get content fail unless they come from the web page.
Any news on this? I didn't understand the thread very much lol
@NuraGtH Short answer: no news. I'm not interested in spending the amount of work required to get WebToEpub to work for this site, and I haven't heard anything from sztrzask for more than a month. I suggest you look at one of the other tools. e.g. https://github.com/Flameish/Novel-Grabber (although I don't think Wuxiaworld has been fixed there either.)
Rip, they are uploading a ton of new stuff this month, and it's the best translated stuff there is. Welp, thanks xd
I'm currently busy working on my own Calibre UI app - I might add a Wuxiaworld parser there when I need it. I decided against coding it for this tool, as the solution is convoluted and the investment needed doesn't seem worth it.
As an easy workaround you can use one of the sites that pirate WuxiaWorld. Many are supported by WebToEpub. Just search for the title using a search engine that isn't Google, like DuckDuckGo (Google seems to hide the sites that get DMCA takedowns). The sites sometimes mangle footnotes etc. but are usable if you just need something to read offline.
Lightnovel-Crawler currently works for it, but it doesn't appear to have login support yet, so the completed novels can't be fully accessed even if your account has access to them.
For my notes,
Describe the bug: Since the update of the Wuxiaworld website on 19/01/22, no chapters can be packaged into an epub.
Expected behavior: The chapters are available and can be packaged into an epub.