dteviot / WebToEpub

A simple Chrome (and Firefox) Extension that converts Web Novels (and other web pages) into an EPUB.

The extension no longer works on wuxiaworld #676

Open ChenNdG opened 2 years ago

ChenNdG commented 2 years ago

Describe the bug: After the Wuxiaworld website update on 19/01/22, there are no longer any chapters that can be packaged into an EPUB.

To Reproduce Steps to reproduce the behavior:

  1. Go to https://www.wuxiaworld.com/novel/XXXX
  2. Click on the extension icon
  3. See the error

Expected behavior: The chapters are available and can be packaged into an EPUB.

Screenshots: (screenshot of the error attached)



ChenNdG commented 2 years ago

I think it's just because of the website update, but I'm putting it here anyway to flag it. Thank you for your work on this excellent tool that helps me a lot ^^

dteviot commented 2 years ago

@ChenNdG Looks like they've made changes that make it much more difficult, if not impossible, for WebToEpub to work. Don't expect anything soon. Frankly, I'm tempted to just remove the site.

sztrzask commented 2 years ago

Because the DOM is now lazy loaded, right? That's the only issue?

sztrzask commented 2 years ago

In my work I use the Workona extension - it loads webpages in a minimized tab. I guess I could try hacking around WebToEpub and see if I can port that solution - that way we can open the lazy-loaded pages in the background before scraping them.

It will most likely be slower than the current solution, but nothing else comes to mind or turns up on Google except driving Puppeteer via devtools (I'm a developer, but I have never developed a Chrome extension before) -

https://github.com/puppeteer/puppeteer/tree/6522e4f524bdbc1f1b9d040772acf862517ed507/utils/browser

dteviot commented 2 years ago

@sztrzask From a quick look at the network activity, I suspect

  1. The site is using grpc-web to encode communications, and I'm not familiar with that.
  2. They're using CORS to block fetch calls from anything other than the web page (see the illustration after this list). And bypassing CORS in a web browser is designed to be impossible, as it's a security issue. (In theory, a web extension might be able to do it with the right permissions and options, but I've never been able to make it work, and even if it did, getting Google to approve an extension that requires those permissions is... extremely difficult.)
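
To illustrate point 2 (the URL below is a placeholder, not Wuxiaworld's real endpoint): a cross-origin fetch made from the extension gets blocked, while the same request made from the site's own pages works.

```javascript
// Illustration only: the URL is a placeholder, not Wuxiaworld's real API.
// From the extension's background page this is a cross-origin request, so the
// browser withholds the response unless the server's Access-Control-Allow-Origin
// header permits it.
fetch("https://api.example-novel-site.com/grpc/GetChapter")
    .then(response => response.text())
    .then(body => console.log("got", body.length, "bytes"))
    .catch(err => console.log("blocked (likely CORS):", err.message));
// The same request made by a script running on the site's own pages is
// same-origin, so it succeeds - which is why the content is only reachable
// from inside a tab that is actually showing the site.
```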

There's also the issue that, because it looks like Wuxiaworld is doing this to prevent copying, getting WebToEpub to bypass the protection is a violation of Google's terms and conditions, and will result in WebToEpub and me being permabanned by Google if the Wuxiaworld site owners were to complain.

@ChenNdG I'm pretty sure the site could be copied by one of the tools that drives a browser via Selenium. Looks like https://github.com/Flameish/Novel-Grabber/issues/326 is currently in the process of being updated to handle the Wuxiaworld site changes.

ChenNdG commented 2 years ago

Ok, thanks for the answer @dteviot . I'll just pray that everything goes well.

sztrzask commented 2 years ago

@dteviot I tried hacking around and:

getting WebToEpub to bypass the protection is a violation of Google's terms and conditions will result in WebToEpub and me being permabaned by Google if the Wuxiaworld site owners were to complain.

What if I were to add the Wuxiaworld parser as a plugin? I.e. as another extension that can communicate with the WebToEpub extension? That way it would be separate from WebToEpub, it would be in a separate repo, etc.

That would, however, require some work in the WebToEpub extension - adding background.js communication and some changes to the main flow to allow for it.
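
Roughly the kind of plugin-to-plugin messaging I have in mind (a hypothetical sketch using Chrome's cross-extension messaging API; the extension ID, message shape, and the fetchRenderedChapter helper are all invented):

```javascript
// In the hypothetical Wuxiaworld plugin extension: answer requests from WebToEpub.
chrome.runtime.onMessageExternal.addListener((request, sender, sendResponse) => {
    if (request.type === "fetchChapter") {
        // fetchRenderedChapter() is an invented helper: it would open the chapter
        // in a background tab, wait for the SPA to finish lazy loading, and
        // return the rendered HTML.
        fetchRenderedChapter(request.url).then(html => sendResponse({ html }));
        return true; // keep the message channel open for the async response
    }
});

// In WebToEpub's background.js: ask the plugin for a chapter by its extension ID.
const WUXIAWORLD_PLUGIN_ID = "<plugin extension id here>"; // placeholder
function requestChapterFromPlugin(url) {
    return new Promise(resolve =>
        chrome.runtime.sendMessage(WUXIAWORLD_PLUGIN_ID,
            { type: "fetchChapter", url }, resolve));
}
```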

I'd be happy to introduce all those changes to WebToEpub and then slowly start working on a new repo as long as you agree with the idea - I'd rather contribute than fork :)

I would have 1 request though if I were to introduce the changes - I'd like to migrate the codebase from JavaScript to TypeScript first.

dteviot commented 2 years ago

@sztrzask

I'd like to migrate codebase from javascript to typescript first.

Doing that adds complication to the Firefox approval process. So, I'm going to have to say no.

I then managed to open a new chapter in a new tab, and using the new ContentScripts command I managed to obtain the chapter text and put it in the parser.

I've considered something like this in the past. (Having a parser that opens the page in a tab, and then injects a content script into it to extract the content.) So, I'd have a base class, and then derive a parser from it for each site.
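
Roughly what I have in mind for the base class (a sketch only, not actual WebToEpub code; the class names, selector, and fixed delay are placeholders, and it uses the Manifest V2 chrome.tabs APIs):

```javascript
// Sketch of a base parser that opens each chapter in a background tab and
// pulls the rendered DOM out of it. Names are invented; real code would also
// need error handling and the "tabs" permission.
class TabBasedParser {
    async fetchChapterDom(url) {
        const tab = await new Promise(resolve =>
            chrome.tabs.create({ url, active: false }, resolve));
        await this.waitForContentToLoad(tab.id);
        // executeScript returns an array with one result per frame.
        const [html] = await new Promise(resolve =>
            chrome.tabs.executeScript(tab.id,
                { code: "document.documentElement.outerHTML" }, resolve));
        chrome.tabs.remove(tab.id);
        return new DOMParser().parseFromString(html, "text/html");
    }

    // Derived classes would override this to wait until the site's
    // lazy-loaded chapter text has actually appeared.
    async waitForContentToLoad(tabId) {
        return new Promise(resolve => setTimeout(resolve, 2000));
    }
}

// Per-site parser: only the content-finding logic is customized.
class WuxiaworldTabParser extends TabBasedParser {
    findContent(dom) {
        return dom.querySelector(".chapter-content"); // illustrative selector
    }
}
```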

My thoughts on how to work around Google's rules:

  1. Add a function to these "illegal" parsers that puts up a message along the lines of "Site is employing copy protection. So can't copy. See XXXX for more details." See the mtlnation parser.
  2. Following the XXXX link gives instructions for downloading from source, plus removing the function.
  3. Possibly have a build step that makes versions of the extension without the function, and add them to a Google Drive folder.

sztrzask commented 2 years ago

Doing that adds complication to the Firefox approval process. So, I'm going to have to say no.

Understood, no TypeScript migration. Out of curiosity, how does it complicate the process? I tried googling it, but found nothing.

Instead of

"Site is employing copy protection. So can't copy. See XXXX for more details.

maybe something more ambiguous, because "lazy loading" does not mean "copy protection".

"WebToEpub" does not support this site. It is, however, possible, that a 3rd party plugin does. See XXXX for more details on how to find 3rd party plugins

I'll try to prepare a draft of architecture and process diagrams before the weekend so that we can agree on a solution and figure out plugin-to-plugin communication interfaces before we start coding. Does that work for you?

dteviot commented 2 years ago

@sztrzask

Out of curiosity, how does it complicate the process?

Basically, when submitting, you need to tell them that the code is transpiled, they require the source and the build process, and they check that the source matches the transpiled code submitted. And approval takes much, much longer.

maybe something more ambiguous, because "lazy loading" does not mean "copy protection".

Or just "Google's Terms and Conditions don't allow WebToEpub to work on this site. Refer XXXX for more details."

I wasn't thinking of a plugin process, just a different build. e.g. Have two files for the Wuxiaworld parser. The legal one is a "site not supported" stub. The second one is the working parser. Both versions of the file are in Git (with the illegal one in a sub-directory marked, say, "advanced" or something). Then the build script creates the "legal" WebToEpub by leaving the advanced files out. And it also builds the "advanced" versions of the parser, which are not submitted to Google/Mozilla, but just available as pre-built "install from source" packages from a file share. But if you want a plug-in, feel free to sketch it out in more detail.
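
As a sketch only (the directory layout and file names here are assumptions, not the repo's actual structure), the build step could be a small Node script along these lines:

```javascript
// build-parsers.js - illustrative only; paths and directory names are assumptions.
const fs = require("fs");
const path = require("path");

function copyTree(srcDir, destDir, includeAdvanced) {
    fs.mkdirSync(destDir, { recursive: true });
    for (const entry of fs.readdirSync(srcDir, { withFileTypes: true })) {
        const src = path.join(srcDir, entry.name);
        const dest = path.join(destDir, entry.name);
        if (entry.isDirectory()) {
            // "advanced" is the assumed sub-directory holding the working,
            // non-compliant parsers; skip it for the store build.
            if (entry.name === "advanced" && !includeAdvanced) continue;
            copyTree(src, dest, includeAdvanced);
        } else {
            fs.copyFileSync(src, dest);
        }
    }
}

copyTree("plugin/js/parsers", "dist/compliant/parsers", false); // submitted to Google/Mozilla
copyTree("plugin/js/parsers", "dist/advanced/parsers", true);   // "install from source" package
// (In the advanced build, the working parser would still have to replace the
// stub file - that swap is omitted from this sketch.)
```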

dteviot commented 2 years ago

@sztrzask Also, it's not the lazy loading that's the copy protection, it's the use of CORS.

sztrzask commented 2 years ago

legal / illegal

I don't think those words apply here, as web-scraping legality is something that differs from country to country, and I'd prefer if you didn't use them :) Let's instead say potentially breaking the Chrome/Mozilla extension ToS, as that's what we're afraid of. I know that you meant it that way, but let's keep it clear for anyone else reading our conversation.

I wasn't thinking of a plugin process, just different build.

At first I was thinking it would be easier for developers, because if we were to create a fork or feature branch, then we would have to keep it updated - however, now that I've looked at the repo history, it seems that WebToEpub is feature complete and any development you do is just creating new parsers, right? If that's true, then yeah, it would be much easier to just fork or feature branch and create a version that might not fulfill the Chrome/Mozilla ToS there.

Also, it's not the lazy loading that's the copy protection, its the use of CORS.

I'm sorry for the miscommunication. I meant that due to the website being lazy loaded, it's now harder to scrape it.

CORS isn't copy protection either. CORS is just an in-browser tool for "bypassing" SOP (the same-origin policy).

To be honest, there's no copy protection, or rather scraping protection, mechanism in SOP either.

Wuxiaworld changed its website to an SPA (single-page application). SPAs are harder to scrape, because their content is dynamically generated (or rather lazy loaded).
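
For illustration (the selector is a placeholder, not anything Wuxiaworld-specific), this is what "harder to scrape" means for a content script: it can't just read the DOM on load, it has to wait for the chapter to be rendered, e.g. with a MutationObserver:

```javascript
// Content-script sketch: wait for the SPA to render the chapter before reading
// it. The ".chapter-content" selector is a placeholder.
function waitForChapter(selector, timeoutMs = 15000) {
    return new Promise((resolve, reject) => {
        const existing = document.querySelector(selector);
        if (existing) return resolve(existing);
        const observer = new MutationObserver(() => {
            const found = document.querySelector(selector);
            if (found) {
                observer.disconnect();
                resolve(found);
            }
        });
        observer.observe(document.body, { childList: true, subtree: true });
        setTimeout(() => {
            observer.disconnect();
            reject(new Error("chapter did not appear in time"));
        }, timeoutMs);
    });
}

waitForChapter(".chapter-content").then(el => console.log(el.innerText.length));
```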

Anywho, this is moot, as we both agree that their current website is just harder to scrape, due to all of the above.

dteviot commented 2 years ago

@sztrzask Changed to compliant / non-compliant

Yes, I'm mostly updating/adding parsers. My thought was one code base, but have the build generate compliant and non-compliant versions. And by adding a base parser with the main logic for "create a tab for each chapter, then fetch all content from the tab", it will be easier to handle other sites like this. (Just need to derive the class and customize the logic to find the wanted content.) No need for a branch.

While CORS wasn't designed for copy protection, I've encountered a number of sites using it that way, making the grpc calls to get content fail unless they come from the web page.

NuraGtH commented 2 years ago

Any news on this? I didn't understand the thread very much lol

dteviot commented 2 years ago

@NuraGtH Short answer: no news. I'm not interested in spending the amount of work required to get WebToEpub to work for this site, and I haven't heard anything from sztrzask for more than a month. I suggest you look at one of the other tools. e.g. https://github.com/Flameish/Novel-Grabber (although I don't think Wuxiaworld has been fixed there either.)

NuraGtH commented 2 years ago

Rip, they are uploading a ton of new stuff this month, and it's the best translated stuff there is, welp thanks xd

sztrzask commented 2 years ago

I'm currently busy working on my own Calibre UI app - I might add a Wuxiaworld parser there when I need it. I decided against coding it for this tool, as the solution is convoluted and the investment needed doesn't seem worth it.

Mathnerd314 commented 2 years ago

As an easy workaround you can use one of the sites that pirate WuxiaWorld. Many are supported by WebToEpub. Just search for the title using a search engine that isn't Google, like DuckDuckGo (Google seems to hide the sites that get DMCA takedowns). The sites sometimes mangle footnotes etc. but are usable if you just need something to read offline.

Mentole commented 2 years ago

Lightnovel-Crawler currently works for it, but it doesn't appear to have login support yet, so completed novels can't be fully accessed even if your account has access to them.

dteviot commented 1 year ago

For my notes:

  1. The primary code in Lightnovel-Crawler that handles WuxiaWorld is https://github.com/dipu-bd/lightnovel-crawler/blob/master/sources/en/w/wuxiacom.py
  2. It looks like a grpc library is being used to do the heavy lifting.
  3. The code still isn't small.