Suggestion: option to update parsers after installation

asifm91 commented 5 years ago

Similar to how adblock extensions update their blocklist/whitelist, an option to update only the parsers, without needing to update the extension, would be helpful. This would let users convert their favourite websites into ebooks satisfactorily without waiting for the next release or installing from source.

dteviot commented 5 years ago

I've thought about doing this, but there are some difficulties I have not been able to solve. For example, Google does NOT like extensions that download and run code that Google has not vetted. Adblock is getting data files, so that's OK, but parsers are executable code, and hence a big security issue. However, if you've got any suggestions on how I might do this, I'd love to hear them.

asifm91 commented 5 years ago

Is downloading JSON files allowed? I can think of a JSON structure with query selector strings. It should work for simple sites like wuxiaworld.co:

{
  "bookTitle": {
    "selector": "div#info h1",
    "selectorIndex": "0"
  },
  "author": {
    "selector": "div#info p",
    "selectorIndex": "0",
    "substringStart": "7"
  },
  "chapterTitle": {
    "selector": "div.bookname h1",
    "selectorIndex": "0"
  },
  "chapterContent": {
    "selector": "div#content",
    "selectorIndex": "0"
  },
  ...
}
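
Since a definition like this would be downloaded as data rather than code, it could be checked before it is used. The following is only a sketch: `validateParserData` is a hypothetical helper, not part of WebToEpub, and it assumes every field follows the shape shown above (a `selector` string plus optional numeric `selectorIndex`, `substringStart` and `substringEnd`).

```javascript
// Hypothetical helper (not part of WebToEpub): returns a list of problems
// found in a downloaded parser definition; an empty list means it looks usable.
function validateParserData(parserData) {
  if (typeof parserData !== "object" || parserData === null) {
    return ["parser definition must be an object"];
  }
  const errors = [];
  for (const [name, field] of Object.entries(parserData)) {
    if (typeof field !== "object" || field === null) {
      errors.push(`${name} must be an object`);
      continue;
    }
    if (typeof field.selector !== "string" || field.selector.length === 0) {
      errors.push(`${name}.selector must be a non-empty string`);
    }
    // The example stores numbers as strings, so accept "7" as well as 7.
    for (const key of ["selectorIndex", "substringStart", "substringEnd"]) {
      if (key in field && !/^\d+$/.test(String(field[key]))) {
        errors.push(`${name}.${key} must be a non-negative integer`);
      }
    }
  }
  return errors;
}
```

A definition would only be handed to the generic parser when the returned list is empty.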

The generic code would be something like:

extractTitleImpl(dom) {
  let data = parserData.bookTitle;
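  // querySelectorAll() returns a NodeList; item() gives null if the index is out of range
  let element = dom.querySelectorAll(data.selector).item(Number(data.selectorIndex));
  if(element !== null) {
    let text = element.textContent;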
    if(data.hasOwnProperty('substringStart')) {
      text = data.hasOwnProperty('substringEnd') ? text.substring(data.substringStart, data.substringEnd) : text.substring(data.substringStart);
    }
    return text;
  }
  return super.extractTitleImpl(dom);
}

extractAuthor(dom) {
  let data = parserData.author;
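  // querySelectorAll() returns a NodeList; item() gives null if the index is out of range
  let element = dom.querySelectorAll(data.selector).item(Number(data.selectorIndex));
  if(element !== null) {
    let text = element.textContent;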
    if(data.hasOwnProperty('substringStart')) {
      text = data.hasOwnProperty('substringEnd') ? text.substring(data.substringStart, data.substringEnd) : text.substring(data.substringStart);
    }
    return text;
  }
  return super.extractAuthor(dom);
}
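
The two methods above differ only in which entry of the definition they read, so the shared part could be factored out. A minimal sketch, assuming a hypothetical helper named `extractText` (not an existing WebToEpub function) and a `dom` argument that only needs to provide `querySelectorAll()`:

```javascript
// Hypothetical helper: applies one field of the JSON definition to a document.
// Returns the extracted text, or null so the caller can fall back to its default.
function extractText(dom, data) {
  if (data === undefined) {
    return null;
  }
  const index = Number(data.selectorIndex || 0);
  // Indexing past the end of a NodeList (or array) yields undefined.
  const element = dom.querySelectorAll(data.selector)[index];
  if (element === undefined) {
    return null;
  }
  let text = element.textContent;
  if ("substringStart" in data) {
    const start = Number(data.substringStart);
    text = "substringEnd" in data
      ? text.substring(start, Number(data.substringEnd))
      : text.substring(start);
  }
  return text;
}
```

With this in place, a method such as extractAuthor(dom) could reduce to returning `extractText(dom, parserData.author)` and calling `super.extractAuthor(dom)` when the result is null.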

asifm91 commented 5 years ago

An alternative might be to allow users to write (or copy-paste) parser code in a textarea, similar to how they can define a stylesheet. I'm not sure whether executing such code is allowed.

asifm91 commented 5 years ago

After reading through Chrome's CSP documentation and this tutorial, I'm thinking something like the following might work:

Create a master file containing all parser code added since the latest release. Let's call this file newParsers.js. The raw URL of this file would be something like: https://raw.githubusercontent.com/dteviot/WebToEpub/ExperimentalTabMode/plugin/js/newParsers.js
Update manifest.json to include: "content_security_policy": "script-src 'self' https://raw.githubusercontent.com; object-src 'self'",
Load the script using the method shown in the tutorial, by injecting a script element with src pointing to https://raw.githubusercontent.com/dteviot/WebToEpub/ExperimentalTabMode/plugin/js/newParsers.js
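
The last step could look roughly like the sketch below. It is not the tutorial's code; `injectRemoteScript` is a hypothetical name, and it assumes a Manifest V2 extension page whose content_security_policy allows the script's host, as in the previous step.

```javascript
// Hypothetical helper: adds a <script> element pointing at a remote file.
// `doc` is the extension page's document; the promise settles when the
// browser reports that the script loaded or failed.
function injectRemoteScript(doc, url) {
  return new Promise((resolve, reject) => {
    const script = doc.createElement("script");
    script.src = url;
    script.onload = () => resolve(script);
    script.onerror = () => reject(new Error(`Failed to load ${url}`));
    doc.head.appendChild(script);
  });
}
```

In the extension this would be called as `injectRemoteScript(document, url)` with the raw newParsers.js URL.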

Alternatively, according to this Stack Overflow answer, we can use an XMLHttpRequest to download newParsers.js and the chrome.tabs.executeScript method to execute it. I'm not sure about this solution, but the CSP doc does not say anything against it.
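
A rough sketch of that alternative, assuming the Manifest V2 form of chrome.tabs.executeScript (which accepts a `code` string). `runRemoteParsers` is a hypothetical name, and the `chrome` object and XMLHttpRequest constructor are passed in as parameters only so the function can be exercised outside a browser:

```javascript
// Hypothetical helper: downloads a script as text, then runs it in a tab.
// chromeApi is the extension's global `chrome`; XhrCtor is XMLHttpRequest.
function runRemoteParsers(url, tabId, chromeApi, XhrCtor) {
  return new Promise((resolve, reject) => {
    const xhr = new XhrCtor();
    xhr.open("GET", url);
    xhr.onload = () => {
      if (xhr.status !== 200) {
        reject(new Error(`Download failed with status ${xhr.status}`));
        return;
      }
      // Execute the downloaded text as code in the given tab.
      chromeApi.tabs.executeScript(tabId, { code: xhr.responseText }, () => resolve());
    };
    xhr.onerror = () => reject(new Error("Network error"));
    xhr.send();
  });
}
```

Inside the extension the call would be `runRemoteParsers(url, tabId, chrome, XMLHttpRequest)`.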

dteviot commented 5 years ago

@asifm91 Thank you for this. It will probably be a few days before I can fully respond to you.

dteviot commented 5 years ago

@asifm91 Thank you for your research. It certainly shows that Google currently allows extensions to download and exec() code that Google has not validated.
Note, this is one of the mechanisms used to get malware into extensions, so I suspect that any extension that does this is going to have Google taking a much harder look at it. This might result in the extension being manually validated, which would slow down the approval process, or even in it being removed from the Chrome store. FWIW, when Mozilla was doing a manual validation of WebToEpub, it took weeks to get an update approved. It’s now passing the automated approval process and is approved within a few hours. I’m also sure I read something along the lines that Google was going to stop allowing it sometime soon, although I can’t seem to find it. Note, it may have been Firefox. I do get warnings from Mozilla when I send an update, due to the “exec()” being used in the zip library WebToEpub uses.

Taking a step back, you say:

This will enable users to convert their favourite websites into ebooks satisfactorily without waiting for the next release or installing from the source.

This suggests the simplest solution is to update WebToEpub sooner. In theory, there’s no reason I can’t release it more often. Uploading to the Chrome store only takes a few minutes of my time. And Google seems to approve it automatically and make it available within 60 minutes.

So, there’s no technical reason I could not do this. It’s just that I personally don’t like to release code that I’m not completely confident about. (I’m always concerned that I’ll release a broken version of WebToEpub, and with 9,000+ people using it each week, I’ll get a lot of complaints. Or they’ll try using a not fully complete feature, have it fail, and again, complain.)

I could probably deal with that by having two versions of WebToEpub on the Chrome store, e.g. “Stable” and “Development” releases. (But that requires some extra bookkeeping work on my part.)

Allowing users to write their own parsers is what the Default parser is supposed to do. I’ve got another item on my ToDo list to improve that. See: https://github.com/dteviot/WebToEpub/issues/202 However, the problem I see with that is “what level of skill does a user have?” Most people don’t have much programming knowledge, so it needs to be kept very simple, which means it’s not enough for many sites. (Quite a few need more than simple searches for tags.) And if the site needs actual programming, then for someone with JavaScript programming skills, installing from source isn’t a problem. (At least I think it isn’t, and I don’t recall hearing from anyone that it is an issue.)

So, it’s kind of a question of:

How big a problem is it?
How much effort will it take to solve?
Is it worth the effort to try and fix? Or is there a better use for my limited time?

dteviot / WebToEpub

Suggestion: option to update parsers after installation #213