danny0838 / webscrapbook

A browser extension that captures web pages to a local device or backend server for future retrieval, organization, annotation, and editing. This project inherits from the legacy Firefox add-on ScrapBook X.
Mozilla Public License 2.0

Some Essential Feature Request #343

Closed tathastu871 closed 1 year ago

tathastu871 commented 1 year ago

1) Capture links from clipboard/text file/list.
2) Capture regex-based links with permutation, e.g. https://site.com/page?=[1-9] would generate a link for each page from 1 to 9 and scrape them. This is essential if I have to scrape 100 pages, instead of writing an individual URL for each page.
3) Capture resources asynchronously, with parallel threading support.
4) Add support for injecting JavaScript snippets or bookmarklets before scraping each page.

E.g. I had a site where I had to execute a small JavaScript snippet to remove certain elements and then scrape the page.

It would help if the scrape could take as input sequential JavaScript functions or bookmarklets to be executed on each page,

or on specific pages matched by a regex (the way Tampermonkey userscripts execute only on sites defined by a regex).

tathastu871 commented 1 year ago

Also, add support for capturing only links rather than the whole page.

E.g. I open a site and capture links -> execute link capture again on the first group of links. (Screenshot: Screenshot_20230515-212415_Kiwi Browser)

Here, add another button, capture links, that will just scrape the href links from the listed pages; the resulting href links can then be captured again for another level of href links.

It is needed if the user doesn't want to capture entire pages but rather just recursively scrape links.

danny0838 commented 1 year ago
  1. Capture links from clipboard/text file/list

I don't see what "list" means. If it means something like links in a web page, you can select them and invoke Capture selected links for that.

Other cases can be handled by invoking Capture selected links and pasting the URLs into the dialog. I don't see much benefit in implementing an extra command for that.

  2. Capture regex-based links with permutation, e.g. https://site.com/page?=[1-9] would generate a link for each page from 1 to 9 and scrape them. This is essential if I have to scrape 100 pages, instead of writing an individual URL for each page.

The permuted URL list can be easily generated using Excel, OpenOffice Calc, Google Spreadsheet, etc., and applied by pasting it into the dialog of Capture selected links, as previously mentioned. I don't see a big benefit in implementing an extra command for that. Also, it's not easy to define a good placeholder syntax that doesn't conflict with a real URL.
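For illustration only (this is not a WebScrapBook feature), such a list can also be produced with a few lines of JavaScript in the browser console and then pasted into the Capture selected links dialog; the URL pattern below is just the example from the request:

```js
// Generate https://site.com/page?=1 through https://site.com/page?=9,
// one URL per line, ready to paste into the "Capture selected links" dialog.
const urls = [];
for (let i = 1; i <= 9; i++) {
  urls.push(`https://site.com/page?=${i}`);
}
console.log(urls.join('\n'));
```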

  3. Capture resources asynchronously, with parallel threading support

I don't get this. Please provide a more detailed description of the related real-world use cases.

  4. Add support for injecting JavaScript snippets or bookmarklets before scraping each page

E.g. I had a site where I had to execute a small JavaScript snippet to remove certain elements and then scrape the page.

It would help if the scrape could take as input sequential JavaScript functions or bookmarklets to be executed on each page,

or on specific pages matched by a regex (the way Tampermonkey userscripts execute only on sites defined by a regex).

Unfortunately this is NOT POSSIBLE, as the browser extension framework does not allow arbitrary JavaScript code execution due to security concerns, and any similar approach (such as embedding a JavaScript interpreter written in JavaScript) is also explicitly forbidden by the extension store's policy.

Some possible alternative approaches:

1) Use the capture helper. This is limited in functionality but should work for many useful cases. You can request an adequate extension of it for a good real-world use case. (A rough sketch of a helper configuration is shown after this list.)

2) Configure Tampermonkey/a userscript to do the automated programmatic web page modification when you visit a web page; see the userscript sketch after this list. (Addendum: script injection is only allowed for a content script, i.e. one that runs within the visited web page, NOT the captured web page content, and this feature will likely be removed by Manifest V3.)

3) Write your own extension (or a temporary extension) to do the automated programmatic web page modification and invoke a WebScrapBook capture through the external message API (an incomplete doc can be found here).
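For approach 1, a rough sketch of what a capture helper entry might look like, assuming the helpers option accepts a JSON array of objects with a pattern regex and a list of commands such as ["remove", selector]; please verify the exact format against the capture helper documentation. The site and selectors below are made-up placeholders:

```json
[
  {
    "description": "Strip unwanted elements before capture (placeholder selectors)",
    "pattern": "/example\\.com/",
    "commands": [
      ["remove", "#ads, .popup-overlay"]
    ]
  }
]
```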
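For approach 2, a minimal Tampermonkey userscript sketch; the match pattern and selectors are placeholders. It removes the unwanted elements whenever a matching page loads, so a capture of that tab afterwards sees the modified DOM:

```js
// ==UserScript==
// @name         Remove clutter before capturing (example)
// @match        https://example.com/*
// @grant        none
// ==/UserScript==

(function () {
  'use strict';
  // Placeholder selectors: adjust to the elements you want stripped
  // from the page before invoking a WebScrapBook capture of the tab.
  document.querySelectorAll('#ads, .popup-overlay').forEach((el) => el.remove());
})();
```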

Also, add support for capturing only links rather than the whole page.

E.g. I open a site and capture links -> execute link capture again on the first group of links.

Here, add another button, capture links, that will just scrape the href links from the listed pages; the resulting href links can then be captured again for another level of href links.

It is needed if the user doesn't want to capture entire pages but rather just recursively scrape links.

I don't get this. What do you mean by "capture only links"? If you mean capturing a bookmark, that can easily be achieved through the advanced mode of the Capture As dialog.

tathastu871 commented 1 year ago

There is always some error on Kiwi on Android: Fatal error: Failed to download "WebScrapBook/data/20230517102349918/index.html": Unable to download to the folder.

danny0838 commented 1 year ago

There is always some error on Kiwi on Android: Fatal error: Failed to download "WebScrapBook/data/20230517102349918/index.html": Unable to download to the folder.

This is an issue with the Kiwi browser and we cannot really fix it, but you can work around it by tweaking the capture options. See #295 for more details.

In the future, please raise unrelated issues in a new thread so that they can be properly tracked independently.