danny0838 / webscrapbook

A browser extension that captures web pages to a local device or backend server for future retrieval, organization, annotation, and editing. This project inherits from the legacy Firefox add-on ScrapBook X.
Mozilla Public License 2.0

can you help me write a capture helper that will capture webpages to a particular folder automatically #362

Closed: grotesque closed this issue 8 months ago

grotesque commented 8 months ago

I tried reading the documentation, but I wasn't able to understand whether it's possible.

Capture helpers preprocess the captured content to improve the capture quality. They are defined using JSON syntax with the following format:

  [
    {
      "name": "AdsRemover",
      "description": "Remove unwanted content",
      "pattern": "/https?://example\\.com//i",
      "commands": [
        ["remove", "#ad-by-xxx"],
        ["unwrap", "#i-want-to-strip-this"],
        ...
      ]
    },
    {
      "name": "DeferredImageFixer",
      "description": "Save deferred images defined by data-*",
      "commands": [
        ["attr", {"css": "img[data-src]"}, "src", ["get_attr", null, "data-src"]],
        ["attr", {"css": "img[data-srcset]"}, "srcset", ["get_attr", null, "data-srcset"]],
        ...
      ]
    },
    ...
  ]

• name: a string that names this helper and identifies it when debugging, optional.
• description: a string to annotate this helper, optional.
• disabled: disables this helper when the value is truthy, optional.
• debug: enables debug mode for this helper when the value is truthy, optional.
• pattern: a regular expression string (“/expression/flags”; the same format applies wherever a regular expression string is used below) that restricts the URLs this helper applies to. The helper applies to all pages when this is omitted.
• commands: Commands to run in order.
  • Available commands are:
    • ["html", selector, htmlText]: for each element matched by selector, set HTML content to htmlText.
    • ["text", selector, text]: for each element matched by selector, set text content to text.
    • ["attr", selector, name, value]: for each element matched by selector, set attribute name to value (or null to delete attribute).
    • ["attr", selector, [[name1, value1], [name2, value2], ...]]: for each element matched by selector, set multiple attributes.
    • ["attr", selector, {name1: value1, name2: value2, ...}]: for each element matched by selector, set multiple attributes.
    • ["css", selector, name, value, priority]: for each element matched by selector, set CSS property name to value (or null to delete property) with priority ("important" or "" or omitted).
    • ["css", selector, [[name1, value1], [name2, value2], ...]]: for each element matched by selector, set multiple CSS properties.
    • ["css", selector, {name1: value1, name2: value2, ...}]: for each element matched by selector, set multiple CSS properties.
    • ["remove", selector]: remove each node matched by selector.
    • ["unwrap", selector]: unwrap (remove but keep child nodes) each node matched by selector.
    • ["insert", selector, nodeData, mode, index]: for each node matched by selector as reference, insert a node defined by nodeData, with insertion method specified by mode and index.
      • nodeData can be a string (to insert a text node) or a JSON object with the following properties:
        • name: name of the element node (e.g. “span”), or “#text” for a text node, or “#comment” for a comment node.
        • value: value (text content) of the node.
        • attrs: attributes of the element node, as [[name1, value1], [name2, value2], ...] or {name1: value1, name2: value2, ...}.
        • children: child nodes of the element node, as [nodeData1, nodeData2, ...].
      • mode can be:
        • "before": insert before the reference node. 
        • "after": insert after the reference node.
        • "insert": insert as the index-th child node of the reference element node (starting from 0).
        • "append" (default): insert as the last child node of the reference element node.
    • ["isolate", selector]: remove every node except those matched by selector and their ancestors and descendants. For HTML documents, nodes outside the BODY element (such as the HEAD element) are not affected.
    • ["for", selector, command1, command2, ...]: for each node matched by selector, run command# in order.
    • ["options", name, value]: overwrite capture option name with value, with the same format as the exported options (limited to “capture.*”).
    • ["options", [[name1, value1], [name2, value2], ...]]: overwrite multiple capture options.
    • ["options", {name1: value1, name2: value2, ...}]: overwrite multiple capture options.
  • A parameter of a command can also be a command, and the commands below can be used to get a value:
    • ["has_node", selector]: whether selector matches at least one node.
    • ["has_attr", selector, name]: whether the first element matched by selector has the attribute name.
    • ["get_html", selector]: HTML content of the first element matched by selector.
    • ["get_text", selector]: text content of the first element matched by selector.
    • ["get_attr", selector, name]: value of attribute name of the first element matched by selector.
    • ["get_css", selector, name, getPriority]: value of style name of the first element matched by selector. Get style priority instead if getPriority is truthy.
    • ["match", text, pattern, index]: whether text matches the regular expression string pattern. If index is an integer, return the index-th subgroup of the match instead (0 for the whole matched string).
    • ["replace", text, pattern, replacement]: result of the string text with regular expression pattern replaced with replacement.
    • ["if", cond, thenValue, elseValue]: return thenValue if cond is truthy, or elseValue otherwise.
    • ["and", value1, value2, ...]: return the first falsy value in value#, or the last value.
    • ["or", value1, value2, ...]: return the first truthy value in value#, or the last value.
    • ["concat", value1, value2, ...]: concatenate multiple string values.
    • ["slice", text, beginIndex, endIndex]: return the extracted substring from beginIndex to endIndex of the string text.
    • ["upper", text]: text with letters converted to upper case.
    • ["lower", text]: text with letters converted to lower case.
    • ["encode_uri", text, safe]: percent-encode the string text to be used in a URI, with every char in string safe excluded from encoding.
    • ["decode_uri", text]: decode the percent-encoded string text.
    • ["add", value1, value2, ...]: arithmetic addition of value1 + value2 + ....
    • ["subtract", value1, value2, ...]: arithmetic subtraction of value1 - value2 - ....
    • ["multiply", value1, value2, ...]: arithmetic multiplication of value1 * value2 * ....
    • ["divide", value1, value2, ...]: arithmetic division of value1 / value2 / ....
    • ["mod", value1, value2, ...]: arithmetic modulo operation of value1 % value2 % ....
    • ["power", value1, value2, ...]: arithmetic exponentiation of value1 ^ value2 ^ ....
  • A selector can be specified in many commands to select nodes relative to the current reference node. The reference node of each command in commands is initially the root node of the captured page (usually the HTML element). Each node matched by the selector of a command X becomes the reference node of every command Y that follows the selector parameter within command X. The value of a selector can be one of:
    • "self" or null: to select the current reference node.
    • "parent": to select the parent of the reference node.
    • "root": to select the root node.
    • "<CSS selector value>": to match nodes using a CSS selector.
    • {"css": "<CSS selector value>"}: to match nodes using a CSS selector.
    • {"xpath": "<XPath value>"}: to match nodes using an XPath.
  • When debug mode is enabled, commands whose name is prefixed with “*” are logged to the console. For example: ["*get_attr", null, "data-src"]
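
As a rough illustration of how the reference node and the value commands described above fit together, here is a minimal sketch. The helper name, the URL pattern, and every selector and attribute name (`img[data-src]`, `a[href]`, `data-kind`, the `ad` class) are assumptions made up for this example; only the command names and parameter order are taken from the list above.

```json
[
  {
    "name": "ReferenceNodeSketch",
    "description": "Hypothetical example: pattern, selectors, and attribute names are assumptions",
    "pattern": "/https?://example\\.com//i",
    "commands": [
      ["for", "img[data-src]",
        ["attr", null, "src", ["get_attr", null, "data-src"]],
        ["attr", "self", "data-src", null]
      ],
      ["for", {"css": "a[href]"},
        ["attr", null, "data-kind",
          ["if", ["match", ["get_attr", null, "href"], "/\\.pdf$/i"], "pdf", "other"]
        ]
      ],
      ["remove", {"xpath": "//div[@class='ad']"}]
    ]
  }
]
```

Inside each `for`, `null` and `"self"` refer to the element matched by the `for` selector, so the nested `attr` commands act on one image or link at a time. The final `remove` is a top-level command, so its reference node is the root node again.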

When an error occurs or debug mode is enabled, details are available in the console of the captured page (or the capture dialog when capturing source).
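
A minimal sketch of a helper with debug mode switched on, to show where the “*” prefix goes. The helper name and the `img[data-src]` selector are assumptions for illustration; the `debug` property and the `*get_attr` form come from the description above.

```json
[
  {
    "name": "DebugSketch",
    "description": "Hypothetical example: logs the value read from data-src while debugging",
    "debug": true,
    "commands": [
      ["attr", "img[data-src]", "src", ["*get_attr", null, "data-src"]]
    ]
  }
]
```

With `debug` set to a truthy value, the prefixed `get_attr` command should be logged in the console of the captured page, while the unprefixed `attr` command should not.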

danny0838 commented 8 months ago

This is not supported, as the parent item is not a capture option and can be changed by how the capture is performed.
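
For context, the `options` command in the quoted documentation can only overwrite `capture.*` options, which is why it cannot choose the parent item. A minimal sketch of what it can do, assuming `capture.saveAs` is an option name and `"zip"` one of its values in your exported options (both are assumptions here and should be checked against an actual export):

```json
[
  {
    "name": "OptionsSketch",
    "description": "Hypothetical example: assumes capture.saveAs with value zip exists in your exported options",
    "pattern": "/https?://example\\.com//i",
    "commands": [
      ["options", "capture.saveAs", "zip"]
    ]
  }
]
```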

grotesque commented 8 months ago

Can there be a setting that captures all pages to a particular folder automatically?

danny0838 commented 8 months ago

That's not easy. As the "default item" is scrapbook-dependent, it would best be a server-side config, and it would require a large code rework anyway.

grotesque commented 8 months ago

okay. Thanks