cldellow / datasette-scraper

Add website scraping abilities to Datasette
Apache License 2.0

plugin: extract-links #23

Closed: cldellow closed this issue 1 year ago

cldellow commented 1 year ago
      // Extract link graph
      "extract-links": {
        // optional; absent implies .*
        "url-regex": ".*",

        // optional
        "database": "dbname",

        // optional; defaults to dss_links
        "table": "dss_links",
      }

Needs the extract_from_response hook: https://github.com/cldellow/datasette-scraper#extract_from_responsescraper-config-url-response
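
To make the defaults in the config above concrete, here is a minimal sketch of how the three options could be resolved. It is not the plugin's actual code: `resolve_options` and `applies_to` are hypothetical names, and matching with `re.search` (rather than a full match) is an assumption, since the issue only says that an absent `url-regex` implies `.*`.

```python
import re

# Default table name, taken from the config comment in this issue.
DEFAULT_TABLE = "dss_links"


def resolve_options(options):
    """Fill in the documented defaults for an extract-links config dict.

    Hypothetical helper: the real plugin may resolve options differently.
    """
    return {
        # optional; absent implies .*
        "url_regex": re.compile(options.get("url-regex", ".*")),
        # optional; None here means "not specified" (the fallback database
        # is not stated in the issue, so it is left undecided)
        "database": options.get("database"),
        # optional; defaults to dss_links
        "table": options.get("table", DEFAULT_TABLE),
    }


def applies_to(options, url):
    """Return True if links should be extracted from this URL.

    Assumption: the regex is applied with re.search.
    """
    return resolve_options(options)["url_regex"].search(url) is not None
```

With an empty config, every URL qualifies and rows would go to `dss_links`; setting `"url-regex": "/blog/"` would restrict extraction to URLs containing `/blog/`.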

cldellow commented 1 year ago

Extracts, for each link, the page it was found on (from), the URL it points to (to), and its anchor text.

Future improvements: extract the main focus image and its alt text, if there is one, and record whether each link is dofollow or nofollow.
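
As an illustration of the from / to / anchor text extraction described in this comment, here is a rough sketch using only the Python standard library. It is an assumption about the approach, not the plugin's implementation: `LinkExtractor` and `extract_links` are hypothetical names, relative hrefs are resolved against the page URL with `urljoin`, and `<a>` tags without an `href` are skipped.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects (from_url, to_url, anchor_text) tuples from one HTML page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []
        self._href = None  # absolute href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they were found on.
                self._href = urljoin(self.page_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the anchor text.
            anchor = " ".join("".join(self._text).split())
            self.links.append((self.page_url, self._href, anchor))
            self._href = None


def extract_links(page_url, html):
    """Return a list of (from, to, anchor text) tuples for the given page."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Each tuple maps onto one row of the links table. A dofollow/nofollow column could later be derived from the `rel` attribute in `handle_starttag`, but that is left out here since it is listed as a future improvement.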