Princeton-CDH / derrida-django

Derrida's Margins - Python/Django web application
https://derridas-margins.princeton.edu
Apache License 2.0

Configure custom driver for browsertrix for visualization and deep zoom #296

Closed by kmcelwee 2 years ago

kmcelwee commented 2 years ago

Use the --driver flag or customize run.sh to point the crawl to a custom JS file (inspired by the default: https://github.com/webrecorder/browsertrix-crawler/blob/main/defaultDriver.js). We need a custom driver that can interact with the visualization page and deep zoom.

Notes from Ilya

module.exports = async ({data, page, crawler}) => {
  await crawler.loadPage(page, data);

  ...
  const moreUrls = ["https://example.com/a", "https://example.com/b", ...];
  crawler.queueInScopeUrls(data.seedId, moreUrls, data.depth);
};
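A slightly fuller sketch of the same pattern, for illustration only. It assumes the driver signature and the `crawler.loadPage` / `crawler.queueInScopeUrls` calls behave as in the note above (they may differ between browsertrix-crawler versions). The selector `a[data-deep-zoom]` and the helper `collectExtraUrls` are hypothetical and would need to match the real markup of the visualization and deep zoom pages.

```javascript
// Hypothetical selector for links the default crawl would miss; adjust to the real markup.
const EXTRA_LINK_SELECTOR = "a[data-deep-zoom]";

// Hypothetical helper: gather link targets from the loaded page.
async function collectExtraUrls(page, selector = EXTRA_LINK_SELECTOR) {
  // Puppeteer's page.$$eval runs the callback in the browser on all matching elements.
  const hrefs = await page.$$eval(selector, (els) => els.map((el) => el.href));
  // Drop empty values and duplicates before queueing.
  return [...new Set(hrefs.filter(Boolean))];
}

// Driver entry point, following the shape shown in the note above.
async function driver({ data, page, crawler }) {
  await crawler.loadPage(page, data);

  const moreUrls = await collectExtraUrls(page);
  if (moreUrls.length > 0) {
    crawler.queueInScopeUrls(data.seedId, moreUrls, data.depth);
  }
}

module.exports = driver;
```

Keeping the URL collection in its own function makes it possible to exercise that logic with a stubbed `page` object, without starting a crawl.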

Remaining Todos

kmcelwee commented 2 years ago

We have it working in puppeteer: https://gist.github.com/kmcelwee/cdbb6d2b4a5c2d9ac234d6de5db4716c

kmcelwee commented 2 years ago

When running the custom driver locally, we can use Docker volumes to override defaultDriver.js with our driver.

docker run -v $PWD/custom-driver.js:/app/defaultDriver.js -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://derridas-margins.princeton.edu/library/abraham-oeuvres-completes-1966/gallery/front-cover/ --limit 2 --generateWACZ --text --collection deep-zoom

kmcelwee commented 2 years ago

It will be easier to provide the extra info.json URLs to the crawl as a seed list than to queue them from the custom driver.
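One way such a seed list could be produced, as a rough sketch. It assumes the info.json files are IIIF image information documents served at `<image base URL>/info.json`, and that the crawl accepts a plain-text list with one URL per line. The function names and the example host are hypothetical.

```javascript
// Build the info.json URL for one image base URL.
// Assumption: IIIF Image API layout, i.e. {base}/info.json.
function infoJsonUrl(imageBaseUrl) {
  // Strip trailing slashes so we never produce "//info.json".
  return imageBaseUrl.replace(/\/+$/, "") + "/info.json";
}

// Turn a list of image base URLs into seed-list text, one URL per line.
function buildSeedList(imageBaseUrls) {
  const urls = imageBaseUrls.map(infoJsonUrl);
  // De-duplicate while keeping the original order.
  return [...new Set(urls)].join("\n") + "\n";
}

module.exports = { infoJsonUrl, buildSeedList };
```

The resulting text could be written to a file with `fs.writeFileSync` and passed to the crawl alongside the regular seeds.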

kmcelwee commented 2 years ago

Addressed by https://github.com/Princeton-CDH/cdh-ansible/pull/108

rlskoeser commented 2 years ago

@kmcelwee I think these are the relevant parts of ansible browsertrix role:

copy file — maybe just change the destination filename? https://github.com/Princeton-CDH/cdh-ansible/blob/main/roles/browsertrix/tasks/main.yml#L32-L37

crawl script — adjust command line argument https://github.com/Princeton-CDH/cdh-ansible/blob/main/roles/browsertrix/templates/crawl.sh.j2#L7