-
We have some "new" (some are a few months old ...) CLI arguments of Browsertrix Crawler to consider:
```
--seedFile, --urlFile If set, read a list of seed urls, on
…
-
**URL**
[https://www.instagram.com/elsdietvorst18](https://www.instagram.com/elsdietvorst18)
**Describe the bug**
Instagram behaviour only opens the first post of the row and ignores the two othe…
-
Recently came across the [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler) project which seems to be using Brave Browser for crawls. Some of its features include `Support for c…
-
I'm using browsertrix to scrape a soon-to-be offline service at my university, and I wanted to share some gotchas I encountered. (I'll update this list when I encounter new issues.)
### Preserve se…
-
When I crawl a website with mostly static resources, I've noticed that resources can be missing from the resulting WARC. The cause is either broken links or timeouts.
I have written tools to…
-
Currently, it seems screenshots are taken before custom behaviors run.
It would be very useful to be able to take a screenshot after custom behaviors have run, for example to capture a screenshot after removing the "acc…
-
### Recipe URL
https://farm.openzim.org/recipes/bibnum_fr_all
### Last log lines
```true
----------
Testing warc2zim args
Running: warc2zim --favicon=https://drive.farm.openzim.org/Corrected%20Lo…
-
Rather than our own `webrender-api`, consider switching to https://github.com/webrecorder/browsertrix-crawler
The integration pattern is somewhat different to Browsertrix's primary use case, but it…
-
Here is the druid with wacz file:
https://argo.stanford.edu/view/druid:bc725wm6775
The seed in SWAP
https://swap.stanford.edu/was/20240118154547/https://eastwindezine.com/
You can find Vimeo …
-
The [scrapy-playwright](https://github.com/scrapy-plugins/scrapy-playwright) project appears well supported and can supersede the current Selenium Hub approach (see e.g. [proxy support](https://github…