adbar / courlan

Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
https://adrien.barbaresi.eu/blog/easy-content-aware-url-filtering.html
Apache License 2.0

Support for custom user agents in is_live_page() #114

Open drFerg opened 2 months ago

drFerg commented 2 months ago

Hi!

We're currently using courlan via trafilatura for some crawling. When running liveness checks on a host's URL, we get blocked because of the user agent header, and we have no way to change it. I noticed that the redirection test used by is_live_page contains some commented-out code that references user agent headers.

Is there any interest in supporting custom headers, or in setting a different default user agent?

Thanks.

adbar commented 2 months ago

Hi @drFerg, definitely. Trafilatura supports custom user-agent settings, and courlan could do so as well. The config file approach could be replicated here.

Are you interested in drafting a pull request?
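
For illustration, here is a minimal sketch of a liveness check that accepts a caller-supplied user agent. It uses only the standard library and is not courlan's actual code: the names `build_head_request`, `is_live_with_user_agent` and `DEFAULT_USER_AGENT`, as well as the default string itself, are hypothetical.

```python
import http.client
import urllib.error
import urllib.request

# Hypothetical default value, not the user agent courlan actually sends.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; example-crawler/0.1)"


def build_head_request(url, user_agent=DEFAULT_USER_AGENT):
    """Build a HEAD request carrying the given User-Agent header."""
    return urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )


def is_live_with_user_agent(url, user_agent=DEFAULT_USER_AGENT, timeout=10):
    """Return True if the URL answers a HEAD request with a status below 400.

    Malformed URLs, network failures and HTTP error statuses all yield False.
    """
    try:
        request = build_head_request(url, user_agent)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, ValueError, OSError):
        return False
```

A caller would then pass its own string, for example `is_live_with_user_agent("https://example.org", user_agent="my-crawler/1.0")`. Reading that string from a config file, as suggested above, would only change where the `user_agent` argument comes from.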