Open saumier opened 3 weeks ago
Notes: The crawl works fine on a local machine, but fails when it runs on a GitHub runner.
Task for @dev-aravind - Add the user-agent header to all steps of the crawling process, including fetching entity URLs and fetching entity details (in both headless and headful mode).
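For reference, here is a minimal sketch of what "user-agent on every step" could look like. It is in Python for illustration only and is not the crawler's actual code. The helper name `build_request` and the example URL are hypothetical; only the agent string comes from this thread. The idea is to define the header once and reuse it for every fetch, so that no step falls back to a default agent.

```python
from urllib.request import Request

# The product token and version are taken from this thread.
CRAWLER_NAME = "artsdata-crawler"
CRAWLER_VERSION = "3.3.0"
USER_AGENT = f"{CRAWLER_NAME}/{CRAWLER_VERSION}"


def build_request(url: str) -> Request:
    """Hypothetical helper: build a request that always carries the crawler User-Agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})


# Example URL is a placeholder, not a real crawl target.
req = build_request("https://example.org/events")
# urllib normalizes header names to "User-agent" internally.
print(req.get_header("User-agent"))
```

A headless browser step would need the same string passed through that browser library's own user-agent option, since it does not go through this request helper.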
@saumier will try to contact the Spec.qc.ca developer team to ask them to allow our user-agent to crawl their website.
@troughc I sent you an email for Isabelle to ask her tech team to allow the Artsdata crawler user agent "artsdata-crawler/3.3.0".
Additional note: The Artsdata crawler user agent is "artsdata-crawler/3.3.0". However, the tech teams have been told to match only on "artsdata-crawler", because the version number (currently 3.3.0) changes with each update.
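To illustrate the matching rule described above, here is a rough sketch of an allow-list check that ignores the version. This is only an assumption about how a site might implement it, not how Spec.qc.ca actually filters traffic. The function name `is_artsdata_crawler` is hypothetical.

```python
ALLOWED_TOKEN = "artsdata-crawler"


def is_artsdata_crawler(user_agent: str) -> bool:
    """Hypothetical check: match the product token only, ignoring the version suffix."""
    return ALLOWED_TOKEN in user_agent.lower()


# The current version and a made-up future version both match.
print(is_artsdata_crawler("artsdata-crawler/3.3.0"))
print(is_artsdata_crawler("artsdata-crawler/3.4.0"))
# An unrelated agent does not match.
print(is_artsdata_crawler("Mozilla/5.0"))
```

Matching on the full string "artsdata-crawler/3.3.0" would start rejecting the crawler as soon as the version is bumped.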
The email was sent.
@saumier The user-agent is now added to every step.
@dev-aravind The tech teams have been told to match only on "artsdata-crawler", because the version number (currently 3.3.0) changes with each update.
@fjjulien Please let me know if you hear anything from Isabelle at Spec about our crawler being allowed in. Once the Artsdata crawler is allowed in, I will run another crawl of their event JSON-LD.
When running the workflow for spec.qc.ca the system exits with an error:
Max retries reached. Unable to fetch the content for page .
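The message above does not say which page failed or why. As a diagnostic idea only, here is a sketch of a retry wrapper that includes the URL and the last underlying error in its final message. It is in Python for illustration and is not the crawler's real retry code. The names `fetch_with_retries` and `fetch` are hypothetical, and the fetch function is passed in so the sketch runs without network access.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    max_retries: int = 3,
    delay: float = 0.0,
) -> str:
    """Hypothetical wrapper: retry `fetch(url)` and report the last error on failure."""
    last_error: Optional[Exception] = None
    for _attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception as exc:  # Broad on purpose: this is a diagnostic sketch.
            last_error = exc
            time.sleep(delay)
    raise RuntimeError(
        f"Max retries reached. Unable to fetch the content for page {url}. "
        f"Last error: {last_error!r}"
    )
```

With the last error in the message, a blocked user agent (for example an HTTP 403) could be told apart from a timeout or an empty page URL.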