InternetHealthReport / internet-yellow-pages

A knowledge graph for the Internet
https://iyp.iijlab.net
GNU General Public License v3.0

Citizenlab crawler not working anymore #62

Closed romain-fontugne closed 1 year ago

romain-fontugne commented 1 year ago

Describe the bug: The Citizen Lab crawler no longer pushes data to the database.

To Reproduce: In recent dumps (https://exp1.iijlab.net/wip/iyp/dumps/2023/07/22/iyp-2023-07-22.dump) the following query returns no results:

MATCH p = (:URL)-[:CATEGORIZED]-(:Tag) return p

Running the crawler does not add any new data either:

python -m iyp.crawlers.citizenlab.urldb

Additional context: I'm not certain, but GitHub seems to respond to web scraping differently than before. I think we should use the GitHub API, which is more reliable than reading the HTML from GitHub. See iyp/crawlers/inetintel/as_org.py for an example.
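
As a rough illustration of the API-based approach, here is a minimal sketch that lists CSV files in a repository directory through GitHub's REST "repository contents" endpoint instead of parsing HTML. It is not the code from iyp/crawlers/inetintel/as_org.py, and the function names are hypothetical. The repository and path in the usage note (`citizen-lab/test-lists`, `lists`) are assumptions about where the crawler's source data lives.

```python
import json
from urllib.request import Request, urlopen

# GitHub REST endpoint that returns a JSON listing of a repository directory.
API_URL = 'https://api.github.com/repos/{owner}/{repo}/contents/{path}'


def csv_download_urls(entries):
    """Return the download URLs of the CSV files in a contents listing.

    `entries` is the decoded JSON list returned by the contents endpoint;
    each item carries (among others) 'type', 'name' and 'download_url'.
    """
    return [
        entry['download_url']
        for entry in entries
        if entry.get('type') == 'file' and entry.get('name', '').endswith('.csv')
    ]


def list_csv_files(owner, repo, path):
    """Query the GitHub API and return raw URLs of CSV files under `path`.

    Hypothetical helper: unauthenticated requests are rate-limited, so a
    real crawler may want to send a token in an Authorization header.
    """
    request = Request(
        API_URL.format(owner=owner, repo=repo, path=path),
        headers={'Accept': 'application/vnd.github+json'},
    )
    with urlopen(request, timeout=30) as response:
        return csv_download_urls(json.load(response))
```

Assuming the lists are stored under `lists/` in `citizen-lab/test-lists`, a call such as `list_csv_files('citizen-lab', 'test-lists', 'lists')` would return the raw file URLs to fetch. Keeping the filtering in `csv_download_urls` separate from the network call makes it possible to test the parsing without hitting the API.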