biolds / sosse

Selenium Open Source Search Engine & crawler
GNU Affero General Public License v3.0

Wayback Machine links completely break crawling #3

Open NinCollin opened 8 months ago

NinCollin commented 8 months ago

I'm having an issue where Wayback Machine links break crawling on completely unrelated pages. This page has links to two Wayback Machine pages, this one and this one.

After crawling the page with those links, subsequent websites fail to be crawled, with an error message pertaining to the two Wayback Machine links, even though the sites the error occurs on are completely unrelated and not even on the same domain. SOSSE also fails to cache them.

Below are some screenshots showing that the error is unrelated to the pages that failed to crawl: [image] [image]

biolds commented 8 months ago

It seems the crawler has reached a broken state because a previously crawled page had bogus links (most likely the Wayback Machine page, indeed). As a work-around, you could probably recrawl the Wayback Machine pages using Python Requests instead of Chromium. The tilde.town pages can most likely be recrawled as is after restarting the crawler. Otherwise, I'll have a look tonight to fix the root of the issue.
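For illustration, the fallback idea in this work-around could be sketched roughly as below. This is not SOSSE's actual code: `fetch_with_fallback`, `FetchError`, and `requests_fetch` are hypothetical names, and the sketch assumes each fetcher is a plain callable that takes a URL and returns HTML. Only `requests.get`, `raise_for_status`, and `.text` come from the real Requests library.

```python
class FetchError(Exception):
    """Raised when every fetcher failed for a URL (hypothetical helper)."""


def fetch_with_fallback(url, fetchers):
    """Try each (name, fetch) pair in order.

    Returns (name, html) from the first fetcher that succeeds, so a
    broken browser session does not block pages that a plain HTTP
    fetch could still retrieve.
    """
    errors = []
    for name, fetch in fetchers:
        try:
            return name, fetch(url)
        except Exception as exc:  # a broken browser session can raise many error types
            errors.append(f"{name}: {exc}")
    raise FetchError(f"all fetchers failed for {url}: " + "; ".join(errors))


def requests_fetch(url):
    """Plain HTTP fetch with no browser involved.

    Assumes the third-party `requests` package is installed; it is
    imported lazily so the rest of the sketch loads without it.
    """
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text
```

A caller would pass something like `[("chromium", browser_fetch), ("requests", requests_fetch)]`, where `browser_fetch` stands in for whatever Selenium-based fetch the crawler uses. Note that pages needing JavaScript rendering may come back incomplete from the plain fetch.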

biolds commented 8 months ago

It looks like a bug in Selenium; I have opened https://github.com/SeleniumHQ/selenium/issues/12906 for it. I'll implement a work-around in the meantime.

Edit: The bug is actually in Chromedriver; I have opened another issue there: https://bugs.chromium.org/p/chromedriver/issues/detail?id=4589

biolds commented 8 months ago

@NinCollin I have released a new version that adds support for crawling with Firefox, so Wayback Machine pages can now be crawled!