ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Add SimpleMachineForum ignores to `forums` igset #203

Closed by TheTechRobo 2 years ago

TheTechRobo commented 2 years ago

fixes #201.

Please test! I'm testing it myself, but two pairs of eyes are better than one :-)

I'm going to look into whether profiles are always restricted to registered users, or only sometimes. If they always are, I'll consider adding them to the igset. Never mind, people can add cookies to grab-site - forgot about that.

ivan commented 2 years ago

Thanks. Do you have some links to SMF sites I can test this on?

Do you know if printpage should be excluded for being completely redundant? (Some forum software puts all the posts in a thread onto the printpage without pagination, which can be useful.)

TheTechRobo commented 2 years ago

I've got two:

https://www.nextcomputers.org/forums/index.php

and

http://72dpiarmy.supersanctuary.net/

BTW, the latter does not work on https, at least on my FF Developer Edition. Seems it uses a very old security suite, which lines up with the fact that the game it is about hasn't been updated for about a decade :P

TheTechRobo commented 2 years ago

Oh, yeah - you're right! printpage gets rid of pagination.

Should I remove it from the igset?

(Looks like my current two crawls just won't have printpage, which isn't the worst thing tho). I'm not restarting them - too far in lmao

ivan commented 2 years ago

> Should I remove it from the igset?

yes

TheTechRobo commented 2 years ago

Done. Any other thoughts?

ivan commented 2 years ago

Thanks!
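
For readers following along, here is a rough sketch of what the printpage decision above means in practice. It assumes an igset is a list of regular expressions that are searched against each URL, and that SMF URLs look like `index.php?action=printpage;topic=42.0`. The patterns, the `is_ignored` helper, and the `forum.example` URLs are all hypothetical illustrations, not the patterns merged in this PR.

```python
import re

# Hypothetical ignore patterns for illustration only; these are NOT the
# patterns added to the `forums` igset by this PR.
SMF_IGNORES = [
    r"[?;&]action=(?:profile|login|register)\b",
]

# The pattern the thread decided to leave out, since printpage can show a
# whole thread without pagination.
PRINTPAGE = r"[?;&]action=printpage\b"


def is_ignored(url, patterns):
    """Return True if any pattern matches somewhere in the URL."""
    return any(re.search(p, url) for p in patterns)


# Made-up URLs in the assumed SMF query-string style.
urls = [
    "https://forum.example/index.php?action=printpage;topic=42.0",
    "https://forum.example/index.php?action=profile;u=7",
    "https://forum.example/index.php?topic=42.0",
]

# URLs that would still be crawled if printpage were ignored.
kept_if_printpage_ignored = [
    u for u in urls if not is_ignored(u, SMF_IGNORES + [PRINTPAGE])
]

# URLs that would still be crawled once printpage is dropped from the list.
kept_after_removal = [u for u in urls if not is_ignored(u, SMF_IGNORES)]
```

With the printpage pattern dropped, the printpage URL stays in the crawl alongside the normal topic page, which matches the outcome agreed on above.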