ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Add SimpleMachineForums igsets #201

Closed TheTechRobo closed 2 years ago

TheTechRobo commented 2 years ago

I might do this sometime, but I'm filing an issue anyway to keep organised. (Plus, I might not do it...I don't really know.)

The forums igset should get some ignores for Simple Machines Forum (SMF).

Examples:

index.php\?action=(verificationcode|printpage|reporttm|emailuser|quickmod2)

verificationcode is just a CAPTCHA (example: http://72dpiarmy.supersanctuary.net/index.php?action=verificationcode;vid=post;rand=eda0e44af72d37c458c7d5369931d365); printpage is simply a transcript of the thread (example: http://72dpiarmy.supersanctuary.net/index.php?action=printpage;topic=10006.0); reporttm is the "report to mods" page (example: http://72dpiarmy.supersanctuary.net/index.php?action=reporttm;topic=26.6;msg=18416); and emailuser sends a link to the thread to an email address (example: http://72dpiarmy.supersanctuary.net/index.php?action=emailuser;sa=sendtopic;topic=26.0). quickmod2 (example: http://72dpiarmy.supersanctuary.net/index.php?action=quickmod2;topic=994.930) is simply an error page.
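
A quick way to sanity-check the proposed pattern before it goes into an igset is to run it against the example URLs. The sketch below is only that: it assumes ignore patterns are applied as unanchored regular-expression searches against the full URL, and the `is_ignored` helper and the "keep" URL are hypothetical, made up for illustration. The pattern itself is copied verbatim from the proposal.

```python
import re

# Pattern copied verbatim from the proposal in this issue.
SMF_IGNORE = re.compile(
    r"index.php\?action=(verificationcode|printpage|reporttm|emailuser|quickmod2)"
)

# Example URLs quoted in this issue; each one should be ignored.
SHOULD_IGNORE = [
    "http://72dpiarmy.supersanctuary.net/index.php?action=verificationcode;vid=post;rand=eda0e44af72d37c458c7d5369931d365",
    "http://72dpiarmy.supersanctuary.net/index.php?action=printpage;topic=10006.0",
    "http://72dpiarmy.supersanctuary.net/index.php?action=reporttm;topic=26.6;msg=18416",
    "http://72dpiarmy.supersanctuary.net/index.php?action=emailuser;sa=sendtopic;topic=26.0",
    "http://72dpiarmy.supersanctuary.net/index.php?action=quickmod2;topic=994.930",
]

# Hypothetical ordinary topic URL (not from the issue) that should still be crawled.
SHOULD_KEEP = [
    "http://72dpiarmy.supersanctuary.net/index.php?topic=10006.0",
]


def is_ignored(url: str) -> bool:
    """Return True if the URL matches the proposed SMF ignore pattern.

    Hypothetical helper: assumes an unanchored search over the whole URL.
    """
    return SMF_IGNORE.search(url) is not None


if __name__ == "__main__":
    for url in SHOULD_IGNORE + SHOULD_KEEP:
        print("ignore" if is_ignored(url) else "keep  ", url)
```

Note that the `.` in `index.php` is unescaped in the proposal, so it matches any character there; that is harmless for these URLs, but `index\.php` would be stricter.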