ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

googleplus igset: Ignore more login URLs #109

Closed by jodizzle 6 years ago

jodizzle commented 6 years ago

These URLs typically redirect to a '/ServiceLogin' URL, but it's good to prevent the excess requests.

As a side note, I'm wondering if it would be a good idea to rename the googleplus igset to something more general? Or move parts of it into the global igset? These changes were inspired by trying to grab a sites.google.com site.

ivan commented 6 years ago

Yeah, I think all of the Google login/account ignores belong in the global igset, but I'm worried about

^https?://lh4\.googleusercontent\.com/proxy/[^/]+
^https?://plus\.google\.com/_/scs/apps-static/

because I have no idea what those are intended to block.

jodizzle commented 6 years ago

I was curious about those lines as well. As a basic test, I just tried running grab-site on a sample google plus site for a few thousand requests with --igon (I did not include my changes from this PR). There were no hits for ^https?://lh4\.googleusercontent\.com/proxy/[^/]+, and only two for ^https?://plus\.google\.com/_/scs/apps-static/:

https://plus.google.com/_/scs/apps-static/_/ss/k=oz.home.1df6en7rj4dq.L.F4.O/am=MA4E/d=0/rs=AGLTcCMRRwGO1a4BDKKinmk2-aGOPDXU2w
https://plus.google.com/_/scs/apps-static/_/js/k=oz.home.en_US.gNBtXLBo2Sk.O/m=b,evt/am=MA4E/rt=j/d=1/rs=AGLTcCP0LI3Y6UzZy-VMOD3yJauPJAfRhw

Looks like some miscellaneous page requisites?
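
For anyone who wants to repeat this kind of check offline, here is a minimal sketch in Python that counts how many URLs each ignore pattern matches. The two patterns and the two URLs are copied from this thread. The helper name count_hits is hypothetical, and the sketch assumes the patterns behave as ordinary Python regular expressions applied from the start of the URL. It is not grab-site's own matching code.

```python
import re

# The two googleplus igset patterns quoted earlier in this thread.
PATTERNS = [
    r"^https?://lh4\.googleusercontent\.com/proxy/[^/]+",
    r"^https?://plus\.google\.com/_/scs/apps-static/",
]

# The two URLs reported as hits in the test crawl.
URLS = [
    "https://plus.google.com/_/scs/apps-static/_/ss/k=oz.home.1df6en7rj4dq.L.F4.O"
    "/am=MA4E/d=0/rs=AGLTcCMRRwGO1a4BDKKinmk2-aGOPDXU2w",
    "https://plus.google.com/_/scs/apps-static/_/js/k=oz.home.en_US.gNBtXLBo2Sk.O"
    "/m=b,evt/am=MA4E/rt=j/d=1/rs=AGLTcCP0LI3Y6UzZy-VMOD3yJauPJAfRhw",
]


def count_hits(patterns, urls):
    """Hypothetical helper: map each pattern to the number of URLs it matches."""
    compiled = [(p, re.compile(p)) for p in patterns]
    return {p: sum(1 for u in urls if rx.search(u)) for p, rx in compiled}


hits = count_hits(PATTERNS, URLS)
for pattern, n in hits.items():
    print(n, pattern)
```

On these two URLs the apps-static pattern matches both and the lh4 proxy pattern matches neither, which agrees with the counts above. To check a real crawl, URLS would be replaced with the URL list from that crawl.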

ivan commented 6 years ago

Thanks for checking that. I have added your ignores in 5442414d2856b8580b7a399200b78077ba2927b5; please let me know if you find any more.

jodizzle commented 6 years ago

Sure, sounds good. Thanks!