Closed jodizzle closed 6 years ago
Yeah, I think all of the Google login/account ignores belong in the global igset, but I'm worried about
^https?://lh4\.googleusercontent\.com/proxy/[^/]+
^https?://plus\.google\.com/_/scs/apps-static/
because I have no idea what those are intended to block.
I was curious about those lines as well. As a basic test, I just tried running grab-site on a sample Google Plus site for a few thousand requests with --igon (I did not include my changes from this PR). There were no hits for ^https?://lh4\.googleusercontent\.com/proxy/[^/]+ and only two for ^https?://plus\.google\.com/_/scs/apps-static/ :
https://plus.google.com/_/scs/apps-static/_/ss/k=oz.home.1df6en7rj4dq.L.F4.O/am=MA4E/d=0/rs=AGLTcCMRRwGO1a4BDKKinmk2-aGOPDXU2w
https://plus.google.com/_/scs/apps-static/_/js/k=oz.home.en_US.gNBtXLBo2Sk.O/m=b,evt/am=MA4E/rt=j/d=1/rs=AGLTcCP0LI3Y6UzZy-VMOD3yJauPJAfRhw
Looks like some miscellaneous page requisites?
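For anyone who wants to repeat this kind of check offline, here is a minimal sketch that counts how many URLs each ignore pattern would hit. It assumes the patterns are applied with a plain re.search against the full URL, which may not be exactly how grab-site evaluates its igsets. The count_hits helper is hypothetical and not part of grab-site; the URL list holds only the two hits quoted above.

```python
import re

# The two ignore patterns under discussion, copied from the googleplus igset.
IGNORE_PATTERNS = [
    r"^https?://lh4\.googleusercontent\.com/proxy/[^/]+",
    r"^https?://plus\.google\.com/_/scs/apps-static/",
]

# The two URLs that showed up as ignored in the test crawl.
URLS = [
    "https://plus.google.com/_/scs/apps-static/_/ss/k=oz.home.1df6en7rj4dq.L.F4.O/am=MA4E/d=0/rs=AGLTcCMRRwGO1a4BDKKinmk2-aGOPDXU2w",
    "https://plus.google.com/_/scs/apps-static/_/js/k=oz.home.en_US.gNBtXLBo2Sk.O/m=b,evt/am=MA4E/rt=j/d=1/rs=AGLTcCP0LI3Y6UzZy-VMOD3yJauPJAfRhw",
]


def count_hits(patterns, urls):
    """Return a dict mapping each pattern string to the number of URLs it matches.

    Hypothetical helper: assumes plain re.search semantics, which is an
    assumption about how the ignores are applied, not grab-site's own code.
    """
    hits = {}
    for pattern in patterns:
        regex = re.compile(pattern)
        hits[pattern] = sum(1 for url in urls if regex.search(url))
    return hits


hits = count_hits(IGNORE_PATTERNS, URLS)
for pattern, count in hits.items():
    print(f"{count:3d}  {pattern}")
```

Feeding it the full list of requested URLs from a crawl log instead of just these two would give the same kind of per-pattern tally as the manual test.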
Thanks for checking that. I have added your ignores in 5442414d2856b8580b7a399200b78077ba2927b5; please let me know if you find any more.
Sure, sounds good. Thanks!
These URLs typically redirect to a '/ServiceLogin' URL, but it's good to prevent the excess requests.
As a side note, I'm wondering if it would be a good idea to rename the googleplus igset to something more general, or to move parts of it into the global igset? These changes were inspired by trying to grab a sites.google.com site.