ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Regexp exclusion problem #169

manueldeprada closed this issue 4 years ago

manueldeprada commented 4 years ago

As seen in https://github.com/ArchiveTeam/grab-site/blob/8343916e3c9500074c5f5a46ee60ad3f75bba775/libgrabsite/wpull_hooks.py#L28

You use Google's re2 to compile regular expressions.

The problem is that re2 does not support lookaround assertions (lookahead or lookbehind).

So if I add a pattern like this to my ignore list: ^https?://mydomain\.com/blabla/view\.php\?id=(?!4351).*

The app will crash, since the regexp cannot be compiled.
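For reference, here is a minimal sketch of what that pattern is meant to do, using Python's standard `re` module (which does support negative lookahead). The helper name `is_ignored` and the id 1234 are made up for illustration; the pattern itself is the one quoted above.

```python
import re

# The ignore pattern from the report: match every view.php?id=... URL
# except those whose id starts with 4351.
IGNORE = r"^https?://mydomain\.com/blabla/view\.php\?id=(?!4351).*"
ignore_re = re.compile(IGNORE)  # the stdlib engine accepts (?!...)


def is_ignored(url):
    """Hypothetical helper: True if the URL matches the ignore pattern."""
    return ignore_re.search(url) is not None


wanted = "http://mydomain.com/blabla/view.php?id=4351"
other = "http://mydomain.com/blabla/view.php?id=1234"

print(is_ignored(wanted))  # the lookahead rejects this URL, so it is kept
print(is_ignored(other))   # any other id matches, so it is ignored
```

Note that the lookahead as written also exempts any id that merely starts with 4351, such as 43510; anchoring it as `(?!4351$)` would be stricter.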

How did I get here?

I want to grab about 20 URLs that look like mydomain.com/blabla/view.php?id=xxx. I can list those 20 URLs in the file I pass via the input-file argument.

But I want to ignore all the other view.php?id=xxx URLs (there are hundreds, and all of them are interlinked).

It would be desirable for the input-file argument to override the ignores, but it doesn't. So if I tell grab-site to ignore all the URLs that look like mydomain.com/blabla/view.php?id=xxx, nothing gets downloaded.

So I came up with the regular expression shown at the beginning. I don't know how else to achieve my goal, short of changing the regex engine myself.

Any help would be much appreciated.

ivan commented 4 years ago

Thanks for the report. I used re2 for performance reasons only, so I believe it is fine to fall back to re when re2 fails to compile the combined regexp. Could you please tell me if 087e14517505057ceea3e0aff891dfd9529ec1b8 fixes your issue? I tagged it in 2.2.0.
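The fallback described here might look roughly like the sketch below. This is not the code from the commit: the function name `compile_combined` is hypothetical, and it assumes the installed re2 binding exposes `re2.compile` and raises an exception on syntax it cannot handle.

```python
import re

try:
    import re2  # optional fast engine; assumed to mirror re.compile
except ImportError:
    re2 = None


def compile_combined(patterns):
    """Join ignore patterns into one regexp; prefer re2, fall back to re."""
    combined = "|".join(f"(?:{p})" for p in patterns)
    if re2 is not None:
        try:
            return re2.compile(combined)
        except Exception:
            # re2 rejects lookaround and backreferences; use the slower
            # but more permissive stdlib engine instead.
            pass
    return re.compile(combined)


ignores = [
    r"^https?://mydomain\.com/blabla/view\.php\?id=(?!4351).*",
    r"\.iso$",
]
regex = compile_combined(ignores)
print(bool(regex.search("http://mydomain.com/blabla/view.php?id=1234")))
```

The trade-off is that a single unsupported pattern moves the whole combined regexp onto the slower engine, which seems acceptable given that re2 was chosen for speed only.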

manueldeprada commented 4 years ago

It worked flawlessly, thank you very much!!!

For future releases, consider exempting input URLs from the ignore lists. But it is fine for me as it is now!

Keep up the good work!!😀😀

ivan commented 4 years ago

Ignores shouldn't apply to any start URLs: https://github.com/ArchiveTeam/grab-site/blob/12e798b07578e351142b0e7e38a3f6f017f87b3d/libgrabsite/wpull_hooks.py#L360-L361 - let me know if you see otherwise.
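The intended behaviour can be sketched as follows. This is an illustration only, not the code at the linked lines; `should_ignore`, `start_urls`, and the sample URLs are hypothetical names and values.

```python
import re


def should_ignore(url, start_urls, ignore_regex):
    """Hypothetical check: start URLs are never ignored, even if they match."""
    if url in start_urls:
        return False
    return ignore_regex.search(url) is not None


start_urls = {"http://mydomain.com/blabla/view.php?id=4351"}
ignore_regex = re.compile(r"^https?://mydomain\.com/blabla/view\.php\?id=")

# A start URL is fetched although it matches the ignore pattern.
print(should_ignore("http://mydomain.com/blabla/view.php?id=4351",
                    start_urls, ignore_regex))
# A discovered link with the same shape is skipped.
print(should_ignore("http://mydomain.com/blabla/view.php?id=1234",
                    start_urls, ignore_regex))
```

With a check like this, the 20 wanted URLs could go in the input file and a plain pattern without lookahead would cover the rest.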