Closed manueldeprada closed 4 years ago
Thanks for the report. I used re2 for performance reasons only, so I believe it is fine to fall back to re when re2 fails to compile the combined regexp. Could you please tell me if 087e14517505057ceea3e0aff891dfd9529ec1b8 fixes your issue? I tagged it in 2.2.0.
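The fallback described in this reply could look roughly like the sketch below. It is not the content of the commit mentioned above: the helper name compile_combined, the way the patterns are joined, and the broad except clause are all assumptions made for illustration. The re2 import is optional, so the sketch also runs where no re2 binding is installed.

```python
import re

try:
    import re2  # optional: any re2 binding that exposes compile()
except ImportError:
    re2 = None


def compile_combined(patterns):
    """Join ignore patterns into one alternation, preferring re2 (hypothetical helper)."""
    combined = "|".join("(?:%s)" % p for p in patterns)
    if re2 is not None:
        try:
            return re2.compile(combined)
        except Exception:
            # The exception type differs between re2 bindings, so this
            # sketch catches broadly and falls through to the stdlib engine.
            pass
    return re.compile(combined)


ignores = [
    r"^https?://mydomain\.com/blabla/view\.php\?id=(?!4351).*",  # uses lookahead
    r"\.css$",
]
ignore_re = compile_combined(ignores)

print(bool(ignore_re.search("https://mydomain.com/blabla/view.php?id=9999")))  # True
print(bool(ignore_re.search("https://mydomain.com/blabla/view.php?id=4351")))  # False
```

With this shape, patterns that re2 accepts keep the faster engine, and a single unsupported construct in any pattern moves the whole combined regexp to re.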
It worked flawlessly, thank you very much!!!
For future releases, consider making input URLs exempt from ignore lists. But it is fine for me as it is now!
Keep up the good work!! 😀😀
Ignores shouldn't apply to any start URLs: https://github.com/ArchiveTeam/grab-site/blob/12e798b07578e351142b0e7e38a3f6f017f87b3d/libgrabsite/wpull_hooks.py#L360-L361 - let me know if you see otherwise.
As seen in https://github.com/ArchiveTeam/grab-site/blob/8343916e3c9500074c5f5a46ee60ad3f75bba775/libgrabsite/wpull_hooks.py#L28
you use Google's re2 to compile regular expressions.
The problem is that RE2 does not support lookaround assertions (lookahead or lookbehind).
So if I write something like this in my ignore list:
^https?://mydomain\.com/blabla/view\.php\?id=(?!4351).*
The app will crash, since the regexp cannot be compiled.
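For reference, here is a small sketch of how this pattern behaves under Python's standard re module, which does accept the negative lookahead. The example URLs are made up for illustration. One caveat worth noting: because the lookahead only checks the characters right after id=, any id that merely starts with 4351 is also exempt from the ignore.

```python
import re

# The ignore pattern from this report, compiled with the stdlib engine.
pattern = re.compile(r"^https?://mydomain\.com/blabla/view\.php\?id=(?!4351).*")

urls = [
    "https://mydomain.com/blabla/view.php?id=1234",   # matches: would be ignored
    "https://mydomain.com/blabla/view.php?id=4351",   # no match: would be kept
    "https://mydomain.com/blabla/view.php?id=43519",  # no match either (prefix 4351)
]

for url in urls:
    print(url, "ignored" if pattern.search(url) else "kept")
```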
How did I get there?
I want to grab about 20 URLs that look like mydomain.com/blabla/view.php?id=xxx. I can put the 20 URLs in my input-file argument. But I want to ignore all the other view.php?id=xxx URLs (there are hundreds, and all of them are interlinked).
It would be desirable for the input-file argument to override the ignores, but it doesn't. So if I tell grab-site to ignore all the URLs that look like mydomain.com/blabla/view.php?id=xxx, nothing gets downloaded. So I came up with the regular expression you can see at the beginning. I don't know how else to achieve my goal, other than changing the regex engine myself.
Help would be much appreciated.