ArchiveTeam / ArchiveBot

ArchiveBot, an IRC bot for archiving websites
http://www.archiveteam.org/index.php?title=ArchiveBot
MIT License

Fix ignores sometimes not being applied correctly due to thread-related race conditions #493

Closed by JustAnotherArchivist 3 years ago

JustAnotherArchivist commented 3 years ago

`set_patterns` is called by `ListenerWorkerThread` and modifies variables used by the main wpull thread. This introduces several race conditions: for example, `_compiled` may, in rare cases, get cleared while `ignores` is iterating over it, and `patterns` might change during compilation.

For performance reasons, it's undesirable to take a lock on every call to `ignores`, especially as that method might get called many millions of times on large jobs. This implements an intermediate solution that avoids such locking: `_compiled` is only ever modified by the main thread, so the actual ignore checks can't be affected by the updates. The updater thread might change `_primary` and `patterns` during or after the recompilation check, but such a change always requires a recompilation anyway, and that is guaranteed to happen either during that `ignores` call or on the next one; the same applies if the change occurs after the recompilation. This doesn't eliminate the race conditions entirely, but it mitigates them.
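For illustration, here's a condensed, hypothetical sketch of that scheme (the real implementation lives in pipeline/archivebot/wpull/ignoracle.py, excerpted in the patch below, and handles more, e.g. pattern validation and placeholder substitution): `set_patterns` only updates the trigger state under a lock, while `_compiled` is written exclusively by the main thread inside `ignores`.

```python
import re
import threading
from urllib.parse import urlsplit

class IgnoracleSketch:
    def __init__(self):
        self._lock = threading.Lock()  # guards writes from the listener thread
        self.patterns = []
        self._primary = None           # None forces recompilation on the next ignores() call
        self._compiled = []

    def set_patterns(self, patterns):
        # Called from the listener thread: update the trigger state only,
        # never _compiled itself.
        with self._lock:
            self.patterns = list(patterns)
            self._primary = None

    def ignores(self, url, primary_url):
        primary = (primary_url, urlsplit(primary_url).netloc)
        if self._primary != primary:
            # Only this (main) thread ever assigns _compiled, so the loop
            # below can never see a half-built list; a concurrent
            # set_patterns() at worst causes one extra recompilation.
            self._compiled = [(p, re.compile(p)) for p in self.patterns]
            self._primary = primary
        for pattern, compiled in self._compiled:
            if compiled.search(url):
                return pattern
        return False
```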

JustAnotherArchivist commented 3 years ago

I've manually tested the particular race condition we experienced on job 1mnfkzd070ichqvewbzx62kmy, with some minor modifications to trigger it artificially.

Ignoracle patch:

```diff
diff --git a/pipeline/archivebot/wpull/ignoracle.py b/pipeline/archivebot/wpull/ignoracle.py
index d50c411..f5763c7 100644
--- a/pipeline/archivebot/wpull/ignoracle.py
+++ b/pipeline/archivebot/wpull/ignoracle.py
@@ -80,6 +80,7 @@ class Ignoracle(object):
             patterns.append(string)
 
         with self._lock:
+            print('resetting patterns')
             self.patterns = patterns
             self._primary = None
             # Don't replace _compiled here; _primary acts as a trigger for the recompilation.
@@ -111,11 +112,15 @@ class Ignoracle(object):
             self._primary = (primaryUrl, primaryNetloc)
 
         for pattern, compiled in self._compiled:
+            print(f'checking {pattern}')
             match = compiled.search(url_record.url)
             if match:
                 return pattern
 
+        import time
+        time.sleep(1)
+
         return False
 
 
 # vim: ts=4:sw=4:et:tw=78
```
test.py:

```python
import ignoracle
import threading
import time
import wpull.pipeline.item

patterns = ['/a/', '/b/']

class UpdaterThread(threading.Thread):
    def __init__(self, ignoracle, *args, **kwargs):
        self._ignoracle = ignoracle
        super().__init__(*args, **kwargs)

    def run(self):
        time.sleep(1.5)
        self._ignoracle.set_patterns(patterns)

r = wpull.pipeline.item.URLRecord()
r.url = 'https://example.org/b/'
r.parent_url = 'https://example.org/'
r.level = 2

i = ignoracle.Ignoracle()
i.set_patterns(patterns)

t = UpdaterThread(i)
t.start()

print('checking after starting thread')
print(i.ignores(r))  # 1 second will have passed here

print('checking again')
print(i.ignores(r))  # After checking /a/, UpdaterThread will reset the patterns

t.join()

print('checking after thread join')
print(i.ignores(r))
```
Output:

```
resetting patterns
checking after starting thread
checking /a/
checking /b/
/b/
checking again
checking /a/
resetting patterns
checking /b/
/b/
checking after thread join
checking /a/
checking /b/
/b/
```
JustAnotherArchivist commented 3 years ago

Thanks! Will merge shortly.

* I'm unsure of the purpose of setting `self._primary = None` in the original code

That still serves the same purpose: triggering a compilation on the first call to `ignores`, since `None != (primaryUrl, primaryNetloc)` is always true. Yeah, it's a hack. The advantage is that there's only one condition to check inside `ignores`. That probably doesn't matter performance-wise though.

* I don't actually know whether `self.patterns = patterns` is atomic.  If it is, we could likely avoid the lock entirely, if we took a reference of self.patterns in ignores before processing it into the compiled list.  That could be much slower though, depending on whether copy on write is used.

Well, it's complicated...

`self.patterns = patterns` compiles to bytecode like this in CPython 3.9:

  5           0 LOAD_FAST                0 (patterns)
              2 LOAD_GLOBAL              0 (self)
              4 STORE_ATTR               1 (patterns)

The actual assignment happens within a single bytecode instruction, and due to the GIL, only one bytecode instruction can be executed at a time in CPython. So in that sense, it is atomic.

However, as far as I know, this isn't actually specified anywhere. I believe nothing about thread safety is, except for code that's explicitly thread-safe, like the threading module. So if something releases the GIL in CPython, or if you use a different implementation that doesn't have a GIL (like Jython or IronPython), it could break down, I guess.
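For reference, the disassembly above can be reproduced with the `dis` module by disassembling the assignment inside a function (the function's trailing `LOAD_CONST`/`RETURN_VALUE` is omitted in the excerpt above):

```python
import dis

def f(patterns):
    # 'patterns' is a local (LOAD_FAST); 'self' resolves as a global here
    # (LOAD_GLOBAL), matching the excerpt above.
    self.patterns = patterns

dis.dis(f)
```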

* If the data in _compiled actually has to vary according to the current base URL (does it?), it seems very silly to keep a single one in global state.  Alternating base URLs would be a worst case for performance.  We may consider storing a hashmap of primaryUrl -> _patterns.

Yeah, patterns may need to be recompiled when the base URL changes because the values of `{primary_url}` and `{primary_netloc}` change; those placeholders are substituted before the compilation. That's used in the singletumblr igset, for example.
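Roughly like this; the substitution mechanics (e.g. escaping) are illustrative here and may differ in detail from the actual ignoracle code:

```python
import re

pattern = r'^{primary_url}page/\d+$'  # illustrative pattern, not from a real igset
primary_url = 'https://someblog.tumblr.com/'

# Substitute the placeholders, then compile; a different primary URL yields a
# different regex, hence the recompilation.
expanded = pattern.format(
    primary_url=re.escape(primary_url),
    primary_netloc=re.escape('someblog.tumblr.com'),
)
compiled = re.compile(expanded)
print(bool(compiled.search('https://someblog.tumblr.com/page/2')))  # True
```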

I actually thought about using a dict mapping primary URLs to patterns when I originally implemented the compilation a few years ago (#262). The main downside is that you'd then need to keep the compiled pattern objects once per primary URL, and that will quickly blow up memory consumption for jobs with large URL lists. For example, one `re.compile(r'^https?://example\.org/')` object consumes 308 bytes in a quick test on CPython 3.9.1. Every job currently has at least 212 ignores (from the global igset), so that's somewhere around 65 kB of pattern objects per primary URL. Even a tiny Twitter scrape job list of a few thousand URLs would then mean a couple hundred MB of RAM, and jobs with millions of URLs in the list aren't rare. So that isn't feasible.
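A rough way to reproduce those numbers (sizes vary across CPython versions, and `sys.getsizeof` doesn't count referenced objects, so this is a lower bound):

```python
import re
import sys

per_pattern = sys.getsizeof(re.compile(r'^https?://example\.org/'))
print(per_pattern)                        # ~308 bytes on CPython 3.9.1
print(212 * per_pattern)                  # ~65 kB: the global igset for one primary URL
print(3000 * 212 * per_pattern // 10**6)  # ~200 MB for a few thousand primary URLs
```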

Also, your worst case actually only applies during the initial iteration over the URL list, due to wpull's breadth-first recursion. Once it's done with that, it will process larger blocks of URLs discovered from each of the initial URLs; those obviously share the primary_url and primary_netloc values, so no recompilation takes place until it reaches the block of URLs from the next initial URL.

There is definitely room for improvement though. In particular, only patterns that use those special placeholders need to be recompiled; the other compiled objects could be kept (see the sketch below).
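A hypothetical sketch of that improvement (not implemented; names are illustrative): compile placeholder-free patterns once, and only recompile the ones referencing the placeholders when the primary URL changes.

```python
import re

PLACEHOLDERS = ('{primary_url}', '{primary_netloc}')

def split_patterns(patterns):
    # Static patterns are compiled once and reused across primary URLs;
    # dynamic ones stay raw and get recompiled per primary URL.
    static, dynamic = [], []
    for p in patterns:
        if any(ph in p for ph in PLACEHOLDERS):
            dynamic.append(p)
        else:
            static.append((p, re.compile(p)))
    return static, dynamic
```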

That said, it appears all of this isn't actually as bad as it seemed at the time I implemented #262. I just looked with py-spy at Twitter scrape job 2x0k6405fu19mms9y28ymvdlb while it was running through its ~130k URL list, i.e. it should represent the worst case you mentioned. re._compile accounted for only 0.13 s in a sample spanning about 31 s... Maybe that depends somewhat on the exact circumstances of a job, though I don't see how right now. I remember that it showed up as a major slow point in some performance measurements at the time, but I have no idea what the details were.

* The logic for recompiling all the ignore patterns is fragile.  If we mean to recompile them whenever the compiled array is empty, we should say that instead of relying on `self._primary = None` to force recompilation.  Too clever by half.

We can't clear the compiled list because the other thread might be iterating over it at the same time; we'd have to use a lock or make a copy of the list. I tried the latter but found it messier. Also, we'd still need the `_primary` condition anyway. Perhaps a cleaner implementation would be a `self._mustRecompile` bool that is set to True in `set_patterns` and OR'd with the `self._primary` comparison (sketched below). I haven't thought that through regarding thread safety though.
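For concreteness, a rough sketch of that variant (hypothetical, and as said, not vetted for thread safety; the flag reset in `ignores` could still race with `set_patterns`):

```python
import re
import threading

class IgnoracleAlt:
    def __init__(self):
        self._lock = threading.Lock()
        self.patterns = []
        self._primary = None
        self._compiled = []
        self._mustRecompile = True  # explicit trigger instead of the _primary = None hack

    def set_patterns(self, patterns):
        with self._lock:
            self.patterns = list(patterns)
            self._mustRecompile = True

    def ignores(self, url, primary):
        if self._mustRecompile or self._primary != primary:
            self._compiled = [(p, re.compile(p)) for p in self.patterns]
            self._primary = primary
            self._mustRecompile = False  # potential race with set_patterns
        for pattern, compiled in self._compiled:
            if compiled.search(url):
                return pattern
        return False
```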

Overall, I don't really want to spend more time on the existing ignoracle beyond fixing bugs like the one in this PR. It will always have pretty horrible performance due to wpull checking each URL individually out of and back into the DB, URL parsing, and other things. #446 and the idea of silent ignores that are kept out of the DB and log entirely would be a better approach.