fedora-infra / mirrormanager2

Rewrite of the MirrorManager application in Flask and SQLAlchemy
https://mirrormanager.fedoraproject.org
GNU General Public License v2.0

Switch the MirrorList server to use threading rather than forking #140

Closed puiterwijk closed 8 years ago

puiterwijk commented 8 years ago

This improves the performance significantly for several reasons:

  1. Threading uses far fewer resources than forking in general.
  2. Forks share the parent's memory copy-on-write: at the moment of fork() they simply keep pointing at the parent process's memory, so nothing actually has to be copied during fork(). The problem shows up at the moment of a metalink update, when the variables the master process uses for the in-memory structure suddenly all change. Every fork that is still active at that point has to copy all of the previous values to make sure it is not impacted by the update. (This is Linux-specific; I am not sure about other operating systems.)

One possible reason I can think of why forks were used rather than threads in the past is that with forking the children are not impacted at the moment of a metalink update. However, since the operating system provides cross-thread locks, this is not an issue as far as locking goes (the OS takes care of it). The duration of these locks isn't a problem either, since the only change that happens under the lock is swapping one pointer from the old value to the newly parsed value.
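To illustrate the locking argument, here is a minimal sketch (with hypothetical names such as swap_cache, not the actual mirrorlist_server code) of the only operation that happens under the lock:

    import threading

    _cache_lock = threading.Lock()
    _mirror_cache = None   # the in-memory metalink/pickle structure

    def swap_cache(new_cache):
        # Called once the new data has been fully parsed; the lock is held
        # only for the duration of a single reference assignment.
        global _mirror_cache
        with _cache_lock:
            _mirror_cache = new_cache

    def get_cache():
        # Request threads pick up the current reference under the same lock.
        with _cache_lock:
            return _mirror_cache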

Signed-off-by: Patrick Uiterwijk puiterwijk@redhat.com

ralphbean commented 8 years ago

@puiterwijk any considerations with respect to the Python GIL?

puiterwijk commented 8 years ago

In this case, in my opinion and from my test experience, the overhead caused by the copy-on-write vastly outweighs the GIL.

This is mostly because very little code runs per child (just a single mirrorlist call and some data marshaling), while a lot of I/O happens (reading requests and writing responses takes place inside the children).
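A small sketch of why the GIL matters little here (do_mirrorlist_lookup is a hypothetical stand-in for the real lookup): blocking socket calls release the GIL, so request threads overlap on I/O even though the short CPU-bound section does not.

    import socket

    def handle_client(conn: socket.socket) -> None:
        # recv() and sendall() release the GIL while blocked on the socket,
        # so many request threads can be in flight at once.
        request = conn.recv(65536)
        response = do_mirrorlist_lookup(request)  # brief CPU-bound work
        conn.sendall(response)
        conn.close()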

mdomsch commented 8 years ago

In MM1, TCP setup outweighed any computation time in the mirrorlist code. Clients were getting answers in 0.3s, which I considered good enough. What is the response time now?


puiterwijk commented 8 years ago

When testing on the local machine (so skipping the SSL and reverse proxy stuff in Fedora infrastructure), the entire curl command takes 0m0.173s.

mdomsch commented 8 years ago

Ok that is about what I remember. Not sure threading buys a lot then.

The various pickle lists are edited in memory in each process precisely because I knew it was using forking. You will have to find those places and make explicit copies to make use of threading.


puiterwijk commented 8 years ago

Well, the old system only updates the master process: the SIGHUP handler only gets attached to sighup_handler in the main function, everywhere else SIGHUP is set to SIG_IGN. This means that only the master process, which is listening for new connections, gets the updated pickle; the children do not. This is where copy-on-write for the fork() system call comes into play: the master process gets an updated set of variables for the in-memory pickle structures, but to keep that change from propagating to the children, every child now has to clone the entire old in-memory pickle structure. Especially after we increased the number of children to 80, this caused a massive processing-power and memory burden at the moment of a pickle update. With threading, all children simply use the master process's variables and follow along when those change.
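A rough sketch of the signal wiring described above (reload_pickle, accept_and_fork and handle_request are hypothetical placeholders): only the master installs the handler, and each forked child resets SIGHUP to SIG_IGN so a pickle update never changes its view mid-request.

    import signal

    def sighup_handler(signum, frame):
        # Master only: re-read the pickle and swap the in-memory structures.
        reload_pickle()

    def master_main():
        signal.signal(signal.SIGHUP, sighup_handler)
        accept_and_fork()

    def child_main(connection):
        # The handler is inherited across fork(), so the child explicitly
        # ignores SIGHUP and serves its single request undisturbed.
        signal.signal(signal.SIGHUP, signal.SIG_IGN)
        handle_request(connection)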

puiterwijk commented 8 years ago

This last commit makes sure the entire in-memory structure gets updated all at once.

During the parsing this keeps the entire structure in memory twice, but that is far fewer copies than the 40 to 80 that copy-on-write was producing.
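As a sketch of that update path (again with hypothetical names, building on swap_cache from the earlier sketch): the new pickle is parsed into a completely separate object first, and only the final reference swap is visible to the request threads.

    import pickle

    def reload_database(path):
        # Parse into a fresh structure; the old and new copies coexist
        # briefly, which is the "in memory twice" cost mentioned above.
        with open(path, 'rb') as f:
            new_cache = pickle.load(f)
        # Hand the complete new structure to the request threads in one swap.
        swap_cache(new_cache)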

mdomsch commented 8 years ago

Children are very short-lived and serve only a single client each. I did not want their memory contents to change underneath them during execution, hence they ignore SIGHUP and finish their job unimpeded.

You are welcome to use threads, but look at where the lists are changed, specifically tree_lookup():

    def tree_lookup(tree, ip, field, maxResults=None):
        # fast lookup in the tree; if present, find all the matching values by
        # deleting the found one and searching again. This is safe w/o copying
        # the tree again only because this is the only place the tree is used,
        # and we'll get a new copy of the tree from our parent the next time it
        # fork()s.
        # returns a list of tuples (prefix, data)

Copy-on-write ensured that the parent process (which only forks children) had the current pickle data; each child got a copy-on-write view of that data, which doesn't consume any additional memory except for the tree_lookup() calls. So while top reports lots of memory in use, nearly all of it is a read-only copy of the parent process's data, and forking the parent is pretty fast.

You can fairly easily fix tree_lookup() to be thread-safe by forcing it to make a thread-local copy of the tree before modifying it.
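A minimal sketch of that suggestion, assuming the py-radix API (Radix(), nodes(), add(), node.data): rebuild a private tree so the destructive delete-and-search loop never touches the shared copy. As discussed further down, rebuilding becomes expensive for large trees.

    import radix

    def thread_local_copy(tree):
        # Hypothetical helper: clone the shared radix tree so this thread can
        # delete prefixes while searching without affecting other threads.
        local = radix.Radix()
        for node in tree.nodes():
            local.add(node.prefix).data.update(node.data)
        return local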

Thanks, Matt


puiterwijk commented 8 years ago

Thank you very much for this piece of information! This last patch should handle tree_lookup by always making a local copy.

puiterwijk commented 8 years ago

Looking into the py-radix library, they have added a new function called search_covered. If we updated the package and the requirement to the latest version (0.9.3), we could use that function to do the tree_lookup for us instead of having to do a copy or an in-memory edit, and just grab the first maxResults entries from its result.

puiterwijk commented 8 years ago

A patch to make it use search_covered; I am currently using arbitrary max bitlengths for the lookup.

diff --git a/mirrorlist/mirrorlist_server.py b/mirrorlist/mirrorlist_server.py
index bf463c9..9bc58da 100755
--- a/mirrorlist/mirrorlist_server.py
+++ b/mirrorlist/mirrorlist_server.py
@@ -206,8 +206,16 @@ def tree_lookup(tree, ip, field, maxResults=None):
     len_data = 0
     if ip is None:
         return result
-    node = ltree.search_best(ip.strNormal())
-    while node is not None:
+    # Determine a max bitlength we want to match against
+    # The values have been chosen arbitrarily
+    max_bitlen = 8
+    if ':' in ip:
+        # IPv6, let's limit further
+        max_bitlen = 32
+
+    ip = '%s/%i' % (ip, max_bitlen)
+    lresults = ltree.serch_covered(ip)[:maxResults]
+    for result in lresults:
         prefix = node.prefix
         if type(node.data[field]) == list:
             len_data += len(node.data[field])
@@ -215,11 +223,6 @@ def tree_lookup(tree, ip, field, maxResults=None):
             len_data += 1
         t = (prefix, node.data[field],)
         result.append(t)
-        if maxResults is None or len_data < maxResults:
-            ltree.delete(prefix)
-            node = ltree.search_best(ip.strNormal())
-        else:
-            break
     return result

mdomsch commented 8 years ago

Typo: ltree.search_covered()


puiterwijk commented 8 years ago

Thanks.

Which way of fixing it would you prefer?

puiterwijk commented 8 years ago

The deep copy method is not going to work in practice, given the size of the trees and objects it has to copy. We should either use search_covered, or implement our own version of it. I will push my fixed commit with search_covered.

puiterwijk commented 8 years ago

Never mind, this does not do what I wanted it to do. I need to think this over some more.

puiterwijk commented 8 years ago

So my logic was correct, but I was using the wrong function. Using search_covering does the trick.
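For reference, a small example of the difference, assuming the py-radix API: search_covered() returns the tree entries contained inside the query prefix, while search_covering() returns the entries that contain it, which is what a per-client lookup needs.

    import radix

    rtree = radix.Radix()
    rtree.add('10.0.0.0/8')
    rtree.add('10.1.0.0/16')

    # Entries covered by (inside) the query prefix:
    covered = [n.prefix for n in rtree.search_covered('10.0.0.0/8')]
    # -> includes both '10.0.0.0/8' and '10.1.0.0/16'

    # Entries covering (containing) a single client address:
    covering = [n.prefix for n in rtree.search_covering('10.1.2.3/32')]
    # -> includes both '10.1.0.0/16' and '10.0.0.0/8'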

ralphbean commented 8 years ago

@puiterwijk, so this code looks reasonable to me, but I don't have a basis to independently decide if it is actually better or worse with the change. Can we try this in staging or a cloud node, hit it with a farm of 'ab' runs, and generate some numbers about how it performs with and without the change?

pypingou commented 8 years ago

So what's the current status of this PR? @puiterwijk, is this the code we have running in prod now (hotfixed during the freeze)?

puiterwijk commented 8 years ago

This is indeed the code we are now running in production. I will do some performance tests soon.

adrianreber commented 8 years ago

I have run ab against both versions (threaded and forking) of the mirrorlist servers:

for i in 10 100 200 400; do ab -n 5000 -c $i 'https://mirrors.stg.fedoraproject.org/metalink?repo=epel-7&arch=x86_64'; done

http://lisas.de/~adrian/ab.thread.log http://lisas.de/~adrian/ab.fork.log

The following table gives the longest request time, in milliseconds, at each concurrency level:

Concurrency          10     100     200     400
Forking (worst)    1861    2347    7163   76404
Threaded (worst)   1794    4028    8785   38452

Except for the case with the highest concurrency (400), the results are not very conclusive. Since we are aiming for lower concurrency values in the production setup, the results are even harder to interpret.

My experience with the new threaded mirrorlist server implementation, however, is that it has handled the release pretty well.

pypingou commented 8 years ago

Looking at the Time per request, the difference is quite a bit bigger ^_^ (at least for the two middle ones)

mdomsch commented 8 years ago

Requests take about 23 ms longer with forking than with threading, but that's only about a 3% difference. Either model works fine for our limited usage.


puiterwijk commented 8 years ago

Problems came up when the number of requests the proxies pass through becomes significantly bigger than the number of threads used. This will not happen with a single reverse proxy, since it only has a limited set of backend connections, but it did happen with 12 reverse proxies combined with a few thousand requests arriving at pretty much the same time (as we had when one network sent about 50,000 requests within a single minute, spread over the proxies).

pypingou commented 8 years ago

Everyone seems to agree with this change (or doesn't mind), and it has been running in prod for a while, which makes it at least as good as the previous implementation.

Since I'm about to cut a new release, I'm merging this and it will be part of the release.