puiterwijk closed this pull request 8 years ago
@puiterwijk any considerations with respect to the Python GIL?
In this case, the overhead caused by copy-on-write vastly outweighs the GIL, in my opinion and in my testing experience.
This is mostly because very little code runs per child (just a single mirrorlist call and some data marshaling), and a lot of I/O happens there (reading requests and writing responses happens inside the children).
In MM1, tcp setup outweighed any computation time in the mirrorlist code. Clients were getting answers in 0.3s, which I considered good enough. What is the response time now?
When testing on the local machine (so skipping the SSL and reverse proxy stuff in Fedora infrastructure), the entire curl command takes 0m0.173s.
Ok that is about what I remember. Not sure threading buys a lot then.
The various pickle lists are edited in memory in each process exactly because I knew it was using forking. You will have to find those and make them explicit copies to make use of threading.
Well, the old system only edits the master process: the only place where the SIGHUP handler gets attached to sighup_handler is in the main function; all children get SIG_IGN for SIGHUP. This means that only the master process, which is listening for new connections, gets the updated pickle, and the children will not.

This is where the copy-on-write of the fork() system call comes into play: the master process gets an updated set of variables for the in-memory pickle structures, but to prevent that from syncing out to all children, every child now has to clone the entire old in-memory pickle structure. Especially after we increased the number of children to 80, this was causing a massive processing and memory burden at the moment of pickle updates.

With threading, all children just use the master process' variables, and when those change they follow along.
This last commit makes sure the entire in-memory structure gets updated all at once.
During the parsing this will have the entire structure in memory twice, but that's far fewer than the 40 to 80 copies the copy-on-write was producing.
Children are very short-lived and serve only a single client. I did not want their memory contents to change underneath them during execution, hence they ignore SIGHUP and finish their job unimpeded.
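The forking model described above can be sketched as follows (hypothetical names, heavily simplified; this is not the actual mirrorlist code): only the master reacts to SIGHUP, while every forked child ignores it and answers exactly one client against the copy-on-write snapshot of the in-memory data it inherited at fork() time.

```python
import os
import signal

mirror_data = {"generation": 1}          # stands in for the in-memory pickle

def sighup_handler(signum, frame):
    # Master only: re-read the pickle here; children never see this update.
    mirror_data["generation"] += 1

def serve_one_client(conn):
    # Runs in the child: ignore SIGHUP so the snapshot can't change mid-request.
    signal.signal(signal.SIGHUP, signal.SIG_IGN)
    return "generation %d" % mirror_data["generation"]

def handle_request(conn):
    pid = os.fork()
    if pid == 0:                         # child: one client, then exit
        serve_one_client(conn)
        os._exit(0)
    os.waitpid(pid, 0)                   # master: reap the child, keep listening

signal.signal(signal.SIGHUP, sighup_handler)
```

After a SIGHUP, the master's `mirror_data` changes, but a child forked before the update would have kept serving from its own copy-on-write pages, which is exactly the duplication being discussed.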
You are welcome to use threads, but look at where the lists are changed, specifically tree_lookup():

```python
def tree_lookup(tree, ip, field, maxResults=None):
    # fast lookup in the tree; if present, find all the matching values by
    # deleting the found one and searching again this is safe w/o copying
    # the tree again only because this is the only place the tree is used,
    # and we'll get a new copy of the tree from our parent the next time it
    # fork()s.
    # returns a list of tuples (prefix, data)
```
Copy-on-write ensured that the parent process (which only forks children) had the current pickle data; each child would get a copy-on-write of that data (which doesn't consume any additional memory except for the tree_lookup() calls) so while top reports lots of memory in use, nearly all of it is a read-only copy of the parent process. And forking the parent is pretty fast then.
You can fairly easily fix tree_lookup() to be thread-safe by forcing it to make a thread-local copy of the tree before modifying it.
Thanks, Matt
Thank you very much for this piece of information! This last patch should handle tree_lookup by always making a local copy.
Looking into the py-radix library, they have added a new function called search_covered. If we update the package and the requirement to the latest version (0.9.3), we could use that function to do the tree_lookup for us instead of having to do a copy or in-memory edit, and just grab the first maxResults entries from its result.
A patch to make it use search_covered; I am currently using arbitrary max bitlengths for the lookup:
```diff
diff --git a/mirrorlist/mirrorlist_server.py b/mirrorlist/mirrorlist_server.py
index bf463c9..9bc58da 100755
--- a/mirrorlist/mirrorlist_server.py
+++ b/mirrorlist/mirrorlist_server.py
@@ -206,8 +206,16 @@ def tree_lookup(tree, ip, field, maxResults=None):
     len_data = 0
     if ip is None:
         return result
-    node = ltree.search_best(ip.strNormal())
-    while node is not None:
+    # Determine a max bitlength we want to match against
+    # The values have been chosen arbitrarily
+    max_bitlen = 8
+    if ':' in ip:
+        # IPv6, let's limit further
+        max_bitlen = 32
+
+    ip = '%s/%i' % (ip, max_bitlen)
+    lresults = ltree.serch_covered(ip)[:maxResults]
+    for node in lresults:
         prefix = node.prefix
         if type(node.data[field]) == list:
             len_data += len(node.data[field])
@@ -215,11 +223,6 @@
             len_data += 1
         t = (prefix, node.data[field],)
         result.append(t)
-        if maxResults is None or len_data < maxResults:
-            ltree.delete(prefix)
-            node = ltree.search_best(ip.strNormal())
-        else:
-            break
     return result
```
Typo: ltree.search_covered()
Thanks.
What way of fixing would you prefer?
The deep copy method is not going to work in practice, given the size of the trees and objects it has to copy. We should either use search_covered, or implement our own version of it. I will push my fixed commit with search_covered.
Never mind, this does not do what I wanted it to do. I need to think this over some more.
So my logic was correct, but I was using the wrong function: search_covering does the trick.
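The mixup between the two py-radix calls is easy to make. A small stdlib-only sketch (using the ipaddress module purely as an analogy for py-radix's search_covered()/search_covering(), with made-up prefixes) shows why "covering" is the right direction for a client-address lookup:

```python
import ipaddress

# Prefixes standing in for entries in the radix tree (made up for the example).
prefixes = [ipaddress.ip_network(p)
            for p in ('10.0.0.0/8', '10.1.0.0/16', '192.168.0.0/24')]

query = ipaddress.ip_network('10.1.2.0/24')

# "covering": prefixes that contain the query -- what a client-IP lookup wants,
# matching what the old search_best() loop returned
covering = [str(p) for p in prefixes if query.subnet_of(p)]

# "covered": prefixes contained inside the query -- narrower entries only
covered = [str(p) for p in prefixes if p.subnet_of(query)]
```

Here `covering` yields the two enclosing networks while `covered` is empty, which mirrors why search_covered returned nothing useful for a single client address and search_covering does.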
@puiterwijk, so this code looks reasonable to me, but I don't have a basis to independently decide if it is actually better or worse with the change. Can we try this in staging or a cloud node, hit it with a farm of 'ab' runs, and generate some numbers about how it performs with and without the change?
So what's the current status of this PR? @puiterwijk is this code the one we have running now in prod (hotfix during the freeze?)?
This is indeed the code we are now running in production. I will do some performance tests soon.
I have run ab against both versions (threaded and forking) of the mirrorlist servers:
for i in 10 100 200 400; do ab -n 5000 -c $i 'https://mirrors.stg.fedoraproject.org/metalink?repo=epel-7&arch=x86_64'; done
http://lisas.de/~adrian/ab.thread.log http://lisas.de/~adrian/ab.fork.log
The following gives an overview of the longest requests for each concurrency level:
| Concurrency | 10 | 100 | 200 | 400 |
|---|---|---|---|---|
| Forking (worst) | 1861 | 2347 | 7163 | 76404 |
| Threaded (worst) | 1794 | 4028 | 8785 | 38452 |
Except for the case with the highest concurrency (400), the results are not very conclusive. As we are aiming for lower concurrency values in the production setup, that makes the results even harder to interpret.
My experience with the new threaded mirrorlist server implementation, however, is that it has handled the release pretty well.
Looking at the Time per request, the difference is quite a bit bigger ^_^ (at least for the two middle ones)
Forking is about 23 ms slower per request than threading, but that's only about a 3% difference. Either model works fine for our limited usage.
Problems came up when the number of requests the proxies pass through becomes significantly bigger than the number of threads. This will not happen with a single reverse proxy, since it has a limited set of backend connections, but it did happen with 12 reverse proxies combined with a few thousand requests arriving at nearly the same time (as when one network sent about 50,000 requests within a single minute, spread over the proxies).
Everyone seems to agree with this change (or doesn't mind), and it has been running in prod for a while, making it at least as good as the previous implementation.
Since I'm about to cut a new release, I'm merging this and it will be part of the release.
This improves the performance significantly for several reasons:
One possible reason I could think of why forks have been used rather than threads in the past is that with forking the children do not get impacted at the moment of a metalink update. However, since the Operating System has cross-thread locks, this is no issue with regard to locking (the OS will take care of it). The duration of these locks isn't a problem either, since the only change that happens under the lock is the swap of one pointer from the old value to the new, just-parsed value.
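The "only a pointer swap happens under the lock" point can be sketched like this (hypothetical names, not the actual server code): the expensive parse happens entirely outside the lock, and the only work done while holding it is rebinding a single reference, so reader threads are blocked for a pointer swap, never for a parse.

```python
import threading

_lock = threading.Lock()
_database = {"metalink": "old"}          # stands in for the parsed structures

def update_database(new_parsed):
    # new_parsed was fully built before we ever touch the lock
    global _database
    with _lock:
        _database = new_parsed           # the only change made under the lock

def lookup():
    with _lock:
        db = _database                   # grab a consistent snapshot
    return db["metalink"]                # use the snapshot lock-free
```

A request that grabbed the old snapshot before an update keeps using it to completion, which gives the same "stable data during a request" property the ignored-SIGHUP children had.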
Signed-off-by: Patrick Uiterwijk puiterwijk@redhat.com