Closed: marcharper closed this issue 8 years ago.
👍 And perhaps with BigResultSet there would be a way to speed things up by parallelising them? (No idea: just a thought.)
Hey @marcharper, have you played around with HDF5? http://docs.h5py.org/en/latest/index.html
Looks like this would involve storing the file in a different format (perhaps we could do that with a big switch or something...) and might have other implications. Just a thought.
My next idea was to load the data using generators, computing wins, score differences, and such to reduce the memory footprint, possibly going over the file more than once.
I'm open to other options of course. I previously tried a somewhat similar approach with shelve (trying to use the standard library if possible). It's really just about the database build time: if an HDF5 file can be constructed relatively quickly then it makes sense, otherwise simply making multiple passes over a text file is likely faster.
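For reference, the shelve approach mentioned above could look roughly like this; the key/value layout here is purely illustrative (a pair of player indices mapped to a list of interactions), not what was actually tried:

```python
import os
import shelve
import tempfile

# Illustrative layout: key on the player index pair so later passes can
# look up a single pair without re-reading the whole interactions file.
db_path = os.path.join(tempfile.mkdtemp(), "interactions")

with shelve.open(db_path) as db:
    # shelve keys must be strings, so serialise the index pair
    db[str((0, 1))] = [[("C", "D"), ("D", "C")]]

with shelve.open(db_path) as db:
    stored = db[str((0, 1))]
```

The build cost is exactly the concern raised above: every write goes through pickle and the underlying dbm, so constructing the database can dominate the total time.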
I think the idea of generators sounds better to me. Saw this on a thread on reddit so thought it was worth mentioning :) Perhaps something to fall back on if generators don't do the trick :)
Definitely something to consider when we build out an API that saves large numbers of results over time.
FYI, I've had a little go at this, still very much a work in progress and mainly just playing around.
The main input to the ResultsSet is a dictionary mapping tuples of player index pairs to lists of interactions. So below I've got a class that simply mimics that dictionary using generators, never really (I think) holding anything in memory. Here it is:
```python
import csv


class InteractionsFromFile:
    """A class that behaves like a dictionary but reads from file."""

    def __init__(self, filename):
        self.file = filename

    def __getitem__(self, index_pair):
        """Gives the dictionary-like behaviour."""
        interactions = []
        with open(self.file, 'r') as file:
            reader = csv.reader(file)
            for row in reader:
                if index_pair == (int(row[0]), int(row[1])):
                    interactions.append(list(zip(row[4], row[5])))
        return interactions

    def __iter__(self):
        return self.keys()

    def keys(self):
        """Mimics dict.keys()"""
        return self.key_generator()

    def key_generator(self):
        """A generator for the keys.

        I use a cache to keep track, as you go over the file,
        of whether it's a new key or not."""
        self.keys_cache = set()
        with open(self.file, 'r') as file:
            reader = csv.reader(file)
            for row in reader:
                key = (int(row[0]), int(row[1]))
                if key not in self.keys_cache:
                    self.keys_cache.add(key)
                    yield key

    def values(self):
        """Mimics dict.values()"""
        return self.value_generator()

    def value_generator(self):
        for key in self.keys():
            yield self[key]

    def items(self):
        """Mimics dict.items()"""
        return self.item_generator()

    def item_generator(self):
        for key in self.keys():
            yield (key, self[key])
```
This means you can simply plug that in to the ResultSet without changing any code there:
>>> rs = axl.ResultSetFromFile("basic.csv") # The usual results set
>>> irs = InteractionsFromFile("basic.csv")
>>> sorted(rs.interactions.keys()) == sorted(list(irs.keys())) # Checking that `.keys()` gives the same
>>> sorted(rs.interactions.values()) == sorted(list(irs.values())) # Checking `.values()`
>>> sorted(rs.interactions.items()) == sorted(list(irs.items())) # Checking `.items()`
Those last three checks ensure we have the behaviour that the ResultsSet expects.
You can then create a ResultsSet like this:
>>> brs = axl.ResultSet(rs.players, irs)
(There I'm 'borrowing' the players list from rs.players, but it's easy enough to get that somewhere else / calculate it again.)
This is SUPER SLOW at the moment and I haven't checked that memory is definitely not being used, although if the heat of my laptop on my lap while writing to and from disk is anything to go by, I think it's not...?
Could be that we can add some optimisations in the ResultsSet and/or change how the ResultSet does stuff. I also think that if we can assume the data file is sorted, or at least that interactions between players aren't spread around (which I'm currently not assuming), then it could be sped up (every __getitem__ reads every line of the data file).
Anyway, I'm going to keep playing around with this but just thought I'd let you know that I was giving it a try. :) 👍 (This could be the completely wrong way around this.)
> Could be that we can add some optimisations in the ResultsSet and/or change how the ResultSet does stuff. I also think that if we can assume the data file is sorted or at least that interactions between players aren't spread around (which I'm currently not) then it could be sped up (every __getitem__ reads every line of the data file).
That's probably a good idea.
Some thoughts:

- It's going to be slower than memory of course. The only solution is to throw RAM at it, and for our purposes that's not a reasonable assumption.
- This implementation of __getitem__ is going to be really slow -- it iterates over the entire data file each time.
- Using generators typically keeps memory usage low, that's definitely the right approach.
- Try buffered reads -- Python should do this for you somewhat but I've had success in the past manually reading in a set chunk of data at a time. In this case we need to check how many turns are in each match and go for something specific, like 1m actions (so 1e6 / (2 * turns) rows of interactions).
- If you integrate more deeply into ResultsSet, you could precompute a lot of things on a single pass, and only hit disk a second time if you actually need the full set of interactions again. In other words, use one function to compute wins, scores, etc. on a single pass through interactions (compute the quantities on batches of interactions).
- Parallelize? Disk read is probably the bottleneck though. Solid state drives are much faster (if you don't have one already...)

One more thought -- we're computing matches in batches now, we could compute (all?) the desired quantities in memory before writing to disk. Maybe write all this to a second data file, keeping interactions in their current file. Shelves might be a good solution here (CSV is probably also fine, or a pandas dataframe if you can tolerate the dependency). Then we'd enhance BigResultsSet to return the precomputed wins and such (stored in memory perhaps) and only hit the disk if the user asks for all the interactions.
You can probably see why I haven't written this yet. To do it right will probably take a significant amount of refactoring, and in any case, several hours of uninterrupted time to figure it all out.
Just heading off for the night so I'll read through your comment in detail tomorrow (thank you for it, I am not at all dismissing it: in fact just the opposite!).
FWIW it doesn't look like it's as slow as I thought; I had in fact just stalled because I hadn't implemented an __iter__ method. (Have edited the class above.)
I'll do some analysis of the memory and time consumption of what I've got. Perhaps if the memory consumption is indeed low and the time isn't ridiculously slower this could be worth optimising... Anyway, will read through your comment tomorrow and get back to you.
👍
I'm guessing you could cut the run time of this implementation by a third or so by combining the functions in ResultSet to only make one pass over the interactions file. Disk read is the bottleneck almost surely.
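A rough sketch of what a single-pass version could look like, computing scores and wins together; the function name is illustrative and the payoff table is the standard prisoner's dilemma one rather than whatever the real ResultSet reads from the game object:

```python
import csv
import os
import tempfile
from collections import defaultdict

# Standard PD payoffs (R=3, P=1, S=0, T=5); illustrative only.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}


def single_pass_summaries(filename):
    """Compute total scores and win counts in one pass over the file,
    instead of one pass per quantity."""
    scores, wins = defaultdict(int), defaultdict(int)
    with open(filename, "r") as f:
        for row in csv.reader(f):
            p1, p2 = int(row[0]), int(row[1])
            s1 = s2 = 0
            for pair in zip(row[4], row[5]):
                u1, u2 = PAYOFF[pair]
                s1, s2 = s1 + u1, s2 + u2
            scores[p1] += s1
            scores[p2] += s2
            if s1 > s2:
                wins[p1] += 1
            elif s2 > s1:
                wins[p2] += 1
    return scores, wins


# Demo: one match where player 0 always defects against a cooperator.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerow(["0", "1", "A", "B", "DD", "CC"])

scores, wins = single_pass_summaries(path)
```

Each row is touched exactly once, so disk read cost is paid a single time regardless of how many summary quantities get added.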
Some thoughts:
> It's going to be slower than memory of course. The only solution is to throw RAM at it, and for our purposes that's not a reasonable assumption.
Yup :)
> This implementation of __getitem__ is going to be really slow -- it iterates over the entire data file each time

Yeah, this is brutally terrible. Have thought more about it overnight and this is just stupid.
> Using generators typically keeps memory usage low, that's definitely the right approach
Is there a need to wrap the csv.reader in a generator? I saw that in a video somewhere but it doesn't make sense to me... I'll play around some more...
> Try buffered reads -- Python should do this for you somewhat but I've had success in the past manually reading in a set chunk of data at a time. In this case we need to check how many turns are in each match and go for something specific, like 1m actions (so 1e6 / (2 * turns) rows of interactions)
Cool, batch size could be a tuneable parameter also perhaps?
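A minimal sketch of batched reads along those lines, with the batch size as a parameter; names are illustrative:

```python
import csv
import os
import tempfile
from itertools import islice


def read_in_batches(filename, batch_size=1000):
    """Yield lists of at most batch_size rows, so only one batch is
    ever held in memory at a time."""
    with open(filename, "r") as f:
        reader = csv.reader(f)
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            yield batch


# Demo: five rows read in batches of two.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    for i in range(5):
        writer.writerow([str(i), "1", "A", "B", "CC", "CC"])

batch_sizes = [len(batch) for batch in read_in_batches(path, batch_size=2)]
```

The 1e6 / (2 * turns) suggestion above would just become the default batch_size, tuned per tournament.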
> If you integrate more deeply into ResultsSet, you could precompute a lot of things on a single pass, and only hit disk a second time if you actually need the full set of interactions again. In other words, use one function to compute wins, scores, etc. on a single pass through interactions (compute the quantities on batches of interactions)
I think this is ultimately what needs to happen. The fact that the current results set goes through interactions multiple times made sense from a readability point of view but also because they're dictionaries. Certainly doesn't make sense the way I've implemented it here (loop through for every single tuple lookup... lol this really is terrible).
I think it shouldn't be too hard (famous last words) to get it to do this over two or three passes (possibly even just the one...). I'll have a go at this :)
> Parallelize? Disk read is probably the bottleneck though. Solid state drives are much faster (if you don't have one already...)
Yeah this is to/from ssd. I think there are possibly some things that can be done in parallel. Could be helpful if I drew a map of all the results perhaps (http://axelrod.readthedocs.io/en/latest/tutorials/getting_started/visualising_results.html) to see what needs what... 💭
> One more thought -- we're computing matches in batches now, we could compute (all?) the desired quantities in memory before writing to disk. Maybe write all this to a second data file, keeping interactions in their current file. Shelves might be a good solution here (CSV is probably also fine, or a pandas dataframe if you can tolerate the dependency). Then we'd enhance BigResultsSet to return the precomputed wins and such (stored in memory perhaps) and only hit the disk if the user asks for all the interactions.
I'm hesitant about computing the results during the tournament run:
I think I'm just saying that I'd like to avoid this if possible but that's my initial 'slept on it' thinking... There might be a way to do it that doesn't mess with my worries about 1. and 2.
> You can probably see why I haven't written this yet. To do it right will probably take a significant amount of refactoring, and in any case, several hours of uninterrupted time to figure it all out.
Yeah, my having a try isn't at all meant as a prod in your direction. I know you're busy :) I'm going to give this a go, just modifying the ResultSetFromFile and working through the attributes, attempting to compute them during a pass (or 2) of the data. I'll let you know how I get on :) 👍
> Is there a need to wrap the csv.reader in a generator? I saw that in a video somewhere but it doesn't make sense to me... I'll play around some more...
Don't answer that :) Figured it out, playing around with reading in batches :)
I think it's doable in two passes :)
Should have something 'showable' (I'm basically just 'doodling' in a notebook right now) by the end of today :)
Great, yeah I think working through the various ResultsSet functions to reduce the number of passes on the data file will give the biggest performance boost. It's not as readable / as nicely separated, but much more efficient in this case.
So my attempt at this is here: branch 573. It's very raw, so not well documented, (not at all) tested etc. (sorry!)...
There are currently two commits (and this PR: https://github.com/Axelrod-Python/Axelrod/pull/671) on that branch ahead of master:
BigResultSet (VERY RAW).
Assuming this is any good (I didn't figure out how to profile memory properly, but it actually looks like it might be faster???) I reckon the readability could be made not too bad with some nice refactoring. It could possibly be worth replacing ResultSetFromFile with this (or hopefully a streamlined and even faster version of this), especially if I'm right and it's faster.
I'm signing off for the weekend (or at least Saturday) so if anyone wanted to play with it please do :)
If I'm reading below correctly this is quite good:
17 69.81641 MiB 0.00000 MiB filename = "demo.csv"
18 69.82031 MiB 0.00391 MiB print(filename)
19 # Demo strategies
20 # Turns: 200
21 # Repetitions: 100
22 79.21875 MiB 9.39844 MiB read(filename)
23 79.02734 MiB -0.19141 MiB big_read(filename)
24
25 79.02734 MiB 0.00000 MiB filename = "basic.csv"
26 79.02734 MiB 0.00000 MiB print(filename)
27 # Basic strategies
28 # Turns: 200
29 # Repetitions: 100
30 88.39844 MiB 9.37109 MiB read(filename)
31 88.49609 MiB 0.09766 MiB big_read(filename)
For example, using the basic strategies the normal read takes about 9.4 MB while the 'big_read' (using the BigResultSet) uses less than 0.1 MB. Just running some larger test cases and will refactor/test/improve the class tomorrow (I already have a couple of improvements lined up in my mind).
EDIT: FYI https://github.com/drvinceknight/Axelrod-Python-Memory-Profiling
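(If anyone wants a quick check without the memory_profiler dependency, the standard library's tracemalloc can report peak allocations; a rough sketch, with an illustrative helper name:)

```python
import tracemalloc


def peak_memory(func):
    """Return the peak bytes allocated by Python objects while
    running func()."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


big = peak_memory(lambda: list(range(100000)))  # builds a large list
small = peak_memory(lambda: None)               # allocates almost nothing
```

Note tracemalloc only sees Python-level allocations (and adds overhead of its own), so memory_profiler's process-level MiB numbers above are the better headline figure.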
So I've run the profiler on the following:
The output is below:
Line # Mem usage Increment Line Contents
================================================
19 # Demo strategies
20 # Turns: 200
21 # Repetitions: 100
22 27.80469 MiB 2.73047 MiB read(filename)
23 27.80469 MiB 0.00000 MiB big_read(filename)
27 # Basic strategies
28 # Turns: 200
29 # Repetitions: 100
30 31.79297 MiB 3.98828 MiB read(filename)
31 31.60938 MiB -0.18359 MiB big_read(filename)
35 # Ordinary strategies
36 # Turns: 50
37 # Repetitions: 5
38 42.25391 MiB 10.64453 MiB read(filename)
39 35.78516 MiB -6.46875 MiB big_read(filename)
43 # Ordinary strategies
44 # Turns: 50
45 # Repetitions: 10
46 52.45312 MiB 16.66797 MiB read(filename)
47 51.07422 MiB -1.37891 MiB big_read(filename)
51 # Ordinary strategies
52 # Turns: 50
53 # Repetitions: 20
54 72.91016 MiB 21.83594 MiB read(filename)
55 72.78516 MiB -0.12500 MiB big_read(filename)
Assuming I am reading that correctly, it looks like the memory consumption of the BigResultSet is pretty insignificant... You can find all that here: https://github.com/drvinceknight/Axelrod-Python-Memory-Profiling/blob/master/memory_profiler.log
I'm going to rerun it but inverting the order in which each is done (so run the BigResultSet first), just to make sure there's nothing weird happening.
I've started refactoring and adding tests. I'll open a PR once that's done. I suggest that after that we could look at some timing experiments and potentially replace the ResultSetFromFile with it (I'm not convinced the BigResultSet is significantly/noticeably slower, because it's really just got the one loop), but that can be another discussion...
Great! This is actually what I expect from using generators. Any sense about the speed?
Yeah you did call this! :)
I actually think it might be faster (basically just the one loop as opposed to 15 odd...) :) Going to refactor (will finish tomorrow I reckon) then I'll do some time experiments too :)
Have re-run in the opposite order (just to be sure). Results are the same (can be seen here: https://github.com/drvinceknight/Axelrod-Python-Memory-Profiling/blob/master/memory_profiler.log).
> I'll do some time experiments too
So this seems to be much faster:
10 loops, best of 3: 30.2 ms per loop
100 loops, best of 3: 6.73 ms per loop
1 loop, best of 3: 4.74 s per loop
1 loop, best of 3: 2min 57s per loop
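(For context, numbers in that form are what timeit-style 'best of 3' runs report; a rough stdlib sketch of that measurement, with illustrative names:)

```python
import timeit


def best_time_per_call(stmt, setup="pass", repeat=3, number=10):
    """Roughly mimic %timeit's 'best of 3' output: run the statement
    `number` times, `repeat` times over, and return the best time
    per call."""
    best = min(timeit.repeat(stmt, setup=setup, repeat=repeat,
                             number=number))
    return best / number


t = best_time_per_call("sum(range(100))", number=1000)
```

Taking the minimum over repeats filters out background noise from other processes, which matters when comparing two reads of the same file.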
There are some tweaks I still want to do and no doubt improvements that will come from the PR review.
Just FYI, have just pushed some more commits to the 573 branch. Amongst some refactoring and tests this also includes a much more precise progress bar, but also a suggestion for how to 'plug' the BigResultSet in to the tournament (we could just replace it, but this loses the ability to actually manipulate the interactions if we wanted to; there are probably a few ways of tweaking this so it might be easier to leave discussion to the PR).
My plan is:
Should be able to do that by end of tomorrow (famous last words).
For large tournaments, loading all the interactions into memory isn't feasible. @drvinceknight had the idea on #520 to have a special class for this case.