Closed: marcharper closed this issue 8 years ago.
👍 And perhaps with BigResultSet there would be a way to speed things up by parallelising them? (No idea: just a thought.)
Hey @marcharper, have you played around with HDF5? http://docs.h5py.org/en/latest/index.html
Looks like this would involve storing the file in a different format (perhaps we could do that with a big switch or something...) and might have other implications. Just a thought.
My next idea was to load the data using generators, computing wins, score differences, and such to reduce the memory footprint, possibly going over the file more than once.
I'm open to other options of course. I previously tried a somewhat similar approach with shelve (trying to use the standard library if possible). It's really just about the database build time: if an HDF5 file can be constructed relatively quickly then it makes sense, otherwise simply making multiple passes over a text file is likely faster.
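For reference, the shelve approach mentioned above could look roughly like this; the key/value layout here is purely illustrative (a pair of player indices mapped to a list of interactions), not what was actually tried:

```python
import os
import shelve
import tempfile

# Illustrative layout: key on the player index pair so later passes can
# look up a single pair without re-reading the whole interactions file.
db_path = os.path.join(tempfile.mkdtemp(), "interactions")

with shelve.open(db_path) as db:
    # shelve keys must be strings, so serialise the index pair
    db[str((0, 1))] = [[("C", "D"), ("D", "C")]]

with shelve.open(db_path) as db:
    stored = db[str((0, 1))]
```

The build cost is exactly the concern raised above: every write goes through pickle and the underlying dbm, so constructing the database can dominate the total time.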
I think the idea of generators sounds better to me. Saw this on a thread on reddit so thought it was worth mentioning :) Perhaps something to fall back on if generators don't do the trick :)
Definitely something to consider when we build out an API that saves large numbers of results over time.
FYI, I've had a little go at this, still very much a work in progress and mainly just playing around.
The main input to the ResultsSet is a dictionary mapping tuples of player index pairs to lists of interactions. So below I've got a class that simply mimics that dictionary using generators, never really (I think) holding anything in memory. Here it is:
```python
import csv


class InteractionsFromFile:
    """A class that behaves like a dictionary but reads from file."""

    def __init__(self, filename):
        self.file = filename

    def __getitem__(self, index_pair):
        """Gives the dictionary-like behaviour."""
        interactions = []
        with open(self.file, 'r') as file:
            reader = csv.reader(file)
            for row in reader:
                if index_pair == (int(row[0]), int(row[1])):
                    interactions.append(list(zip(row[4], row[5])))
        return interactions

    def __iter__(self):
        return self.keys()

    def keys(self):
        """Mimics dict.keys()"""
        return self.key_generator()

    def key_generator(self):
        """A generator for the keys.

        I use a cache to keep track, as you go over the file,
        of whether it's a new key or not."""
        self.keys_cache = set()
        with open(self.file, 'r') as file:
            reader = csv.reader(file)
            for row in reader:
                key = (int(row[0]), int(row[1]))
                if key not in self.keys_cache:
                    self.keys_cache.add(key)
                    yield key

    def values(self):
        """Mimics dict.values()"""
        return self.value_generator()

    def value_generator(self):
        for key in self.keys():
            yield self[key]

    def items(self):
        """Mimics dict.items()"""
        return self.item_generator()

    def item_generator(self):
        for key in self.keys():
            yield (key, self[key])
```
This means you can simply plug that in to the ResultSet without changing any code there:
>>> rs = axl.ResultSetFromFile("basic.csv") # The usual results set
>>> irs = InteractionsFromFile("basic.csv")
>>> sorted(rs.interactions.keys()) == sorted(list(irs.keys())) # Checking that `.keys()` gives the same
>>> sorted(rs.interactions.values()) == sorted(list(irs.values())) # Checking `.values()`
>>> sorted(rs.interactions.items()) == sorted(list(irs.items())) # Checking `.items()`
Those last three checks ensure we have the behaviour that the ResultsSet expects.
You can then create a ResultsSet like this:
>>> brs = axl.ResultSet(rs.players, irs)
(There I'm 'borrowing' the players list from rs.players, but it's easy enough to get that somewhere else / calculate it again.)
This is SUPER SLOW at the moment and I haven't checked that memory is definitely not being used, although if the heat of my laptop on my lap while writing to and from disk is anything to go by, I think it's not...?
Could be that we can add some optimisations in the ResultsSet and/or change how the ResultSet does stuff. I also think that if we can assume the data file is sorted, or at least that interactions between players aren't spread around (which I'm currently not assuming), then it could be sped up (every __getitem__ reads every line of the data file).
Anyway, I'm going to keep playing around with this but just thought I'd let you know that I was giving it a try. :) 👍 (This could be the completely wrong way around this.)
> Could be that we can add some optimisations in the ResultsSet and/or change how the ResultSet does stuff. I also think that if we can assume the data file is sorted or at least that interactions between players aren't spread around (which I'm currently not) then it could be sped up (every __getitem__ reads every line of the data file).
That's probably a good idea.
Some thoughts:

- It's going to be slower than memory of course. The only solution is to throw RAM at it, and for our purposes that's not a reasonable assumption.
- This implementation of __getitem__ is going to be really slow -- it iterates over the entire data file each time.
- Using generators typically keeps memory usage low, that's definitely the right approach.
- Try buffered reads -- Python should do this for you somewhat but I've had success in the past manually reading in a set chunk of data at a time. In this case we need to check how many turns are in each match and go for something specific, like 1m actions (so 1e6 / (2 * turns) rows of interactions).
- If you integrate more deeply into ResultsSet, you could precompute a lot of things on a single pass, and only hit disk a second time if you actually need the full set of interactions again. In other words, use one function to compute wins, scores, etc. on a single pass through interactions (compute the quantities on batches of interactions).
- Parallelize? Disk read is probably the bottleneck though. Solid state drives are much faster (if you don't have one already...)

One more thought -- we're computing matches in batches now, we could compute (all?) the desired quantities in memory before writing to disk. Maybe write all this to a second data file, keeping interactions in their current file. Shelves might be a good solution here (CSV is probably also fine, or a pandas dataframe if you can tolerate the dependency). Then we'd enhance BigResultsSet to return the precomputed wins and such (stored in memory perhaps) and only hit the disk if the user asks for all the interactions.
You can probably see why I haven't written this yet. To do it right will probably take a significant amount of refactoring, and in any case, several hours of uninterrupted time to figure it all out.
Just heading off for the night so I'll read through your comment in detail tomorrow (thank you for it, I am not at all dismissing it: in fact just the opposite!).
FWIW it doesn't look like it's as slow as I thought; I had in fact just stalled because I hadn't implemented an __iter__ method. (Have edited the class above.)
I'll do some analysis of the memory and time consumption of what I've got. Perhaps if the memory consumption is indeed low and the time isn't ridiculously slower this could be worth optimising... Anyway, will read through your comment tomorrow and get back to you.
👍
I'm guessing you could cut the run time of this implementation by a third or so by combining the functions in ResultSet to only make one pass over the interactions file. Disk read is the bottleneck almost surely.
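A rough sketch of what a single-pass version could look like, computing scores and wins together; the function name is illustrative and the payoff table is the standard prisoner's dilemma one rather than whatever the real ResultSet reads from the game object:

```python
import csv
import os
import tempfile
from collections import defaultdict

# Standard PD payoffs (R=3, P=1, S=0, T=5); illustrative only.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}


def single_pass_summaries(filename):
    """Compute total scores and win counts in one pass over the file,
    instead of one pass per quantity."""
    scores, wins = defaultdict(int), defaultdict(int)
    with open(filename, "r") as f:
        for row in csv.reader(f):
            p1, p2 = int(row[0]), int(row[1])
            s1 = s2 = 0
            for pair in zip(row[4], row[5]):
                u1, u2 = PAYOFF[pair]
                s1, s2 = s1 + u1, s2 + u2
            scores[p1] += s1
            scores[p2] += s2
            if s1 > s2:
                wins[p1] += 1
            elif s2 > s1:
                wins[p2] += 1
    return scores, wins


# Demo: one match where player 0 always defects against a cooperator.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerow(["0", "1", "A", "B", "DD", "CC"])

scores, wins = single_pass_summaries(path)
```

Each row is touched exactly once, so disk read cost is paid a single time regardless of how many summary quantities get added.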
Some thoughts:
> It's going to be slower than memory of course. The only solution is to throw RAM at it, and for our purposes that's not a reasonable assumption.
Yup :)
> This implementation of __getitem__ is going to be really slow -- it iterates over the entire data file each time

Yeah, this is brutally terrible. Have thought more about it overnight and this is just stupid.
> Using generators typically keeps memory usage low, that's definitely the right approach
Is there a need to wrap the csv.reader in a generator? I saw that in a video somewhere but it doesn't make sense to me... I'll play around some more...
> Try buffered reads -- Python should do this for you somewhat but I've had success in the past manually reading in a set chunk of data at a time. In this case we need to check how many turns are in each match and go for something specific, like 1m actions (so 1e6 / (2 * turns) rows of interactions)
Cool, batch size could be a tuneable parameter also perhaps?
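A minimal sketch of batched reads along those lines, with the batch size as a parameter; names are illustrative:

```python
import csv
import os
import tempfile
from itertools import islice


def read_in_batches(filename, batch_size=1000):
    """Yield lists of at most batch_size rows, so only one batch is
    ever held in memory at a time."""
    with open(filename, "r") as f:
        reader = csv.reader(f)
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            yield batch


# Demo: five rows read in batches of two.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    for i in range(5):
        writer.writerow([str(i), "1", "A", "B", "CC", "CC"])

batch_sizes = [len(batch) for batch in read_in_batches(path, batch_size=2)]
```

The 1e6 / (2 * turns) suggestion above would just become the default batch_size, tuned per tournament.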
> If you integrate more deeply into ResultsSet, you could precompute a lot of things on a single pass, and only hit disk a second time if you actually need the full set of interactions again. In other words, use one function to compute wins, scores, etc. on a single pass through interactions (compute the quantities on batches of interactions)
I think this is ultimately what needs to happen. The fact that the current results set goes through interactions multiple times made sense from a readability point of view but also because they're dictionaries. Certainly doesn't make sense the way I've implemented it here (loop through for every single tuple lookup... lol this really is terrible).
I think it shouldn't be too hard (famous last words) to get it to do this over two or three passes (possibly even just the one...). I'll have a go at this :)
> Parallelize? Disk read is probably the bottleneck though. Solid state drives are much faster (if you don't have one already...)
Yeah this is to/from ssd. I think there are possibly some things that can be done in parallel. Could be helpful if I drew a map of all the results perhaps (http://axelrod.readthedocs.io/en/latest/tutorials/getting_started/visualising_results.html) to see what needs what... 💭
> One more thought -- we're computing matches in batches now, we could compute (all?) the desired quantities in memory before writing to disk. Maybe write all this to a second data file, keeping interactions in their current file. Shelves might be a good solution here (CSV is probably also fine, or a pandas dataframe if you can tolerate the dependency). Then we'd enhance BigResultsSet to return the precomputed wins and such (stored in memory perhaps) and only hit the disk if the user asks for all the interactions.
I'm hesitant about computing the results during the tournament run:
I think I'm just saying that I'd like to avoid this if possible but that's my initial 'slept on it' thinking... There might be a way to do it that doesn't mess with my worries about 1. and 2.
> You can probably see why I haven't written this yet. To do it right will probably take a significant amount of refactoring, and in any case, several hours of uninterrupted time to figure it all out.
Yeah, my having a try isn't at all meant as a prod in your direction. I know you're busy :) I'm going to give this a go, just modifying the ResultSetFromFile and working through the attributes, attempting to compute them during a pass (or 2) of the data. I'll let you know how I get on :) 👍
> Is there a need to wrap the csv.reader in a generator? I saw that in a video somewhere but it doesn't make sense to me... I'll play around some more...
Don't answer that :) Figured it out, playing around with reading in batches :)
I think it's doable in two passes :)
Should have something 'showable' (I'm basically just 'doodling' in a notebook right now) by the end of today :)
Great, yeah I think working through the various ResultsSet functions to reduce the number of passes on the data file will give the biggest performance boost. It's not as readable / as nicely separated, but much more efficient in this case.
So my attempt at this is here: branch 573. It's very raw, so not well documented, (not at all) tested etc. (sorry!)...
There are currently two commits (and this PR: https://github.com/Axelrod-Python/Axelrod/pull/671) on that branch ahead of master:
BigResultSet (VERY RAW).
Assuming this is any good (I didn't figure out how to profile memory properly, but it actually looks like it might be faster???) I reckon the readability could be made not too bad with some nice refactoring. It could possibly be worth replacing ResultSetFromFile with this (or hopefully a streamlined and even faster version of this), especially if I'm right and it's faster.
I'm signing off for the weekend (or at least Saturday) so if anyone wanted to play with it please do :)
If I'm reading below correctly this is quite good:
17 69.81641 MiB 0.00000 MiB filename = "demo.csv"
18 69.82031 MiB 0.00391 MiB print(filename)
19 # Demo strategies
20 # Turns: 200
21 # Repetitions: 100
22 79.21875 MiB 9.39844 MiB read(filename)
23 79.02734 MiB -0.19141 MiB big_read(filename)
24
25 79.02734 MiB 0.00000 MiB filename = "basic.csv"
26 79.02734 MiB 0.00000 MiB print(filename)
27 # Basic strategies
28 # Turns: 200
29 # Repetitions: 100
30 88.39844 MiB 9.37109 MiB read(filename)
31 88.49609 MiB 0.09766 MiB big_read(filename)
For example, using the basic strategies the normal read takes about 9.4 MB while the 'big_read' (using the BigResultSet) uses less than 0.1 MB. Just running some larger test cases and will refactor/test/improve the class tomorrow (I already have a couple of improvements lined up in my mind).
EDIT: FYI https://github.com/drvinceknight/Axelrod-Python-Memory-Profiling
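(If anyone wants a quick check without the memory_profiler dependency, the standard library's tracemalloc can report peak allocations; a rough sketch, with an illustrative helper name:)

```python
import tracemalloc


def peak_memory(func):
    """Return the peak bytes allocated by Python objects while
    running func()."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


big = peak_memory(lambda: list(range(100000)))  # builds a large list
small = peak_memory(lambda: None)               # allocates almost nothing
```

Note tracemalloc only sees Python-level allocations (and adds overhead of its own), so memory_profiler's process-level MiB numbers above are the better headline figure.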
So I've run the profiler on the following:
The output is below:
Line # Mem usage Increment Line Contents
================================================
19 # Demo strategies
20 # Turns: 200
21 # Repetitions: 100
22 27.80469 MiB 2.73047 MiB read(filename)
23 27.80469 MiB 0.00000 MiB big_read(filename)
27 # Basic strategies
28 # Turns: 200
29 # Repetitions: 100
30 31.79297 MiB 3.98828 MiB read(filename)
31 31.60938 MiB -0.18359 MiB big_read(filename)
35 # Ordinary strategies
36 # Turns: 50
37 # Repetitions: 5
38 42.25391 MiB 10.64453 MiB read(filename)
39 35.78516 MiB -6.46875 MiB big_read(filename)
43 # Ordinary strategies
44 # Turns: 50
45 # Repetitions: 10
46 52.45312 MiB 16.66797 MiB read(filename)
47 51.07422 MiB -1.37891 MiB big_read(filename)
51 # Ordinary strategies
52 # Turns: 50
53 # Repetitions: 20
54 72.91016 MiB 21.83594 MiB read(filename)
55 72.78516 MiB -0.12500 MiB big_read(filename)
Assuming I am reading that correctly, it looks like the memory consumption of the BigResultSet is pretty insignificant... You can find all that here: https://github.com/drvinceknight/Axelrod-Python-Memory-Profiling/blob/master/memory_profiler.log
I'm going to rerun it but inverting the order in which each is done (so run the BigResultSet first), just to make sure there's nothing weird happening.
I've started refactoring and adding tests. I'll open a PR once that's done. I suggest that after that we could look at some timing experiments and potentially replace the ResultSetFromFile with it (I'm not convinced the BigResultSet is significantly/noticeably slower, because it's really just got the one loop), but that can be another discussion...
Great! This is actually what I expect from using generators. Any sense about the speed?
Yeah you did call this! :)
I actually think it might be faster (basically just the one loop as opposed to 15 odd...) :) Going to refactor (will finish tomorrow I reckon) then I'll do some time experiments too :)
Have re-run in the opposite order (just to be sure). Results are the same (can be seen here: https://github.com/drvinceknight/Axelrod-Python-Memory-Profiling/blob/master/memory_profiler.log).
> I'll do some time experiments too
So this seems to be much faster:
10 loops, best of 3: 30.2 ms per loop
100 loops, best of 3: 6.73 ms per loop
1 loop, best of 3: 4.74 s per loop
1 loop, best of 3: 2min 57s per loop
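(For context, numbers in that form are what timeit-style 'best of 3' runs report; a rough stdlib sketch of that measurement, with illustrative names:)

```python
import timeit


def best_time_per_call(stmt, setup="pass", repeat=3, number=10):
    """Roughly mimic %timeit's 'best of 3' output: run the statement
    `number` times, `repeat` times over, and return the best time
    per call."""
    best = min(timeit.repeat(stmt, setup=setup, repeat=repeat,
                             number=number))
    return best / number


t = best_time_per_call("sum(range(100))", number=1000)
```

Taking the minimum over repeats filters out background noise from other processes, which matters when comparing two reads of the same file.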
There are some tweaks I still want to do and no doubt improvements that will come from the PR review.
Just FYI, have just pushed some more commits to the 573 branch. Amongst some refactoring and tests this also includes a much more precise progress bar, but also a suggestion for how to 'plug' the BigResultSet in to the tournament (we could just replace it, but this loses the ability to actually manipulate the interactions if we wanted to; there are probably a few ways of tweaking this so it might be easier to leave discussion to the PR).
My plan is:
Should be able to do that by end of tomorrow (famous last words).
For large tournaments, loading all the interactions into memory isn't feasible. @drvinceknight had the idea on #520 to have a special class for this case.