brython-dev / brython

Brython (Browser Python) is an implementation of Python 3 running in the browser
BSD 3-Clause "New" or "Revised" License
6.38k stars 511 forks source link

The dictionary implementation is painfully slow #19

Closed Pfiver closed 9 years ago

Pfiver commented 10 years ago

The performance is really suboptimal. Looking at the code, it quickly becomes clear why: To find an object, the whole dictionary is searched in a loop (i.e. https://github.com/brython-dev/brython/blob/e1dc7b4575280dd7a00e3012890c7c2533d48e32/src/py_dict.js#L68). I expected to find some sort of hash map. Don't python objects have hash functions or the like? I don't know, but I guess it would be possible to find something. :-) Anyhow - I converted the hash maps into arrays to solve the performance issues and am very happy with Brython, it is a great project. To who it may concern: Thank you for all the hard work and making it available open source!

ghost commented 10 years ago

yikes, considering javascript has objects which can be used like associative arrays I'm wondering why it would be like this?

Pfiver commented 10 years ago

This is documented in the source code. Javascript object property keys are limited to strings while Python dictionary keys can be arbitrary objects. But I feel the current big array could at least be subdivided into "buckets" and the Python objects' str() method's output or a hash or fixed-length prefix of that be used as keys to address the buckets which would be stored in JS objects. But I'd say let's look at the Python source a bit more and see how dictionaries are actually implemented there. There is no hash_code() method or the like, as e.g. for Java object afaik. Anyway ... it's al documented pretty well (http://en.wikipedia.org/wiki/Hash_table) but would take some work... :-)

....

Looks like the "hash()" builtin is used in CPython. https://hg.python.org/cpython/file/22a46f05ce23/Objects/dictobject.c . And Jason Sundram knows that http://stackoverflow.com/a/2070320 it is implemented differently (and natively, in C) for different object types. Assuming that finding a good hash function is at the heart of this problem, this might be the reason why it is not solved yet.

ghost commented 10 years ago

__hash__ isn't even used in Brython though (at least not for its intended purpose!). And yes, you can use objects as keys in javascript

I went through the code and changed it all references from $keys and $values to $data and appropriately changed the looping code. Initial tests are promising although I have a bit of debugging to do still.

ghost commented 10 years ago

Python's dict source code is very low level and would be slow to mimic with javascript.

I get what you're saying now though, you can use objects and what not but it has the risk of collisions, such as d[4] and d['4'] both pointing to the same thing. I forgot about that aspect of javascript.

The issue with any solution is that you need to be able to extract an object from the key, and any changes to the underlying object need to be reflected in the key when the key is retrieved. In other words, you can't just simply rely on a string representation of an object to use as a key.

ghost commented 10 years ago

After having some time to sleep on it, I think the best implementation is similar to what we have now, except it emulates the auto-growing hash of CPython. You'd use the hash function to figure out which bucket an item falls into, and then probably transform it from a single item to a list of items if multiple items fall into the same bucket. You'd have an array of keys and an array of items just like now. With CPython, the minimum size is 8 and the array grows by a factor of 2 when the size >= 75% of the array. I didn't get a chance to see how it shrinks, but I'd imagine that it would shrink when you go down a factor of two. I.E. if the array size just grew to 64 (with 24 items), then it would grow again at 48, and shrink at 12

Maybe this was the original plan for dicts and it was scrapped for some reason? Are there any objects that hash doesn't work for that it should?

PierreQuentel commented 10 years ago

You can't use Javascript arrays like Python dicts : the Javascript below produces unexpected results for us Pythonistas

a={'x':1} 
b={'y':2}
t = {}
t[a]='a'
t[b]='b'
console.log(t[a])
for(attr in t){console.log(attr)}

The current implementation in Brython is slow, and it's not even compliant with CPython, because this code should raise an exception :

t = {}
a=[1]
t[a] = 7
print(t)

A fast and compliant implementation should use the hash value of the objects that are hashable in CPython (https://docs.python.org/3/glossary.html#term-hashable). Most built-in types already have a hash method, but not instances of built-in classes

If you can work on this it would be very helpful

ghost commented 10 years ago

on it.

ghost commented 10 years ago

So far so good

a_list = []
a_dict = {}
for i in range(1000):
    val = "Hello%i" % i
    a_list.append(val)
    a_dict[val] = 1

import time
start = time.time()
for val in a_list:
    idx = a_list.index(val)
    tmp = a_list[idx]
print("%.02f" % (time.time() - start))

start = time.time()
for val in a_list:
    tmp = a_dict[val]
print("%.02f" % (time.time() - start))

on my new implementation is giving:

2.87
0.03

And yes, you can do this:

a = {}
a['5'] = 10
a[5] = 11
print(a['5'])
print(a[5])
10
11

I implemented traversals as an array scan that basically checks for each value. I believe this is how it works in python as well, as in testing traversal was slower than insertion. I might be able to speed this number up a bit as well.

start = time.time()
for itm in a_list:
    pass
print("%.02f" % (time.time() - start))

start = time.time()
for itm in a_dict.items():
    pass
print("%.02f" % (time.time() - start))
k,v = itm
print("k: %s, v: %s" % (k,v))
0.00
0.19

Still need to get all the built-in methods updated and then update the code that unfortunately relied on the previous implementation of dicts in order to work. Might be a few days but I think having this ready before end of this week is realistic.

ghost commented 10 years ago

To give an update: I unfortunately haven't found time to finish ironing this out, but it sounds like it's best that I hold off on this anyways until we see the rewrite which changes how namespacing works.

Here's the current state of my changes if anyone's interested:

https://github.com/JamesHutchison/brython/blob/dict_refactor/src/py_dict.js

ghost commented 10 years ago

Pull request:

https://github.com/brython-dev/brython/pull/41

joaoventura commented 9 years ago

Hi, just saw this Pull Request now. I am working in another solution for his problem which is much much simpler and I would expected to be much faster.

Instead of managing the dictionary ourselves, we should rely absolutely in javascript dicts which are JIT'ed and are capable of performing 300,000,000 Ops/sec on my computer (Firefox, OSX 10.9). We can use hash() or the hash functions to generate a hash to index an object in a JS dict, and to avoid collisions, each entry on the JS dict would be an array (obj or key, value).

So, it is quite similar to this approach by James Hutchison but we don't need to manage the JS dict ourselves. Brython's dict would be just a small wrapper to a JS dict (and just because of hash translations), and that would give us near Javascript performance, at least theoretically.

ghost commented 9 years ago

I considered this approach but for reasons I can't remember went with emulating the CPython implementation.

joaoventura commented 9 years ago

I've tested your approach against a basic version of mine (as in this commit: https://github.com/joaoventura/brython/commit/c0a067b10100cceadb174a52621abecb1c06b1d6) using:

a = {1:1, 2:2, 3:3, 4:4, 5:5, 6:6, 7:7, 8:8, 9:9, 0:0}
for _ in range(100000):
    a[8]

and the results were:

Original py_dict: 4.94s

Py_dict "James": 1.40s

Py_dict "JS DIct": 1.15s

The differences between what I have done and your version are quite negligible, but both are 400%-500% faster than the current approach. I still haven't implemented the iterators, which is what the rest of my code really depends on.

ghost commented 9 years ago

Not really a good test without doing a full implementation that also tests insert, delete, and thrashing performance (a ton of inserts, deletes, and look-ups to a single dict). Could be either implementation's performance difference is a wash when compared against each other using typical data.

ghost commented 9 years ago

Great work guys! What is the status on this? What can I do to help?

ghost commented 9 years ago

Getting it merged into the main branch would be perfect :) Sounds like @joaoventura hasn't gotten around to an alternative implementation so the one I have will have to do for now. Thanks for picking that up

ghost commented 9 years ago

One of the biggest challenges is to merge it into brython-dev/brython. It seems that you forked from PierreQuentel/brython instead which is several hundred commits behind brython-dev/brython. No worries though. I will work with Pierre to see if he can remove (or delete) PierreQuentel/brython. I'll also work on migrating the changes so that all our tests execute successfully.

Billy

ghost commented 9 years ago

Really, the only code that matters is py_dict.js. The rest of the changes to other files were to fix code that were dependent on the previous dict implementation. Might be easier to just redo those from scratch at this point, since a lot of them might be non-mergable or non-applicable. Just gotta grep out references to $keys and $values.

ghost commented 9 years ago

James,

You are mostly correct with your suggestion of finding and modifying the references to $keys and $values. I have most/all of those changed, but to pass all our tests, is a little more complicated. For example, since you use hash to do key lookups (Just to get this out there, I believe your implementation is the correct way of doing this), we have some data types (such as complex) that didn't calculate hashes correctly, etc. Its just about finding these small issues and getting them fixed. I've also made a few adjustments to your py_dict.js file (but not too many, or maybe they were issues I introduced while bugging). Right now, I'm down to 4 test files that do not work. When I started, it was more like 15 or so that had some type of issue.

I have not run any major tests, but so far, I do not see a speed up. This could be because, most dicts I'm using for tests do not contain a large amount of items, so the overhead cost maybe larger than Pierre's original dict implementation. We hope when this is finished, to have a more pluggable system so that people can develop a new dict module and just plug it in and do some performance checks. . Billy

On Tue, Dec 30, 2014 at 12:34 PM, JamesHutchison notifications@github.com wrote:

Really, the only code that matters is py_dict.js. The rest of the changes to other files were to fix code that were dependent on the previous dict implementation. Might be easier to just redo those from scratch at this point, since a lot of them might be non-mergable or non-applicable. Just gotta grep out references to $keys and $values.

— Reply to this email directly or view it on GitHub https://github.com/brython-dev/brython/issues/19#issuecomment-68383006.

ghost commented 9 years ago

Speed-up should be night and day when doing random look-ups over 1k+ items. There might be some minor performance gains by speed-optimizing the iteration process for lesser used functions, but I'd rather tackle that later after the dict implementation is bug free.

ghost commented 9 years ago

I'm going to close this issue since the new implementation has been put into place. Feel free to reopen this issue (or create a new one), if a related issue arises.