tctdbiternext3 does not return a list of columns (the doc is confusing about
that). It returns a TCMAP equivalent to {'': key}.update(tdb[key]).
As for performance, using tctdbiternext3 to iterate over the values is faster
than using something like: (tdb[key] for key in tdb).
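A minimal pure-Python sketch of the record shape described above, with plain dicts standing in for a real TDB and TCMAP. The name `iternext3_like` is hypothetical and only mimics the described behaviour; it does not call Tokyo Cabinet.

```python
def iternext3_like(table):
    """Yield one merged column map per record.

    `table` is a plain dict standing in for a TDB: primary key -> columns.
    """
    for key, columns in table.items():
        record = {'': key}       # primary key stored under the empty column name
        record.update(columns)   # then the record's own columns
        yield record


table = {'banana': {'colour': 'yellow', 'taste': 'sweet'}}
records = list(iternext3_like(table))
```

Each yielded map carries both the key and the columns, so a caller never needs a second lookup per key.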
Hmmm..., maybe it's time for iterkeys/itervalues/iteritems. These could return
faster iterators (Python2/3 compat would suffer though).
Ideas? Comments?
Original comment by lekma...@gmail.com
on 25 Jun 2010 at 10:11
The iter* methods would be great but I'm unsure about the 2/3 issue. At some
point we'll have to migrate all the code but when... maintaining two forks is
also a bad idea. I'd stick to 2.6..2.7 for some time (maybe until 3.x makes
it into Debian stable?). Or maybe even check the Python version in keys(),
values() and items() and e.g. return either self.iterkeys() or
list(self.iterkeys()). This would be a hack but a tolerable one I'd say...
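A rough sketch of that version check, using a toy class rather than the real tokyo.cabinet types. `CompatMapping` is a hypothetical name, and only keys() is shown; values() and items() would follow the same pattern.

```python
import sys


class CompatMapping(object):
    """Toy mapping showing the version check; not the tokyo.cabinet class."""

    def __init__(self, data):
        self._data = dict(data)

    def iterkeys(self):
        return iter(self._data)

    def keys(self):
        if sys.version_info[0] >= 3:
            # Python 3 callers expect a lazy iterable.
            return self.iterkeys()
        # Python 2 callers expect a list.
        return list(self.iterkeys())


m = CompatMapping({'apple': 1, 'banana': 2})
key_list = sorted(m.keys())
```

Code that only iterates over keys() behaves the same on both versions; code that indexes the result would only work on Python 2.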
Original comment by neithere
on 27 Jun 2010 at 1:25
So, I added iterkeys, itervalues, iteritems to TDB (and HDB), and 2 additional
methods to TDB: itervalueskeys (which should be the one you need if I
understand correctly) and itervaluesvals. These return iterators over TDB's
values' keys and values respectively (I'm not sure I'm being clear here).
I'd love it if you could test the current trunk. If you have no objections/bugs,
I'll get 0.6.1 out asap.
Original comment by lekma...@gmail.com
on 29 Jun 2010 at 3:28
Oh, and I'm not sure about the names, so if you have a better suggestion...
Original comment by lekma...@gmail.com
on 29 Jun 2010 at 3:33
Sorry for the delay. I think names like "iter_column_names" and
"iter_column_values" (or something in that vein) would be much less confusing
than "itervalueskeys" and "itervaluesvals" ;)
What I really need (on high level) is something like this:
>>> db['banana'] = {'colour': 'yellow', 'taste': 'sweet'}
>>> db['apple'] = {'colour': 'green', 'taste': 'kinda sweet'}
>>> db['dust'] = {'taste': 'awful'} # no colour
>>> db.query.values_for('colour')
['yellow', 'green']
In Tokyo Tyrant this can be achieved by adding "get\x00colour" (if I recall
correctly). I'm not sure if Tokyo Cabinet has something like that but I thought
that tctdbiternext3 would do that.
Do I understand correctly that the code below will give me what I described
above?
from itertools import izip

def values_for(q, col_name):
    "Returns iterator for all possible values of given column."
    for ks, vs in izip(q.itervalueskeys(), q.itervaluesvals()):
        if col_name in ks:
            yield vs[ks.index(col_name)]
(this doesn't ensure that the values are distinct)
This will work regardless of column order in records, right?
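A small sketch to check the column-order question with plain lists. It assumes, without confirmation from the library, that itervalueskeys() and itervaluesvals() each yield one sequence per record, with names and values in matching positions. It uses the built-in zip (Python 3) instead of izip.

```python
def values_for(keys_iter, vals_iter, col_name):
    """Yield the value of `col_name` for every record that has it."""
    for ks, vs in zip(keys_iter, vals_iter):
        if col_name in ks:
            # Look the column up by name, so its position does not matter.
            yield vs[ks.index(col_name)]


# Two records with the columns in a different order, one without 'colour'.
keys = [['colour', 'taste'], ['taste', 'colour'], ['taste']]
vals = [['yellow', 'sweet'], ['kinda sweet', 'green'], ['awful']]
colours = list(values_for(keys, vals, 'colour'))
distinct = set(colours)  # wrap in set() when distinct values are wanted
```

Under that assumption the lookup is independent of column order, and records lacking the column are skipped.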
Original comment by neithere
on 30 Jun 2010 at 7:19
Ahhhh...., I obviously misunderstood the issue :(
Ok, quick solution (I hope I got it right this time):
>>> import tokyo.cabinet as tc
>>> class MyTDB(tc.TDB):
...     def values_for(self, column):
...         result = set()
...         def process_result(key, value):
...             result.add(value[column])
...             return 0
...         q = self.query()
...         q.filter(column, tc.TDBQCSTRINC, '')
...         q.process(process_result)
...         return result
...
>>> tdb = MyTDB()
>>> tdb.open('colors.tct', tc.TDBOWRITER | tc.TDBOCREAT)
>>> tdb['banana'] = {'color': 'yellow', 'taste': 'sweet'}
>>> tdb['apple'] = {'color': 'green', 'taste': 'kinda sweet'}
>>> tdb['dust'] = {'taste': 'awful'}
>>> tdb['kiwi'] = {'color': '', 'taste': 'not bad'}
>>> print(tdb.values_for('color'))
set(['', 'green', 'yellow'])
I don't know if this is fast enough though.
In the meantime, I'm gonna try and implement iterators over _queries_ (this is
the part I missed, sorry), and see if there is a faster way to do what you
want. I could not find a reference to what you describe in the Tokyo Tyrant
documentation though (did you use 'tcrdbmisc'?).
thanks for the help
malek
Original comment by lekma...@gmail.com
on 1 Jul 2010 at 7:57
I think you used a combination of 'tcrdbqrysearchget' and 'tcrdbqryrescols', I
need to understand what those do...
Original comment by lekma...@gmail.com
on 1 Jul 2010 at 10:36
Your best bet for now would be something like:
>>> import tokyo.cabinet as tc
>>> class MyTDB(tc.TDB):
...     def values_for(self, column):
...         return set(value[column] for value in self.itervalues()
...                    if column in value)
...
>>> tdb = MyTDB()
>>> tdb.open('colors.tct', tc.TDBOWRITER | tc.TDBOCREAT)
>>> tdb['banana'] = {'color': 'yellow', 'taste': 'sweet'}
>>> tdb['apple'] = {'color': 'green', 'taste': 'kinda sweet'}
>>> tdb['dust'] = {'taste': 'awful'}
>>> tdb['kiwi'] = {'color': '', 'taste': 'not bad'}
>>> tdb['lime'] = {'color': 'green', 'taste': 'zingy'}
>>> print(tdb.values_for('color'))
set(['', 'green', 'yellow'])
If you want more details, check out the TDBQueryIter branch, build it and run
the attached file (be careful, it will create a big db file (250MB here)).
I'm not completely convinced queries need iterators (maybe only over keys)...
Original comment by lekma...@gmail.com
on 2 Jul 2010 at 1:35
Thanks a lot! Sorry for not being clear at the beginning, I should have
mentioned queries. And sorry for not helping with the code, I have very
little knowledge of C.
Concerning iterators: they are certainly needed for keys, but maybe not for
values. When developing Pyrant (for TT) we used that "get\0foo" because of a)
TT's getitem not being as cheap as in bare TC, and b) the significant overhead
of converting the whole value from the serialized string to a dictionary just
to get a single column or two. Obviously, for tokyo.cabinet only "b" makes
sense. Anyway, I think iterator + getitem will do.
P.S.: tried the file, realized how incredibly slow the queries are in TC on
large datasets... a simple count() on an empty query (no conditions) takes a
lot of time. Roughly the same results with python-tokyo and pyrant. Works fine
with thousands of records but not millions. By the way, Tyrant eats up to
~200MB during count() when queried via pyrant, and the Python process with
tokyo.cabinet (both trunk and TDBQueryIter) eats up to 500 MB (with the same
260MB file). In both cases the process takes around 20..40 seconds to finish
(presumably depending on the overall system load). Looks like something can be
optimized in TC itself. But I have absolutely no idea about how it even works.
I can't imagine how an iteration with i++ can take so much time and eat so much
memory. This operation does not even collect the keys nor does it do any
comparison. Moreover, adding conditions does not seem to affect the duration in
a measurable way (the queries seem to be faster with conditions than without
them but I just can't believe it). Playing with bnum and friends didn't work
for me. Hm...
Original comment by neithere
on 2 Jul 2010 at 10:06
> Thanks a lot! Sorry for not being clear at the beginning, I should
> have mentioned queries. And sorry for not helping with the code, I
> have very little knowledge of C.
>
> Concerning iterators: they are certainly needed for keys, but maybe
> not for values. When developing Pyrant (for TT) we used that
> "get\0foo" because of a) TT's getitem not being as cheap as in bare
> TC, and b) the significant overhead of converting the whole value
> from the serialized string to a dictionary just to get a single
> column or two. Obviously, for tokyo.cabinet only "b" makes sense.
> Anyway, I think iterator + getitem will do.
atm you already get an iterable (a tuple) as the result of
R/TDBQuery.search().
That said, if I find the time and a simple way to do it (simpler than
the current TDBQuery branch) I'll try and implement a keys iterator for
queries (this will have to wait for a 0.6.2 release). I don't want to
do the search again and again implicitly each time you want to iterate
over a query, so I need to keep track of changes to the query (sort,
filter,...). I'll also refactor the DB iterators for 0.6.2 (there's a
lot of duplicate code in here atm).
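One way to picture the change tracking mentioned above is a wrapper that caches the search result and drops it whenever the query is modified. This is only a sketch: `CachedQuery`, its `filter` signature and the `fake_search` callable are hypothetical stand-ins, not the tokyo.cabinet API.

```python
class CachedQuery(object):
    """Hypothetical wrapper: re-runs the search only after the query changed."""

    def __init__(self, search):
        self._search = search      # callable taking the list of conditions
        self._conditions = []
        self._result = None        # cached tuple of keys, None when stale

    def filter(self, column, op, expr):
        self._conditions.append((column, op, expr))
        self._result = None        # any change invalidates the cache

    def __iter__(self):
        if self._result is None:
            self._result = tuple(self._search(self._conditions))
        return iter(self._result)


calls = []

def fake_search(conditions):
    # Stand-in for the real search; records each invocation.
    calls.append(list(conditions))
    return ['apple', 'banana']

q = CachedQuery(fake_search)
first = list(q)
second = list(q)                    # served from the cache
q.filter('colour', 'eq', 'green')
third = list(q)                     # query changed, search runs again
```

Sorting or limiting would need the same invalidation step as filter() in this scheme.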
>
> P.S.: tried the file, realized how incredibly slow the queries are in
> TC on large datasets... a simple count() on an empty query (no
> conditions) takes a lot of time. Roughly the same results with
> python-tokyo and pyrant. Works fine with thousands of records but not
> millions. By the way, Tyrant eats up to ~200MB during count() when
> queried via pyrant, and the Python process with tokyo.cabinet
> (both trunk and TDBQueryIter) eats up to 500 MB (with the same 260MB
> file). In both cases the process takes around 20..40 seconds to
> finish (presumably depending on the overall system load). Looks like
> something can be optimized in TC itself. But I have absolutely no
> idea about how it even works. I can't imagine how an iteration with
> i++ can take so much time and eat so much memory. This operation does
> not even collect the keys nor does it do any comparison. Moreover,
> adding conditions does not seem to affect the duration in a
> measurable way (the queries seem to be faster with conditions than
> without them but I just can't believe it). Playing with bnum and
> friends didn't work for me. Hm...
>
afaict count() eats up nothing in terms of memory (search() does
though). Can you adjust the attached tests for pyrant, run them and
report? I first run test_tdb_big_query_count.py, then start a server
with: ttserver /tmp/big_tc_test.tct and then run
test_rtdb_big_query_count.py and test_pyrant_big_query_count.py
On my computer (amd64 with 4GB ram), I get 7 secs for an empty (as in no
conditions, nor sorting) query count() with both pyrant and
tokyo.tyrant, which is consistent with the 7 secs I get for an empty
query search() with tokyo.cabinet.
Memory wise, allocation seems to be relatively normal (well, you really
should have a look at another container for the result of
query.columns() in pyrant, this eats up 2GB when run against the
big_tc_test.tct database). Compared to their equivalent in Python, the
results of search() for both tokyo.cabinet and tokyo.tyrant seem to be
on par (keep in mind that in Tokyo Tyrant there's a whole lot of
socket reading overhead influencing memory usage (roughly twice the
memory is allocated)).
So, if you don't have objections (and unless you really, really need query
iterators before that), I'll release current trunk as 0.6.1 when I come
back from my vacation, in a week and a half.
Thanks for all your help,
malek
Original comment by lekma...@gmail.com
on 9 Jul 2010 at 9:59
Original issue reported on code.google.com by
lekma...@gmail.com
on 24 Jun 2010 at 3:28