Columns as search results

GoogleCodeExporter commented 9 years ago

Reported by neithere, Jun 18 (6 days ago)

First off, thanks for this package, it seems to be the most complete and 
reliable implementation among all I've seen so far. It supports all important 
functionality, works correctly(!), is well-documented, has tests and is sort of 
actively developed. So I'm switching to it as the TC backend in PyModels. Well, 
anyway.

The table database API includes `tctdbiternext3` which can be used to retrieve 
columns instead of primary keys. I may be missing something, as I have only worked 
with the Tyrant API (for the Pyrant project). Fetching columns is useful to get a list 
of possible (existing) values. Currently we have to iterate the keys, fetch 
records one by one and extract the values.
As far as I understand, one cannot use the columns mode with 
tokyocabinet-python. Are you planning to add this?

Andy

Original issue reported on code.google.com by lekma...@gmail.com on 24 Jun 2010 at 3:28

GoogleCodeExporter commented 9 years ago

tctdbiternext3 does not return a list of columns (the doc is confusing about 
that). It returns a TCMAP equivalent to tdb[key] with one extra entry, {'': key}, 
i.e. the record's columns plus the primary key stored under the empty string.

As for performance, using tctdbiternext3 to iterate over the values is faster 
than using something like: (tdb[key] for key in tdb).
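
To make the shape of that map concrete, here is a minimal sketch that uses a plain dict as a stand-in for the table database. It makes no tokyo.cabinet calls, and `iter_records` is a hypothetical name used only for illustration:

```python
# Plain-dict stand-in for a table database: primary key -> columns.
tdb = {
    'banana': {'colour': 'yellow', 'taste': 'sweet'},
    'dust': {'taste': 'awful'},
}

def iter_records(db):
    """Yield one dict per record: its columns plus '' -> primary key."""
    for key in db:
        record = {'': key}
        record.update(db[key])  # update() returns None, so build in two steps
        yield record

# Sort by primary key only to get a stable order for inspection.
records = sorted(iter_records(tdb), key=lambda r: r[''])
```

Each yielded dict carries the primary key under the empty-string column, so a consumer never needs a second lookup by key.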

Hmmm..., maybe it's time for iterkeys/itervalues/iteritems. These could return 
faster iterators (Python2/3 compat would suffer though).

Ideas? Comments?

Original comment by lekma...@gmail.com on 25 Jun 2010 at 10:11

Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 9 years ago

The iter*s methods would be great but I'm unsure about the 2/3 issue. At some 
point we'll have to migrate all the code but when... maintaining two forks is 
also a bad idea. I'd stick to 2.6..2.7 for some time (maybe until 3.x makes it 
into Debian stable?). Or maybe even check the Python version in keys(), 
values() and items() and e.g. return either self.iterkeys() or 
list(self.iterkeys()). This would be a hack but a tolerable one I'd say...
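
A rough sketch of that version check, on a toy mapping rather than the real TDB class (`CompatMapping` is a made-up name, and the real methods are implemented in C, so this only shows the idea):

```python
import sys

class CompatMapping(object):
    """Toy mapping whose keys() is lazy on Python 3 and a list on Python 2."""

    def __init__(self, data):
        self._data = dict(data)

    def iterkeys(self):
        # Always lazy, whatever the interpreter version.
        return iter(self._data)

    def keys(self):
        # The "tolerable hack": pick the return type from the running version.
        if sys.version_info[0] >= 3:
            return self.iterkeys()
        return list(self.iterkeys())

db = CompatMapping({'apple': 1, 'banana': 2})
all_keys = sorted(db.keys())
```

Note that a Python 3 dict returns a view from keys(), not a plain iterator, so this is only an approximation of the built-in behaviour.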

Original comment by neithere on 27 Jun 2010 at 1:25

GoogleCodeExporter commented 9 years ago

So, I added iterkeys, itervalues, iteritems to TDB (and HDB), and 2 additional 
methods to TDB: itervalueskeys (which should be the one you need if I 
understand correctly) and itervaluesvals. These return iterators over TDB's 
values' keys and values respectively (I'm not sure I'm being clear here).
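
In case it helps, here is what I mean, sketched on a plain dict instead of a real TDB. The function names are hypothetical, and yielding one tuple per record is an assumption about the shape of the results, not a description of the trunk code:

```python
# Plain-dict stand-in: primary key -> columns.
records = {
    'banana': {'colour': 'yellow', 'taste': 'sweet'},
    'dust': {'taste': 'awful'},
}

def iter_column_names(db):
    """For each record, yield a tuple of its column names."""
    for columns in db.values():
        yield tuple(columns.keys())

def iter_column_values(db):
    """For each record, yield a tuple of its column values."""
    for columns in db.values():
        yield tuple(columns.values())

# Sorted only to make the result independent of dict ordering.
names = sorted(sorted(t) for t in iter_column_names(records))
values = sorted(sorted(t) for t in iter_column_values(records))
```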

I'd love it if you could test the current trunk. If you have no objections/bugs, 
I'll get 0.6.1 out asap.

Original comment by lekma...@gmail.com on 29 Jun 2010 at 3:28

Changed state: Started

GoogleCodeExporter commented 9 years ago

Oh, and I'm not sure about the names, so if you have a better suggestion...

Original comment by lekma...@gmail.com on 29 Jun 2010 at 3:33

GoogleCodeExporter commented 9 years ago

Sorry for the delay. I think names like "iter_column_names" and 
"iter_column_values" (or something in that vein) would be much less confusing 
than "itervalueskeys" and "itervaluesvals" ;)

What I really need (on high level) is something like this:

  >>> db['banana'] = {'colour': 'yellow', 'taste': 'sweet'}
  >>> db['apple'] = {'colour': 'green', 'taste': 'kinda sweet'}
  >>> db['dust'] = {'taste': 'awful'}  # no colour
  >>> db.query.values_for('colour')
  ['yellow', 'green']

In Tokyo Tyrant this can be achieved by adding "get\x00colour" (if I recall 
correctly). I'm not sure if Tokyo Cabinet has something like that but I thought 
that tctdbiternext3 would do that.

Do I understand correctly that the code below will give me what I described 
above?

from itertools import izip  # izip lives in itertools, not functools (Python 2)

def values_for(q, col_name):
    "Returns iterator for all possible values of given column."
    for ks, vs in izip(q.itervalueskeys(), q.itervaluesvals()):
        if col_name in ks:
            # look up the column's position by its name
            yield vs[ks.index(col_name)]

(this doesn't ensure that the values are distinct)
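
If distinct values are needed, a set of already-seen values is enough. A minimal sketch on plain dicts (one dict of columns per record; `distinct_values_for` is a hypothetical helper, not part of any library):

```python
def distinct_values_for(records, col_name):
    """Yield each distinct value of col_name, in first-seen order."""
    seen = set()
    for columns in records:
        if col_name in columns:      # skip records lacking the column
            value = columns[col_name]
            if value not in seen:
                seen.add(value)
                yield value

rows = [
    {'colour': 'yellow', 'taste': 'sweet'},
    {'colour': 'green', 'taste': 'kinda sweet'},
    {'taste': 'awful'},              # no colour
    {'colour': 'green', 'taste': 'zingy'},
]
colours = list(distinct_values_for(rows, 'colour'))
```

Because the lookup goes through the column name, the order of columns within a record does not matter here.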

This will work regardless of column order in records, right?

Original comment by neithere on 30 Jun 2010 at 7:19

GoogleCodeExporter commented 9 years ago

Ahhhh...., I obviously misunderstood the issue :(

Ok, quick solution (I hope I got it right this time):
>>> import tokyo.cabinet as tc
>>> class MyTDB(tc.TDB):
...     def values_for(self, column):
...         result = set()
...         def process_result(key, value):
...             result.add(value[column])
...             return 0
...         q = self.query()
...         q.filter(column, tc.TDBQCSTRINC, '')
...         q.process(process_result)
...         return result
... 
>>> tdb = MyTDB()
>>> tdb.open('colors.tct', tc.TDBOWRITER | tc.TDBOCREAT)
>>> tdb['banana'] = {'color': 'yellow', 'taste': 'sweet'}
>>> tdb['apple'] = {'color': 'green', 'taste': 'kinda sweet'}
>>> tdb['dust'] = {'taste': 'awful'}
>>> tdb['kiwi'] = {'color': '', 'taste': 'not bad'}
>>> print(tdb.values_for('color'))
set(['', 'green', 'yellow'])

I don't know if this is fast enough though.

In the meantime, I'm gonna try and implement iterators over _queries_ (this is 
the part I missed, sorry), and see if there is a faster way to do what you 
want. I could not find a reference to what you describe in the Tokyo Tyrant 
documentation though (did you use 'tcrdbmisc'?).

thanks for the help

malek

Original comment by lekma...@gmail.com on 1 Jul 2010 at 7:57

GoogleCodeExporter commented 9 years ago

I think you used a combination of 'tcrdbqrysearchget' 'tcrdbqryrescols', I need 
to understand what those do...

Original comment by lekma...@gmail.com on 1 Jul 2010 at 10:36

GoogleCodeExporter commented 9 years ago

Your best bet for now would be something like:

>>> import tokyo.cabinet as tc
>>> class MyTDB(tc.TDB):
...     def values_for(self, column):
...         return set(value[column] for value in self.itervalues()
...                    if column in value)
... 
>>> tdb = MyTDB()
>>> tdb.open('colors.tct', tc.TDBOWRITER | tc.TDBOCREAT)
>>> tdb['banana'] = {'color': 'yellow', 'taste': 'sweet'}
>>> tdb['apple'] = {'color': 'green', 'taste': 'kinda sweet'}
>>> tdb['dust'] = {'taste': 'awful'}
>>> tdb['kiwi'] = {'color': '', 'taste': 'not bad'}
>>> tdb['lime'] = {'color': 'green', 'taste': 'zingy'}
>>> print(tdb.values_for('color'))
set(['', 'green', 'yellow'])

If you want more details checkout the TDBQueryIter branch, build it and run the 
attached file (be careful, it will create a big db file (250MB here)).

I'm not completely convinced queries need iterators (maybe only over keys)...

Original comment by lekma...@gmail.com on 2 Jul 2010 at 1:35

Attachments:

test_tdb_big_query.py

GoogleCodeExporter commented 9 years ago

Thanks a lot! Sorry for not being clear at the beginning, I should have 
mentioned queries. And sorry for not helping with the code, I have very 
little knowledge of C.

Concerning iterators: they are certainly needed for keys, but maybe not for 
values. When developing Pyrant (for TT) we used that "get\0foo" because of a) 
TT's getitem not being as cheap as in bare TC, and b) the significant overhead 
of converting the whole value from the serialized string to a dictionary just 
to get a single column or two. Obviously, for tokyo.cabinet only "b" makes 
sense. Anyway, I think iterator + getitem will do.

P.S.: tried the file, realized how incredibly slow the queries are in TC on 
large datasets... a simple count() on an empty query (no conditions) takes a 
lot of time. Roughly the same results with python-tokyo and pyrant. Works fine 
with thousands of records but not millions. By the way, Tyrant eats up to 
~200MB during count() when queried via pyrant, and the Python process with 
tokyo.cabinet (both trunk and TDBQueryIter) eats up to 500 MB (with the same 
260MB file). In both cases the process takes around 20..40 seconds to finish 
(presumably depending on the overall system load). Looks like something can be 
optimized in TC itself. But I have absolutely no idea about how it even works. 
I can't imagine how an iteration with i++ can take so much time and eat so much 
memory. This operation does not even collect the keys, nor does it do any 
comparison. Moreover, adding conditions does not seem to affect the duration in 
a measurable way (the queries seem to be faster with conditions than without 
them but I just can't believe it). Playing with bnum and friends didn't work 
for me. Hm...

Original comment by neithere on 2 Jul 2010 at 10:06

GoogleCodeExporter commented 9 years ago

> Thanks a lot! Sorry for not being clear at the beginning, I should
> have mentioned queries. And sorry for not helping with the code, I
> have very little knowledge of C.
> 
> Concerning iterators: they are certainly needed for keys, but maybe
> not for values. When developing Pyrant (for TT) we used that
> "get\0foo" because of a) TT's getitem not being as cheap as in bare
> TC, and b) the significant overhead of converting the whole value
> from the serialized string to a dictionary just to get a single
> column or two. Obviously, for tokyo.cabinet only "b" makes sense.
> Anyway, I think iterator + getitem will do.
atm you already get an iterable (a tuple) as the result of
R/TDBQuery.search()

That said, if I find the time and a simple way to do it (simpler than
the current TDBQuery branch) I'll try and implement a keys iterator for
queries (this will have to wait for a 0.6.2 release). I don't want to
do the search again and again implicitly each time you want to iterate
over a query, so I need to keep track of changes to the query (sort,
filter,...). I'll also refactor the DB iterators for 0.6.2 (there's a
lot of duplicate code in here atm).
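
Roughly the idea, as a pure-Python sketch: cache the result of the search and throw the cache away whenever the query changes. `CachedQuery` and its `search_func` argument are hypothetical and do not reflect the actual C implementation:

```python
class CachedQuery(object):
    """Hypothetical wrapper: re-run the search only after the query changed."""

    def __init__(self, search_func):
        self._search = search_func   # callable taking the list of conditions
        self._conditions = []
        self._cache = None           # None means "stale, search again"
        self.searches = 0            # counts real searches, for illustration

    def filter(self, column, value):
        self._conditions.append((column, value))
        self._cache = None           # any change invalidates the cached keys

    def __iter__(self):
        if self._cache is None:
            self._cache = tuple(self._search(self._conditions))
            self.searches += 1
        return iter(self._cache)

data = {'banana': {'colour': 'yellow'}, 'apple': {'colour': 'green'}}

def search(conditions):
    # Stand-in search: keys of records matching every (column, value) pair.
    return sorted(k for k, cols in data.items()
                  if all(cols.get(c) == v for c, v in conditions))

q = CachedQuery(search)
q.filter('colour', 'green')
first = list(q)
second = list(q)      # served from the cache, no second search
```

Sorting or changing limits would need the same invalidation as filter() does here.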

> 
> P.S.: tried the file, realized how incredibly slow the queries are in
> TC on large datasets... a simple count() on an empty query (no
> conditions) takes a lot of time. Roughly the same results with
> python-tokyo and pyrant. Works fine with thousands of records but not
> millions. By the way, Tyrant eats up to ~200MB during count() when
> queried via pyrant, and the Python process with tokyo.cabinet
> (both trunk and TDBQueryIter) eats up to 500 MB (with the same 260MB
> file). In both cases the process takes around 20..40 seconds to
> finish (presumably depending on the overall system load). Looks like
> something can be optimized in TC itself. But I have absolutely no
> idea about how it even works. I can't imagine how an iteration with
> i++ can take so much time and eat so much memory. This operation does
> not even collect the keys, nor does it do any comparison. Moreover,
> adding conditions does not seem to affect the duration in a
> measurable way (the queries seem to be faster with conditions than
> without them but I just can't believe it). Playing with bnum and
> friends didn't work for me. Hm...
> 
afaict count() eats up nothing in terms of memory (search() does
though). Can you adjust the attached tests for pyrant, run them and
report? I first run test_tdb_big_query_count.py, then start a server
with: ttserver /tmp/big_tc_test.tct and then run
test_rtdb_big_query_count.py and test_pyrant_big_query_count.py

On my computer (amd64 with 4GB ram), I get 7 secs for an empty (as in no
conditions, nor sorting) query count() with both pyrant and
tokyo.tyrant, which is consistent with the 7 secs I get for an empty
query search() with tokyo.cabinet.
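
For comparable numbers across machines, a small wall-clock helper along these lines may be enough. This is only a sketch: `timed` is a hypothetical helper, and the workload below is a stand-in to be replaced with the actual count() or search() call:

```python
import time

def timed(func, *args):
    """Return (result, elapsed_seconds) for a single call of func."""
    start = time.time()
    result = func(*args)
    return result, time.time() - start

# Stand-in workload; replace with e.g. a query's count() call.
result, elapsed = timed(sum, range(1000000))
```

A single run is sensitive to system load, so repeating the call a few times and keeping the best time gives a steadier figure.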

Memory wise, allocation seems to be relatively normal (well, you really
should have a look at another container for the result of
query.columns() in pyrant, this eats up 2GB when run against the
big_tc_test.tct database). Compared to their equivalent in python the
results of search() for both tokyo.cabinet and tokyo.tyrant seem to be
on par (keep in mind that in Tokyo Tyrant there's a whole lot of
socket reading overhead influencing memory usage (roughly twice the
memory is allocated)).

So, if you don't have objections (or if you really really need query
iterators before that), I'll release current trunk as 0.6.1 when I come
back from my vacation, in a week and a half.

Thanks for all your help,

malek

Original comment by lekma...@gmail.com on 9 Jul 2010 at 9:59

Attachments:

Issue1.tar.bz2

assad2008 / tokyo-python
