WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License
927 stars 122 forks source link

how many memory if load 1 billion token with average length 5(range from 1 to 20) #136

Closed SeekPoint closed 3 years ago

WojciechMula commented 3 years ago

I personally hate this kind of answers, but have to say that: "it depends".

It depends on three major factors:

  1. distribution of words -- the underlying data structure compresses common prefixes, so if there are many unique words, which don't share prefixes with others, then data structure can become huge;
  2. python version -- the module adheres to python's internal representation of strings, i.e. a symbol might occupy one byte, two, or four bytes; this obviously affects on memory consumption, as we need to save the symbols;
  3. python's allocator -- the internal memory fragmentation likely increase the real memory consumption, because we allocate usually small memory chunks.

All in all I'm unable to estimate it. You can obtain memory usage of an existing automaton: https://pyahocorasick.readthedocs.io/en/latest/#get-stats-dict.

SeekPoint commented 3 years ago

it help.

SeekPoint commented 3 years ago

I think, since pyahocorasick is implement with ANSI C, and python plays as a wrapper, so 'python version' and 'python allocator' should not affects the memory consumption.

SeekPoint commented 3 years ago

I had test on 20 millions token of Chinese, it looks the memory consumption is not a problemn on a 16g win10 laptop

WojciechMula commented 3 years ago

@loveJasmine I'm happy it worked for you. :) Would you please paste here the output from get_stats method? It might help others.

SeekPoint commented 3 years ago

my log shows every 3 million tokens take 4% * 16GB memory I use: def printSys(): info = psutil.virtual_memory() print(u'mem percentage:', info.percent) for statistation

na-sketch commented 1 year ago

I wonder if optimization for logographic languages like Chinese is necessary? For example, trie tree works best for Western European languages like English with a very limited character set.

SeekPoint commented 1 year ago

@lerner-zhang , so, do you have any ideal on trie tree works with Chinese or anything replace TRIE tree with Chinese?

na-sketch commented 1 year ago

@loveJasmine I have found some papers on this issue but not read yet. You can google using keywords like "trie" and "Chiinese".