ericlagergren / marisa-trie

Automatically exported from code.google.com/p/marisa-trie

How can I merge a large number of large marisa tries? #16

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
I have about 400 marisa tries, each about 9 MB in size. I would like to merge
them. How can I do that?

Original issue reported on code.google.com by neshmai...@gmail.com on 19 Jan 2013 at 5:10

GoogleCodeExporter commented 8 years ago
Umm... You have to dump the keys from the tries and then build a new trie from them.
However, the memory usage of building the new marisa trie will be unacceptable if 
all the keys are unique.
It might require more than 50 GiB of memory.

Original comment by susumu.y...@gmail.com on 20 Jan 2013 at 4:03

GoogleCodeExporter commented 8 years ago
I'm not sure, but cloud computing, such as Amazon EC2, might be a solution if 
the cost is acceptable.

Regards,

Original comment by susumu.y...@gmail.com on 20 Jan 2013 at 4:08

GoogleCodeExporter commented 8 years ago
Hmm, I am a student, and Amazon EC2 is not really what I was looking for, but 
thanks. Also, will marisa-build open the dump file with memory-mapped I/O? If 
so, it will probably not be a problem. My keys are not unique.

Thanks

Original comment by neshmai...@gmail.com on 20 Jan 2013 at 2:52

GoogleCodeExporter commented 8 years ago
Unfortunately, marisa-build does not use memory-mapped I/O.

Instead, if there are many repeated keys in the dump, you can use a combination of 
'sort' and 'uniq' to remove the duplicates before building.

Original comment by susumu.y...@gmail.com on 20 Jan 2013 at 4:01
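
For readers who want to try the 'sort' and 'uniq' suggestion above without loading every key into memory at once, here is a minimal sketch of the deduplication step in Python. It is not part of marisa-trie. It assumes that each trie has already been dumped to a text file with one key per line, and that each dump file has been sorted individually in byte order (for example with 'sort' in the C locale). The function names `merge_unique` and `merge_dump_files` and the file layout are hypothetical, chosen only for illustration.

```python
import heapq
from itertools import groupby


def merge_unique(sorted_streams):
    """Lazily merge already-sorted key streams, yielding each distinct key once.

    Only one pending key per stream is held in memory at a time.
    """
    merged = heapq.merge(*sorted_streams)
    # Equal keys are adjacent in the merged stream, so groupby collapses them.
    for key, _group in groupby(merged):
        yield key


def merge_dump_files(dump_paths, out_path):
    """Merge sorted dump files (one key per line) into one deduplicated file.

    Assumption: every input file is sorted in byte order and keys contain
    no newline characters. Returns the number of distinct keys written.
    """
    files = [open(path, "rb") for path in dump_paths]
    try:
        # Compare raw bytes so the order matches a byte-wise (C locale) sort.
        streams = [(line.rstrip(b"\n") for line in f) for f in files]
        count = 0
        with open(out_path, "wb") as out:
            for key in merge_unique(streams):
                out.write(key + b"\n")
                count += 1
        return count
    finally:
        for f in files:
            f.close()
```

The resulting file contains each key once and can then be used as the input for building the new trie. This only shrinks the input when keys really are repeated across tries. It does not reduce the memory needed by the build step itself, which still depends on the number of distinct keys. With about 400 inputs, all files are open at the same time, so the per-process open-file limit has to allow that.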