bcgov / ols-geocoder

Physical Address Geocoder
Apache License 2.0
10 stars 6 forks source link

Improve space performance of digital trie #149

Closed mraross closed 1 year ago

mraross commented 3 years ago

Having a more space-efficient trie will allow for more occupants and addresses with site-names to be stored in memory.

  1. Burst Trie are as fast as a Trie but only require about the same amount of space as a binary tree , namely, O(N). A Trie requires O(R*N) where R is the number of keys. In our case, we use 37 keys including 26 alphabetic characters and 10 digits plus the space character. Still looking for a good Java implementation. https://pdfs.semanticscholar.org/7efb/a0bd3850a9f64865bfcb0cf4328e207d0498.pdf https://people.eng.unimelb.edu.au/jzobel/fulltext/acmtois02.pdf

  2. MergedTrie is another alternative to a Trie. It also supports suffix matching! https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0215288&type=printable

alixcote commented 1 year ago

Check what the performance/space gain would be. Do we have space considerations, especially as we move to OpenShift?

gleeming commented 1 year ago

This ticket is proposing possible approaches that could help optimize the in-memory footprint and performance as data volumes increase. This could be of concern if, say, the number of sites/occupants from GSR went from a few thousand to a few million. Note that the "space" terminology in this concept refers to both memory space and the string character used to separate words.

cmhodgson commented 1 year ago

The Trie structure is used for a few purposes in the geocoder, to allow for prefix-only or exact matches, allowing for misspellings, on individual words, street names, and localities. IIRC, the trie data structures represent 20-30% of the total memory used by the geocoder. If the number of site names were to increase substantially, this would cause some growth in the size of the tries, however it would grow at worst linearly and more expected at a logarithmic rate - ie. it shouldn't grow as a percentage of total memory use as the other data related to the sites would also need to be stored.

I'm not sure just how much more space-efficient the proposed structures could be, and more importantly it is not clear that the proposed structures support misspelling/spell-correction within the search. We could potentially do those parts separately, but using multiple different data structures is unlikely to save much space. I think that if the number of sites does grow significantly, we may need to do further analysis as to the best way(s) to reduce memory usage, and there might be some value to considering alternative structures to the trie, but I don't think there is much to be gained by this at this point.

alixcote commented 1 year ago

Thanks - are you good if we close this ticket then? We would address if we need made changes that required more memory usage.

cmhodgson commented 1 year ago

To be addressed in the future if/when required.