Tradeshift / blayze

A fast and flexible Naive Bayes implementation for the JVM
MIT License
19 stars 11 forks source link

Consider interning Strings in MapTable #8

Closed rasmusbergpalm closed 6 years ago

rasmusbergpalm commented 6 years ago

To save memory. Especially on the first string in the pair, which is the outcome, will be repeated a lot.

Unfortunately the String.intern() method uses a fixed size hashmap which is controlled by XX:StringTableSize, which is annoying to have to tune, and a potential performance problem if not tuned correctly.

Consider using https://google.github.io/guava/releases/19.0/api/docs/com/google/common/collect/Interners.html or something similar which automatically resizes the hashtable.

liufuyang commented 6 years ago

hmmm.... 🤔

How about the idea of saving an set of outcome as outcome_set in memory and perhaps even do the same on feature as well, then in the model we do

val outcome_index = outcome_set.toList().indexOf(outcome)
val feature_index = feature_set.toList().indexOf(feature)

// then use indexes  to save and access table values?

table[outcome_index, feature_index] = count

Looks like all these can just implemented in the Table class internally, so the outside code doesn't have to change or care about this, only when the table does put and get, it does a set.add operation and an IndexOf operation. Maybe?

rasmusbergpalm commented 6 years ago

How would this be better than interning?

liufuyang commented 6 years ago

perhaps you don't need to worry about the length?

also I am thinking instead of using Pari(outcome_index, feature_index) as the final key to a map, we could make a long value by something like feature_index * 1_000_000 + outcome_index and use this long value as key? Would that somehow same more space? 🤔

liufuyang commented 6 years ago

looks like current 20news model size is around 35MB in memory and 6.6M in Protobuf

liufuyang commented 6 years ago

Using intern the memory size of model reduces to 27MB, interesting....

rasmusbergpalm commented 6 years ago

fixed by #10