Closed rasmusbergpalm closed 6 years ago
hmmm.... 🤔
How about the idea of saving an set of outcome as outcome_set
in memory and perhaps even do the same on feature as well, then in the model we do
val outcome_index = outcome_set.toList().indexOf(outcome)
val feature_index = feature_set.toList().indexOf(feature)
// then use indexes to save and access table values?
table[outcome_index, feature_index] = count
Looks like all these can just implemented in the Table
class internally, so the outside code doesn't have to change or care about this, only when the table does put and get, it does a set.add operation and an IndexOf operation. Maybe?
How would this be better than interning?
perhaps you don't need to worry about the length?
also I am thinking instead of using Pari(outcome_index, feature_index) as the final key to a map, we could make a long value by something like feature_index * 1_000_000 + outcome_index
and use this long value as key? Would that somehow same more space? 🤔
looks like current 20news model size is around 35MB in memory and 6.6M in Protobuf
Using intern the memory size of model reduces to 27MB, interesting....
fixed by #10
To save memory. Especially on the first string in the pair, which is the outcome, will be repeated a lot.
Unfortunately the String.intern() method uses a fixed size hashmap which is controlled by
XX:StringTableSize
, which is annoying to have to tune, and a potential performance problem if not tuned correctly.Consider using https://google.github.io/guava/releases/19.0/api/docs/com/google/common/collect/Interners.html or something similar which automatically resizes the hashtable.