christian-westbrook / interactive-sentence-predictor

An interactive sentence predictor that generates likely continuations of partial sentences.

Extremely large data sets #8

Open rfishe02 opened 6 years ago

rfishe02 commented 6 years ago

What if we're given a data set that we can't hold in memory?

I think this could create an issue in both the preprocessor and the runtime application.

Perhaps we could split the builder application into separate stages.

The n-grams could be written directly to disk, then consolidated using the uniq command.

Then, we could select the top 20 or so when building the HashMaps.
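
A minimal sketch of that flow, assuming the builder writes one n-gram per line to a hypothetical file named ngrams.txt:

x@linux:~/xxx/xxx> sort ngrams.txt | uniq -c | sort -rn > counts.txt

Here sort groups identical n-grams together, uniq -c collapses each group into one line prefixed with its count, and the second sort -rn orders the result from most to least frequent. Since GNU sort performs an external merge sort with temporary files, this step is bounded by disk space rather than RAM.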

christian-westbrook commented 6 years ago

How would we compute the frequency of the n-grams if we use uniq to remove duplicates? Would we compute the frequency first, and then use uniq to combine them?

rfishe02 commented 6 years ago

When we pipe the output of sort into uniq, it eliminates consecutive duplicate lines. If we add the -c flag, uniq also prints a count of each run of duplicates, and we can use that as our frequency count.

We would only need to run sort | uniq -c on the n-gram lists to calculate the frequency of each n-gram. Then we would load the information into the MapBuilder, where we would simply use .split() to store the n-gram and frequency count in the map. (The command I wrote turns the frequency list into a .csv.)
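
The exact command isn't reproduced here, but one possible form, assuming the hypothetical ngrams.txt file from above and n-grams that never contain commas, is a short awk rewrite of the uniq -c output:

x@linux:~/xxx/xxx> sort ngrams.txt | uniq -c | awk '{c=$1; sub(/^[ \t]*[0-9]+[ \t]+/,""); print $0","c}' > ngrams.csv

Each output line then has the form "n-gram,count", which the MapBuilder could read line by line and split on the comma.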

rfishe02 commented 6 years ago

It would be easier for me to explain how this works with an example. Suppose we have the file "fruits". The frequency of each word is 4 for apple, 3 for banana, 2 for orange, and 1 for grapefruit. The output from the sort | uniq -c command would have the same information.

x@linux:~/xxx/xxx> cat fruits
apple
apple
banana
orange
banana
apple
grapefruit
apple
orange
banana
x@linux:~/xxx/xxx> sort fruits | uniq -c
      4 apple
      3 banana
      1 grapefruit
      2 orange

If we use uniq -c without sort, it only counts runs of duplicate consecutive lines. By piping in sorted input, we turn this property of uniq to our advantage.

x@linux:~/xxx/xxx> uniq -c fruits
      2 apple
      1 banana
      1 orange
      1 banana
      1 apple
      1 grapefruit
      1 apple
      1 orange
      1 banana

rfishe02 commented 6 years ago

Another advantage of this approach is that we can select the top n frequencies from each list. We could place a cap on the number of n-grams that we load into the maps, but I'm not sure what the full ramifications would be. The application might fail to recognize any n-gram that falls below the cap.

We could add the head command to the pipeline to achieve this.
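
One caveat: uniq -c output is ordered by the n-gram text rather than by count, so we'd need to re-sort numerically before head takes its slice. A sketch, reusing the hypothetical ngrams.txt from above:

x@linux:~/xxx/xxx> sort ngrams.txt | uniq -c | sort -rn | head -n 20

This keeps only the 20 most frequent n-grams, matching the "top 20 or so" idea from earlier in the thread.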

christian-westbrook commented 6 years ago

So, you're worried that the maps might grow too big to be held in memory?

christian-westbrook commented 6 years ago

The first change made to test large data sets will be to cut off the preprocessor as the RAM limit is reached. Future modifications might include storing the data set in a searchable format on disk, so that the size of the data we can process is limited by hard drive space rather than RAM capacity.