rfishe02 opened this issue 6 years ago
How would we compute the frequency of the n-grams if we use uniq to remove duplicates? Would we compute the frequency first, and then use uniq to combine them?
When we pipe the output of sort into uniq, uniq eliminates consecutive duplicate lines. If we add the -c option, it also prints a count of those duplicate consecutive lines, and we can use that as our frequency count.
We would only need to run sort | uniq -c on the n-gram lists to calculate the frequency of each n-gram. Then we would load that information into the MapBuilder, where we would simply use .split() to store each n-gram and its frequency count in the map. (The command I wrote turns the frequency list into a .csv.)
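For reference, a pipeline along these lines would produce that kind of .csv (the file names here are made up, and the actual command may differ): sort | uniq -c prints the count before the n-gram, so the sed step swaps them around a comma, which is what .split() would break each line on.

x@linux:~/xxx/xxx> sort ngrams.txt | uniq -c | sed -E 's/^ *([0-9]+) (.*)$/\2,\1/' > ngram_freq.csv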
It would be easier for me to explain how this works with an example. Suppose we have the file "fruits". The frequency of each word is 4 for apple, 3 for banana, 2 for orange, and 1 for grapefruit. The output from the sort | uniq -c command would have the same information.
x@linux:~/xxx/xxx> cat fruits
apple
apple
banana
orange
banana
apple
grapefruit
apple
orange
banana
x@linux:~/xxx/xxx> sort fruits | uniq -c
4 apple
3 banana
1 grapefruit
2 orange
If we use uniq -c without sort, it only counts duplicate lines that happen to be consecutive. By piping sorted input into it, we can use this property of uniq to our advantage.
x@linux:~/xxx/xxx> uniq -c fruits
2 apple
1 banana
1 orange
1 banana
1 apple
1 grapefruit
1 apple
1 orange
1 banana
Another advantage of this approach is that we can select the top n frequencies from each list. We could probably place a cap on the number of n-grams that we load into the maps, but I'm not sure what the full ramifications of that would be. I think the application might fail to recognize n-grams that fall below the cap if we decide to do that.
We could add the head command to achieve this.
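With the fruits file from above, selecting the two most frequent words could look like this: the second sort orders the counted lines numerically in reverse, and head keeps only the top of that list.

x@linux:~/xxx/xxx> sort fruits | uniq -c | sort -rn | head -n 2
4 apple
3 banana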
So, you're worried that the maps might grow too big to be held in memory?
The initial change for testing large data sets will be to cut off the preprocessor once the RAM limit is reached. Future modifications might include storing the data set in a searchable format on disk, so that the size of the data we can process is limited by hard drive space rather than RAM capacity.
What if we're given a data set that we can't hold in memory?
I think that this could create an issue in both the preprocessor and runtime application.
Perhaps we could split the builder application.
The n-grams could be written directly to disk, then consolidated using the uniq command.
Then, we could select the top 20 or so when building the HashMaps.
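As a rough sketch of that idea (the chunk file names are hypothetical), if the builder wrote each batch of n-grams to its own sorted file, the consolidation and selection could be done on disk without holding everything in memory: sort -m merges the already-sorted chunks, uniq -c totals the duplicates, and head keeps the most frequent entries for the HashMaps.

x@linux:~/xxx/xxx> sort -m chunk_*.txt | uniq -c | sort -rn | head -n 20 > top_ngrams.txt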