Mardak / profile


Fix #84: Add odp rules #95

Open mzhilyaev opened 10 years ago

Mardak commented 10 years ago

How big a file does the script generate from the 5000-site odp.txt? Should we shrink that source file if the generated output is too big?

Also, I'm not sure how we should include "build" dependencies for the Perl pieces. I hit this error: Can't locate JSON.pm in @INC

mzhilyaev commented 10 years ago

The size of the file is 491505 bytes, slightly smaller than textModel.json and uuidMapping.json of edrules.

Mardak commented 10 years ago

@oyiptong, should the size of textModel scale with the number of categories? The odp.txt file here seems to have 867 categories that happen to be used by these top 5000 sites, so would that result in a 100MB textModel?

oyiptong commented 10 years ago

There will be an increase, but I'm not sure by how much. Let's do a rough estimate. The number of categories will have the most impact on size.

Given that:

  1. the number of words is fixed
  2. the average category line is < 20 ASCII characters
  3. a probability line is 18 characters
  4. let's not count whitespace

867 categories should yield approx:

That makes it ~83MB without whitespace, so 100MB sounds about right.

Other text models (using something other than naive Bayes) will be much smaller.
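
For anyone who wants to redo the rough estimate above with their own numbers, here is a minimal sketch. It assumes the model stores one category line per category plus one probability line per (word, category) pair, which is my reading of the assumptions listed above rather than a confirmed description of the textModel format. The function name estimate_text_model_bytes is hypothetical, and the vocabulary size used in the example is a placeholder, not a figure from this thread.

```python
def estimate_text_model_bytes(num_categories, num_words,
                              category_line_chars=20,
                              probability_line_chars=18):
    """Rough size in bytes of a naive Bayes text model, ignoring whitespace.

    Assumption: one category line per category and one probability line
    per (word, category) pair, with one byte per ASCII character.
    """
    category_bytes = num_categories * category_line_chars
    probability_bytes = num_words * num_categories * probability_line_chars
    return category_bytes + probability_bytes


# 867 categories comes from the thread; 5000 words is a placeholder
# vocabulary size chosen only for illustration.
size = estimate_text_model_bytes(num_categories=867, num_words=5000)
print(f"{size / 1e6:.1f} MB")
```

Under these assumptions the size grows linearly with the number of categories, so halving the category count roughly halves the model.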