fraserh / waldi


Master categories. #12

Open fraserh opened 9 years ago

fraserh commented 9 years ago

Need to decide on common categories between Coles and Woolworths. As the parser opens each file, I'm currently checking the filename against a list of keywords that correspond to a category (e.g. bakery is converted to bread). Each category has its own CSV. There are a few ambiguous categories though (dessert, chilled, etc.).
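A minimal sketch of the filename check described above. Only the `bakery` to `bread` mapping comes from this thread; the other keywords, the `KEYWORD_TO_CATEGORY` name, and the `category_for` helper are hypothetical, and the real parser may work differently.

```python
from pathlib import Path

# Hypothetical keyword -> category table.
# Only "bakery" -> "bread" is taken from the thread; the rest are placeholders.
KEYWORD_TO_CATEGORY = {
    "bakery": "bread",
    "bread": "bread",
    "meat": "meat",
    "pantry": "pantry",
}


def category_for(filename):
    """Return the category of the first keyword found in the file name, else None."""
    stem = Path(filename).stem.lower()
    for keyword, category in KEYWORD_TO_CATEGORY.items():
        if keyword in stem:
            return category
    # No keyword matched: the ambiguous cases (dessert, chilled) land here.
    return None
```

Returning `None` for unmatched files keeps the ambiguous categories visible instead of silently forcing them into a bucket.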

matthewpalmer commented 9 years ago

Sounds good. What are the applications of this?

fraserh commented 9 years ago

Smaller lists to send to match.py (more specific categories mean the n^2 algo has fewer loops).

fraserh commented 9 years ago

I gave the combined meat a run (about 1500 items); it took about 15 seconds. I don't think you'd want any more items than that. Pantry is a couple thousand though.

matthewpalmer commented 9 years ago

How much of a difference does it make in terms of actual time?

fraserh commented 9 years ago

You mean categories vs non-categories dbs?

fraserh commented 9 years ago

Passing match.py all 7000 items, it hadn't finished on my MacBook after about 3 mins.

matthewpalmer commented 9 years ago

It wouldn't finish at all or it would take longer than 3 mins?

fraserh commented 9 years ago

It would take longer than 3 mins. My estimate is 6 mins
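As a rough check on that estimate, the meat run (about 1500 items in about 15 seconds) can be scaled up, assuming the runtime grows strictly with n^2 and ignoring any constant overhead. The `estimate_seconds` helper is hypothetical and only does this arithmetic.

```python
def estimate_seconds(n_items, ref_items=1500, ref_seconds=15.0):
    """Scale a measured runtime quadratically with the number of items.

    Assumes runtime is proportional to n^2, as described for match.py.
    """
    return ref_seconds * (n_items / ref_items) ** 2


# 15 s * (7000 / 1500)^2 is roughly 327 s.
print(round(estimate_seconds(7000) / 60, 1))  # prints 5.4 (minutes)
```

That lands close to the 6 minute guess, so the two figures are consistent with a quadratic algorithm.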

fraserh commented 9 years ago

Also, wouldn't it prevent more false positives by only comparing like categories?
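A small sketch of what "only comparing like categories" could look like: group items by category first, then generate candidate pairs inside each group only. The `(name, category)` tuple shape and the `candidate_pairs` helper are assumptions for illustration; the real match.py presumably compares Coles items against Woolworths items, which this sketch ignores.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(items):
    """Yield pairs of item names that share a category.

    `items` is an iterable of (name, category) tuples (a hypothetical shape).
    """
    by_category = defaultdict(list)
    for name, category in items:
        by_category[category].append(name)
    # Pairs are only formed within a category, never across categories.
    for names in by_category.values():
        yield from combinations(names, 2)


items = [
    ("beef mince", "meat"),
    ("lamb chops", "meat"),
    ("white loaf", "bread"),
]
pairs = list(candidate_pairs(items))
```

Here `pairs` contains only the two meat items, so a bread item can never be matched against a meat item by mistake.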

matthewpalmer commented 9 years ago

That's alright; this only needs to run once or twice per week. Comparing only like categories might reduce false positives, but let's worry about that if it happens too much, and once we have actually applied the data to some sample runs. Otherwise we're forcing ourselves into a unified category system without fully understanding the whole data set.

The db shouldn't expect unified categories anyway. It'll be easier to map each store's category to a unified category in a separate table than to try to save the item under a category that we decide.
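One way the separate mapping table could look, sketched with Python's built-in `sqlite3` module. The table and column names (`items`, `category_map`, `store_category`, `unified_category`) and the sample rows are hypothetical, not the project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Items keep whatever category their own store uses.
CREATE TABLE items (
    name TEXT, store TEXT, store_category TEXT
);
-- The unified category lives in a separate lookup table.
CREATE TABLE category_map (
    store TEXT, store_category TEXT, unified_category TEXT,
    PRIMARY KEY (store, store_category)
);
""")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", [
    ("White loaf", "coles", "bakery"),
    ("Sourdough", "woolworths", "bread"),
])
conn.executemany("INSERT INTO category_map VALUES (?, ?, ?)", [
    ("coles", "bakery", "bread"),
    ("woolworths", "bread", "bread"),
])

# Resolve the unified category at query time; unmapped items come back as NULL.
rows = conn.execute("""
SELECT i.name, m.unified_category
FROM items i
LEFT JOIN category_map m
  ON m.store = i.store AND m.store_category = i.store_category
ORDER BY i.name
""").fetchall()
```

With this layout, changing our mind about a unified category means editing one row in `category_map` instead of rewriting every stored item.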