fraserh opened 9 years ago
Sounds good. What are the applications of this?
Smaller lists to send to match.py (more specific categories mean the O(n^2) algorithm does fewer comparisons).
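To make the size of the win concrete, here's a rough back-of-envelope sketch (not the actual match.py code; the item counts are just the numbers mentioned in this thread, and the even 7-way split is a simplifying assumption):

```python
def pair_count(n):
    """Number of pairwise comparisons an O(n^2) matcher performs on n items."""
    return n * (n - 1) // 2

# All 7000 items in one list vs. the same items split into
# 7 hypothetical categories of 1000 items each.
whole = pair_count(7000)       # 24,496,500 comparisons
split = 7 * pair_count(1000)   # 3,496,500 comparisons
print(whole, split, round(whole / split, 1))  # roughly a 7x reduction
```

In general, splitting n items evenly into k categories cuts the pairwise work by about a factor of k, which lines up with the 15-seconds-per-1500-items figure below.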
I gave the combined meat list a run (about 1500 items); it took about 15 seconds. I don't think you'd want any more items than that. Pantry is a couple of thousand though.
How much of a difference does it make in terms of actual time?
You mean categories vs non-categories dbs?
Passing match.py all 7000 items wouldn't finish on my MacBook after about 3 mins.
It wouldn't finish at all or it would take longer than 3 mins?
It would take longer than 3 mins. My estimate is 6 mins
Also, wouldn't it prevent more false positives by only comparing like categories?
That's alright. This only needs to run once or twice per week. It might. Let's worry about that if it happens too much though, and when we have actually applied the data to some sample runs. Otherwise we're forcing ourselves into a unified category system without fully understanding the whole data set.
The db shouldn't expect unified categories anyway. It'll be easier to map each store's category to a unified category in a separate table than to try to save the item under a category that we decide on.
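Something like this is what I had in mind for the separate mapping table; modelled as a dict here for illustration, but it'd be a real table in the db. The store names and category strings are hypothetical examples, not the actual scraped values:

```python
# Hypothetical mapping table: (store, raw store category) -> unified category.
CATEGORY_MAP = {
    ("coles", "bakery"): "bread",
    ("woolworths", "bread-bakery"): "bread",
    ("coles", "meat-seafood"): "meat",
    ("woolworths", "meat"): "meat",
}

def unified_category(store, raw_category, default="uncategorised"):
    """Resolve a store-specific category at query time instead of at save time."""
    return CATEGORY_MAP.get((store, raw_category.lower()), default)

print(unified_category("coles", "Bakery"))        # bread
print(unified_category("woolworths", "chilled"))  # uncategorised
```

The point being: items keep their original store category in the db, and the unification lives entirely in this one table, so we can change our minds about the unified scheme without touching the item rows.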
Need to decide on common categories between Coles/Woolworths. As the parser opens each file, I'm currently checking the filename against a list of keywords that correspond to a category (e.g. bakery is converted to bread). Each category has its own csv. There are a few ambiguous categories though (dessert, chilled, etc.).
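For reference, the filename check is roughly this shape (the keyword lists below are illustrative placeholders, not the actual ones in the parser):

```python
import os

# Hypothetical keyword lists; the real parser has its own.
KEYWORDS = {
    "bread": ["bakery", "bread"],
    "meat": ["meat", "seafood"],
    "pantry": ["pantry", "grocery"],
}

def category_for(filename):
    """Map a csv filename to a category by keyword substring match."""
    name = os.path.basename(filename).lower()
    for category, words in KEYWORDS.items():
        if any(word in name for word in words):
            return category
    return None  # ambiguous files (dessert, chilled, ...) fall through

print(category_for("coles_bakery_2016.csv"))  # bread
print(category_for("woolworths_chilled.csv"))  # None
```

Files that return None are the ambiguous ones we still need to decide on.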