Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

Strategically index before merging in API calls? #100

Closed bmschmidt closed 8 years ago

bmschmidt commented 8 years ago

Since @organisciak is making some changes and knows the data analysis side of python better than me, I want to tag him on a nagging question of mine:

Every Bookworm query currently involves merging two pandas dfs (https://github.com/Bookworm-project/BookwormDB/blob/master/bookwormDB/general_API.py#L260-L265).

On large queries (those returning more than 500 rows? 5,000?) I've been worried that since there's no pandas index on the keys used for the join, it may be an expensive n^2 operation. So maybe things would go faster if we sometimes or always built an index before doing that join.

OTOH, I don't really know much about pandas. Maybe it does that automatically, like dplyr in R.
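To make the question concrete, here's a minimal sketch of the two approaches (the column names are hypothetical, not the actual Bookworm schema):

```python
import pandas as pd

# Two frames sharing a key column, as in the Bookworm merge.
# Column names here are illustrative only.
left = pd.DataFrame({"year": range(2000), "WordCount": range(2000)})
right = pd.DataFrame({"year": range(2000), "TextCount": range(2000)})

# Current approach: merge on the shared column, no index built.
merged = left.merge(right, on="year")

# Alternative being proposed: build an index first, then join on it.
indexed = left.set_index("year").join(right.set_index("year"))
```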

organisciak commented 8 years ago

I'll take a look. My dev server just went down and I don't have hard reboot access, so I'll tackle this when it's back up.

By the way, you were the one that first turned me on to Pandas, so I can't claim to be an expert: I was all-in R until recently.


organisciak commented 8 years ago

I did some benchmarking; the pandas merge is very effective as is. The overhead from building an index is too great if the index isn't needed later.

One thing I couldn't make sense of: under what circumstances would there be no overlapping columns, triggering the dummy_merge_variable case?
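A quick benchmark along the lines described, using timeit (the sizes and key distribution are my assumptions, not the actual benchmark that was run):

```python
import timeit
import numpy as np
import pandas as pd

n = 5000  # roughly the large-query row count discussed above
rng = np.random.default_rng(0)  # seed is arbitrary
keys = rng.permutation(n)
left = pd.DataFrame({"key": keys, "a": np.arange(n)})
right = pd.DataFrame({"key": keys, "b": np.arange(n)})

# Plain column merge vs. building an index and joining on it.
plain = timeit.timeit(lambda: left.merge(right, on="key"), number=20)
with_index = timeit.timeit(
    lambda: left.set_index("key").join(right.set_index("key")),
    number=20,
)
print(f"merge on column: {plain:.3f}s, set_index + join: {with_index:.3f}s")
```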

bmschmidt commented 8 years ago

I think the case where the dummy column comes out is when there are no groups at all: e.g., the queries that populate the browser, which just want the full word count and the full text count for the corpus, not faceted in any way. Those return data frames with a single column, the count variable, so there needs to be something else to join on.
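In other words, something like this (names are illustrative, not the exact Bookworm code):

```python
import pandas as pd

# Ungrouped totals come back as single-row, single-column frames,
# so there is no shared key; a constant dummy column provides one.
word_counts = pd.DataFrame({"WordCount": [123456]})
text_counts = pd.DataFrame({"TextCount": [789]})

word_counts["dummy_merge_variable"] = 1
text_counts["dummy_merge_variable"] = 1

merged = word_counts.merge(text_counts, on="dummy_merge_variable")
merged = merged.drop(columns="dummy_merge_variable")
```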


organisciak commented 8 years ago

Ah, I see now. Makes sense. It won't make any notable difference other than succinctness, but I'll update it to join on index.
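The index-based version of the ungrouped case would look roughly like this (again a sketch, not the committed change): single-row frames already share the default RangeIndex, so a plain index join needs no dummy key column.

```python
import pandas as pd

# Both frames have the same default RangeIndex ([0]),
# so join() lines them up without any dummy merge column.
word_counts = pd.DataFrame({"WordCount": [123456]})
text_counts = pd.DataFrame({"TextCount": [789]})

merged = word_counts.join(text_counts)
```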
