Antar1011 / Onix

The Pokemon Data-Mining Package
http://onix.readthedocs.io/en/latest/
GNU General Public License v3.0
7 stars 0 forks source link

Rethink moveset report strategy #104

Closed Antar1011 closed 7 years ago

Antar1011 commented 7 years ago

get_abilities, the first query I've built for the moveset report, is really slow when operating on a full dataset. And when you think about it, that makes sense, because it's still having to scan through every Pokemon used in every battle in the metagame-month, since the filter is on a group-concat, which I suspect is a rather expensive operation (especially compared to a straight table lookup). OTOH, subsequent runs in the same session are quite a bit faster, thanks, I'm guessing, to the subquery being cached.

So the question is whether I want to let this ride, write a set of queries that work, and not worry about performance until I need to (there's merit to that when I don't know what my performance targets are or whether I'm already meeting them) or do I redesign now so I potentially don't have to rewrite everything further down the line?

Basically, I can think of two alternative strategies:

  1. Create a view / temp table / permanent table of weighted counts per SID for each metagame and weighting, add an indexed column for forme-concatenation, and then run my moveset queries against that.

    Advantages:

    • Especially given the forme-concat column, I expect this to execute much faster
    • This table is what I envisioned creating for the public "deep query" API

    Disadvantages:

    • Startup cost
    • Workflow needs a startup step
    • Maintenance / organization burden for the derived tables
    • Limited to preset weights
    • Including startup cost, it's possible that this won't be significantly faster than what I have now, depending on how caching works
  2. Generate moveset reports all at once. That is, get abilities for all Pokemon in the tier at once by just doing one master group-by-aggregate for each type of report

    Advantages:

    • Should be the most performant option, as it cuts down massively on the number of queries

    Disadvantages:

    • Completely changes the workflow I envisioned

I mean, I think through the process of writing this up, I've come up with my answer, but I am going to sit on it for a minute.

Antar1011 commented 7 years ago

Going with (2). Stories updated accordingly.