endlessm / azafea

Service to track device activations and usage metrics
Mozilla Public License 2.0

Add a command to renormalize vendors #23

Closed bochecha closed 4 years ago

bochecha commented 5 years ago

Something like:

$ azafea -c config.toml normalize-vendors MODEL_NAME COLUMN_NAME

This will be necessary to really fix #22.

The current vendor mapping is something I threw together quickly, so it will improve over time, and we will need such a command each time it changes.
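To illustrate what the generic form would have to do, here is a minimal sketch. It is not Azafea code: `FakeActivation`, the `MODELS` registry and the stand-in `normalize_vendor` are all made up for illustration, and the real command would work on database rows rather than an in-memory list.

```python
from dataclasses import dataclass


def normalize_vendor(vendor):
    # Stand-in for azafea.vendors.normalize_vendor: the real function applies
    # a vendor mapping, this one only collapses whitespace.
    return ' '.join(vendor.split())


@dataclass
class FakeActivation:
    # Hypothetical stand-in for a database model with a vendor column
    vendor: str


# Hypothetical registry: the generic command needs some way to turn
# MODEL_NAME into a model class.
MODELS = {'FakeActivation': FakeActivation}


def renormalize(records, model_name, column_name):
    model = MODELS[model_name]
    for record in records:
        if not isinstance(record, model):
            raise TypeError(f'expected {model_name}, got {type(record).__name__}')
        # COLUMN_NAME is only known at runtime, hence getattr/setattr
        setattr(record, column_name, normalize_vendor(getattr(record, column_name)))


records = [FakeActivation('  Acme   Inc '), FakeActivation('Endless')]
renormalize(records, 'FakeActivation', 'vendor')
```

The name-to-model lookup and the attribute access by string are the parts a generic command cannot avoid.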

bochecha commented 5 years ago

So making a generic command like this is a bit of an ugly beast…

Ideally I'd like Azafea plugins to be able to add their own subcommands, so in this case this would become:

$ azafea -c config.toml activation normalize-vendors

Since that command would be implemented by the activation event processor, it would know which model/column to normalize.
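A rough sketch of what plugin-provided subcommands could look like, using nested `argparse` subparsers. The `build_parser` function and the `register_activation_commands` hook are hypothetical names for illustration only; Azafea does not currently expose anything like them, and the handler here just returns a string instead of touching the database.

```python
import argparse


def build_parser(plugins):
    # `plugins` maps a plugin name to a callable that registers that
    # plugin's commands. This registration hook is an assumption.
    parser = argparse.ArgumentParser(prog='azafea')
    parser.add_argument('-c', '--config', default='config.toml')
    top_level = parser.add_subparsers(dest='plugin', required=True)

    for name, register in plugins.items():
        plugin_parser = top_level.add_parser(name)
        commands = plugin_parser.add_subparsers(dest='command', required=True)
        register(commands)

    return parser


def register_activation_commands(subparsers):
    # Hypothetical hook the activation event processor would provide
    cmd = subparsers.add_parser('normalize-vendors')
    # The plugin knows its own model and column, so no arguments are needed
    cmd.set_defaults(handler=lambda args: f'normalizing vendors using {args.config}')


parser = build_parser({'activation': register_activation_commands})
args = parser.parse_args(['-c', 'config.toml', 'activation', 'normalize-vendors'])
result = args.handler(args)
```

This only covers the command-line parsing side; how plugins would be discovered and handed the database connection is left open.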


However, that's hard to implement well, and it might require rethinking how event processor plugins register with Azafea.

We did need to normalize the existing vendors though, so for now we went with a quick ad hoc script, as we lacked the time to do the above.

As that script would become the basis for a dedicated command, I'll paste it below:

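import sys

from azafea.config import Config
from azafea.model import Db
from azafea.vendors import normalize_vendor

from azafea.event_processors.activation.v1 import Activation

# Number of rows renormalized per database session
CHUNK_SIZE = 5000

def progress(current, total):
    # Draw a one-line progress bar, redrawn in place thanks to '\r'
    bar_length = 60
    # An empty table counts as complete; this avoids dividing by zero
    done = int(bar_length * current / total) if total else bar_length
    remaining = bar_length - done

    print(f'\r|{"#" * done}{" " * remaining}|  {current} / {total}', end='')

def renormalize_chunk(start, stop):
    # Each chunk runs in its own database session
    with db as dbsession:
        # Order by id so the slices stay stable from one chunk to the next
        for activation in dbsession.query(Activation).order_by(Activation.id).slice(start, stop):
            activation.vendor = normalize_vendor(activation.vendor)
            dbsession.add(activation)

# The only argument is the path to the Azafea configuration file
if len(sys.argv) != 2:
    sys.exit(f'Usage: {sys.argv[0]} CONFIG_FILE')

config_file = sys.argv[1]
config = Config.from_file(config_file)
db = Db(config.postgresql.host, config.postgresql.port, config.postgresql.user,
        config.postgresql.password, config.postgresql.database)

with db as dbsession:
    num_activations = dbsession.query(Activation).count()

# Walk the table in fixed-size chunks, updating the bar after each one
for i in range(0, num_activations, CHUNK_SIZE):
    stop = min(i + CHUNK_SIZE, num_activations)
    renormalize_chunk(i, stop)
    progress(stop, num_activations)

# Draw the full bar even if the loop never ran
progress(num_activations, num_activations)

print('\nAll done!')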