Repository containing utilities for generating random data for the DW laboratories.
poetry install
Run poetry run airbase-gen --help
usage: airbase-gen [-h] {csv,sql} ...
positional arguments:
{csv,sql} sub-command help
optional arguments:
-h, --help show this help message and exit
$ poetry run airbase-gen csv --help
usage: airbase-gen csv [-h] [--prob-noisy PROB_NOISY] [--prob-bad PROB_BAD] [-r ROWS] OUT_PATH
positional arguments:
OUT_PATH path to output folder
optional arguments:
-h, --help show this help message and exit
--prob-noisy PROB_NOISY
A probability that a row is generated with noisy quality of data (default: 0.0)
--prob-bad PROB_BAD A probability that a row is generated with bad quality of data (default: 0.0)
-r ROWS, --rows ROWS number of rows to create (default: 1000)
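The help text above can be mirrored with a small `argparse` parser. This is only a sketch reconstructed from the printed usage, not the tool's actual source: the function name `build_csv_parser` and the use of `ArgumentDefaultsHelpFormatter` are assumptions, chosen because they would produce the same "(default: ...)" suffixes.

```python
import argparse


def build_csv_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the `airbase-gen csv` help output."""
    parser = argparse.ArgumentParser(
        prog="airbase-gen csv",
        # Appends "(default: ...)" to every help string, as in the output above.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "--prob-noisy", type=float, default=0.0,
        help="A probability that a row is generated with noisy quality of data",
    )
    parser.add_argument(
        "--prob-bad", type=float, default=0.0,
        help="A probability that a row is generated with bad quality of data",
    )
    parser.add_argument(
        "-r", "--rows", type=int, default=1000,
        help="number of rows to create",
    )
    parser.add_argument("out_path", metavar="OUT_PATH", help="path to output folder")
    return parser


# Parse an example command line instead of sys.argv.
args = build_csv_parser().parse_args(["--prob-noisy", "0.1", "-r", "50", "out"])
print(args.prob_noisy, args.prob_bad, args.rows, args.out_path)
```

Flags that are not passed keep the defaults listed in the help text, so `--prob-bad` stays at 0.0 in this example.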
You need docker and docker-compose.
Run docker-compose up -d and wait until it finishes.
This should give you a postgres instance at 0.0.0.0:54320 with postgres:admin credentials, and a pgadmin4 instance at 0.0.0.0:5050 with admin@pgadmin.com:pgadmin credentials. Note that these are two different sets of credentials for two different services.
$ poetry run airbase-gen sql --help
usage: airbase-gen sql [-h] [--prob-noisy PROB_NOISY] [--prob-bad PROB_BAD] [-r ROWS] [--hard] [-v] [--db-name DB_NAME] [--db-user DB_USER] --db-pwd DB_PWD [--db-host DB_HOST]
[--db-port DB_PORT]
optional arguments:
-h, --help show this help message and exit
--prob-noisy PROB_NOISY
A probability that a row is generated with noisy quality of data (default: 0.0)
--prob-bad PROB_BAD A probability that a row is generated with bad quality of data (default: 0.0)
-r ROWS, --rows ROWS number of rows to store (default: 1000)
--hard wipe database before insertion (default: False)
-v, --verbose sets SQLAlchemy as verbose (default: False)
--db-name DB_NAME database name (default: postgres)
--db-user DB_USER database user (default: postgres)
--db-pwd DB_PWD database password (default: None)
--db-host DB_HOST database host (default: 0.0.0.0)
--db-port DB_PORT database port. The default is 54320, set by docker-compose (default: 54320)
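To see how the `--db-*` options fit together, the sketch below assembles them into a standard PostgreSQL connection URL of the kind SQLAlchemy accepts. The helper name `postgres_url` is hypothetical, and whether the tool builds its URL this way internally is an assumption; the default values are the ones listed in the help text.

```python
from urllib.parse import quote


def postgres_url(
    pwd: str,
    user: str = "postgres",
    host: str = "0.0.0.0",
    port: int = 54320,
    name: str = "postgres",
) -> str:
    """Build a PostgreSQL URL from the --db-* values (hypothetical helper)."""
    # Percent-encode user and password so characters such as "@" or "/"
    # cannot be mistaken for URL delimiters.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(pwd, safe='')}"
        f"@{host}:{port}/{name}"
    )


# "admin" is the postgres password from the docker-compose setup.
print(postgres_url("admin"))
```

Only the password has no usable default, which matches `--db-pwd` being the only option shown without square brackets in the usage line.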
The library uses a BaseConfig class with more settings that can be overridden. To write your own generator, you can look at how this is done in the code (cli.py) and in /tests.
The baseline is:
1. Create a config object from the BaseConfig class, with custom parameters.
2. Pass the config object to the constructor of AircraftGenerator, creating a generator instance ag.
3. Call ag.populate() to generate random elements in memory. These are stored in lists as attributes of ag.
4. Call ag.to_csv() or ag.to_sql(), depending on what you want.
In code, this is roughly equivalent to:
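from acme_data_generation.base.config import BaseConfig
from acme_data_generation.scripts.generate import AircraftGenerator

# Example values (the CLI defaults); out_path and other_args are placeholders.
rows = 1000
prob_noisy = 0.0
prob_bad = 0.0
out_path = "out"
other_args = {}  # any further BaseConfig settings to override

config = BaseConfig(
    size=rows,
    prob_good=(1 - (prob_noisy + prob_bad)),
    prob_noisy=prob_noisy,
    prob_bad=prob_bad,
    **other_args,
)
ag = AircraftGenerator(config)
ag.populate()  # generate the random elements in memory
ag.to_csv(path=out_path)  # or ag.to_sql() to store them in the database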
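The three probabilities passed to the config are meant to add up to 1, which is why prob_good is derived from the other two. The standalone sketch below illustrates that relationship and what it implies for row quality. It does not use the library: `quality_probabilities` and `sample_qualities` are hypothetical helpers, and drawing one independent quality label per row is an assumption about how such a generator might behave.

```python
import random


def quality_probabilities(prob_noisy: float = 0.0, prob_bad: float = 0.0) -> dict:
    """Return good/noisy/bad probabilities, rejecting inconsistent inputs."""
    for name, p in (("prob_noisy", prob_noisy), ("prob_bad", prob_bad)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {p}")
    if prob_noisy + prob_bad > 1.0:
        raise ValueError("prob_noisy + prob_bad must not exceed 1")
    # Same formula as in the snippet above: whatever is left over is "good".
    return {
        "good": 1.0 - (prob_noisy + prob_bad),
        "noisy": prob_noisy,
        "bad": prob_bad,
    }


def sample_qualities(n: int, probs: dict, seed=None) -> list:
    """Draw one quality label per row according to the given probabilities."""
    rng = random.Random(seed)  # seeded generator for reproducible draws
    labels = list(probs)
    weights = [probs[label] for label in labels]
    return rng.choices(labels, weights=weights, k=n)


probs = quality_probabilities(prob_noisy=0.2, prob_bad=0.1)
labels = sample_qualities(1000, probs, seed=42)
print({label: labels.count(label) for label in probs})
```

With the default probabilities of 0.0 for noisy and bad, every sampled label is "good".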
Make sure to read more about the program in the rationale
Run pytest tests/.
The tests use a testdb database at the same postgres address as the one used by the airbase-gen sql command.
Run make coverage to run tests with coverage.
poor man's badge: test coverage 77%
Run make memprofile.generate to produce a memory usage profile.
poor man's badge: memory consumption: 21.1[MB]@1000[rows].
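For a quick memory measurement without the make target, the standard library's `tracemalloc` can report the peak memory allocated during a call. This is a generic sketch, not what `make memprofile.generate` runs: `peak_memory_mb` and `make_rows` are hypothetical names, and `make_rows` is a toy stand-in for the real generator, so its numbers are not comparable to the badge above.

```python
import tracemalloc


def peak_memory_mb(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)


def make_rows(n: int) -> list:
    """Toy stand-in for a generator: builds n small dict rows in memory."""
    return [{"id": i, "value": float(i)} for i in range(n)]


rows, peak_mb = peak_memory_mb(make_rows, 1000)
print(f"{len(rows)} rows, peak {peak_mb:.2f} MB")
```

`tracemalloc` only counts allocations made through Python's memory allocator, so the figure is a lower bound on the process footprint.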
Data generated with this program should
You can read more about how this generator was developed in this short document.