bower / registry

The Bower registry
https://registry.bower.io/packages
MIT License
292 stars 66 forks source link

[Heads up] - writing new tables to production database #86

Closed rayshan closed 8 years ago

rayshan commented 10 years ago

Will backup before this is executed. Will not touch the only packages table so no impact to production clients. Please let me know if you have objections.

Note that below tables do not include all the data we want. Next we need to migrate to a more full featured packages table, which should contain the entire bower.json from packages. More on this later.

Table suffixes- not needed due to small number of tables & lack of dimension tables

Todo

-- bower services stats
CREATE TABLE status (
  status_all JSON NOT NULL -- json for flexibility
)
;

-- e.g. 'last_etl_runtime': 'timestamp'

CREATE TABLE stats_packages (
  date         DATE                     NOT NULL,
  package_name TEXT                     NOT NULL,
-- not restricted as foreign key as there could be things tracked that aren't in the registry
  version      TEXT,
  rank         INTEGER,
-- based on installs
  installs     INTEGER                  NOT NULL,
-- integer +- 2.147 billion, if not installs, shouldn't be in this table
  users        INTEGER,
  updated_at   TIMESTAMP WITH TIME ZONE NOT NULL
-- UTC
)
;

CREATE TABLE github (
  bower_package_name TEXT                     NOT NULL,
-- not restricted as foreign key as there could be things tracked that aren't in the registry
  raw_repo_info      JSON                     NOT NULL,
-- e.g. https://api.github.com/repos/bower/bower
  raw_commits        JSON                     NOT NULL,
-- e.g. https://api.github.com/repos/bower/bower/stats/participation
  updated_at         TIMESTAMP WITH TIME ZONE NOT NULL
-- UTC
)
;

-- storing raw json from github api so we don't have to deal w/ github api changes
-- only stores latest stats

CREATE TABLE stats_geo (
  date          DATE                     NOT NULL,
  country_name  TEXT, -- based on GA
  country_iso_2 CHAR(2),
-- ISO 3166-1 alpha-2
  country_iso_3 CHAR(3) PRIMARY KEY      NOT NULL,
-- ISO 3166-1 alpha-3
  users         INTEGER,
  installs      INTEGER                  NOT NULL,
-- if not installs, shouldn't be in this table
  updated_at    TIMESTAMP WITH TIME ZONE NOT NULL
-- UTC
)
;

CREATE TABLE stats_geo_internet_users (
  country_iso_3  CHAR(3) PRIMARY KEY,
  internet_users INTEGER,
  updated_at     TIMESTAMP WITH TIME ZONE NOT NULL
-- UTC
)
;

-- bower users & npm installs
CREATE TABLE stats_overview (
  date                   DATE                     NOT NULL,
  user_type              CHAR(1)                  NOT NULL, -- N for new / E for existing / T for total (npm only)
  users                  INTEGER                  NOT NULL,
  installs               INTEGER                  NOT NULL, -- bower package installs
  installs_npm           INTEGER                  NOT NULL, -- from npm api
  registry_package_count INTEGER                  NOT NULL,
  updated_at             TIMESTAMP WITH TIME ZONE NOT NULL
-- UTC
)
;

CREATE TABLE stats_web_traffic (
  date       DATE                     NOT NULL,
  source     TEXT                     NOT NULL, -- GA GH
  users      INTEGER                  NOT NULL,
  sessions   INTEGER                  NOT NULL,
  updated_at TIMESTAMP WITH TIME ZONE NOT NULL
-- UTC
)
;

Good to go

-- bower CLI environment data
CREATE TABLE stats_env (
  date         DATE                     NOT NULL,
  version_cli  TEXT,
  version_node TEXT,
  os           TEXT,
  users        INTEGER                  NOT NULL,
  updated_at   TIMESTAMP WITH TIME ZONE NOT NULL
  -- UTC
);
COMMENT ON TABLE stats_env IS 'Stats on Bower CLI users'' environment.';
rayshan commented 10 years ago

This went through a few iterations, since no objections, I'll implement it piece by piece. First up is stats_env_f so we have some new insight for the core team.

sheerun commented 10 years ago

I don't see any _d suffixes. Why even do suffixes?

rayshan commented 10 years ago

Suffixes & dimension tables are useful when you have a huge warehouse of tables, and tables have too many columns that need to be broken into separate dimension tables. I don't foresee we'll ever have that need, so we can just not do suffixes.

sheerun commented 10 years ago

I'd vote for keeping it simple and strip suffixes. stats_ prefix is probably enough.

I don't really understand all the tables you listed, but if you have any doubts about the design we can discuss it on the IRC, or here.

rayshan commented 10 years ago

Sure thing, I'll edit the plan later. I'll follow up w/ you on IRC.