LukePrior / nbn-upgrade-map

Interactive map showing premises eligible for the NBN FTTP upgrade program.
https://nbn.lukeprior.com/
MIT License
124 stars 11 forks source link

Consider more lightweight DB system #268

Open lyricnz opened 1 year ago

lyricnz commented 1 year ago

Following on from the work in https://github.com/LukePrior/nbn-upgrade-map/issues/255

Ideas for reducing the size of the DB. Perhaps it's time to consider:

lyricnz commented 1 year ago

Notes (just experimenting):

sqlite3 test.db

CREATE TABLE address_principals
(
  gnaf_pid text NOT NULL,
  address text NOT NULL,
  locality_name text NOT NULL,
  postcode text NULL,
  state text NOT NULL,
  latitude numeric(10,8) NOT NULL,
  longitude numeric(11,8) NOT NULL
);

sqlite> .mode csv
sqlite> .import address_principals.csv address_principals

CREATE INDEX address_name_state ON address_principals(locality_name, state);

SELECT gnaf_pid, address, postcode, latitude, longitude FROM address_principals WHERE locality_name = "SOMERVILLE" AND state = "VIC"
lyricnz commented 1 year ago

Related idea - host the latest version of the DB as a static+queryable source. Per https://phiresky.github.io/blog/2021/hosting-sqlite-databases-on-github-pages/

lyricnz commented 12 months ago

Possibly a Pandas dataframe saved in parquet file? This is a compressed and queryable format

csv_file = "..../extra/db/address_principals.csv"
df = pd.read_csv(csv_file)
df.to_parquet("rows.parquet")  # 484MB
columns = ['gnaf_pid', 'address', 'postcode', 'latitude', 'longitude']
filters = [
    ('locality_name', '==', 'BLI BLI'),
    ('state', '==', 'QLD'),
]
df2 = pd.read_parquet("rows.parquet", columns=columns, filters=filters)
print(df2.head())
print(len(df2))

emits

2023-09-25 11:37:43,566 INFO     start
         gnaf_pid                       address  ...   latitude   longitude
0  GAQLD720868370              19 WILLOW STREET  ... -26.616855  153.031108
1  GAQLD720861910            21 HILLGROVE COURT  ... -26.618337  153.030099
2  GAQLD720864676               8 WILLOW STREET  ... -26.617568  153.030009
3  GAQLD719559927  UNIT 1, 134 KINGFISHER DRIVE  ... -26.601149  153.015579
4  GAQLD720857697             4 HILLGROVE COURT  ... -26.617756  153.030490

[5 rows x 5 columns]
4844
2023-09-25 11:37:45,287 INFO     elapsed: 0 days 00:00:01.718875
lyricnz commented 12 months ago

We can't use git-lfs on github, because it only provides 1GB of storage, and 1GB/mo of transfer. It's $5/mo for 50+50, but if GHA counts, that'd be used inside a day or two.

https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-storage-and-bandwidth-usage