lyricnz opened 1 year ago
Notes (just experimenting):
```
sqlite3 test.db

CREATE TABLE address_principals
(
    gnaf_pid      text NOT NULL,
    address       text NOT NULL,
    locality_name text NOT NULL,
    postcode      text NULL,
    state         text NOT NULL,
    latitude      numeric(10,8) NOT NULL,
    longitude     numeric(11,8) NOT NULL
);

sqlite> .mode csv
sqlite> .import address_principals.csv address_principals

CREATE INDEX address_name_state ON address_principals(locality_name, state);

-- single quotes: in standard SQL, double-quoted strings are identifiers
SELECT gnaf_pid, address, postcode, latitude, longitude
FROM address_principals
WHERE locality_name = 'SOMERVILLE' AND state = 'VIC';
```
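The same lookup works from Python's stdlib `sqlite3` with parameterised bindings instead of quoted literals. A minimal in-memory sketch (the inserted row is made up for illustration, not real GNAF data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE address_principals (
        gnaf_pid      text NOT NULL,
        address       text NOT NULL,
        locality_name text NOT NULL,
        postcode      text NULL,
        state         text NOT NULL,
        latitude      numeric(10,8) NOT NULL,
        longitude     numeric(11,8) NOT NULL
    )
""")
# Composite index matching the (locality_name, state) lookup below
con.execute("CREATE INDEX address_name_state ON address_principals(locality_name, state)")
con.execute(
    "INSERT INTO address_principals VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("GAVIC0000001", "1 EXAMPLE STREET", "SOMERVILLE", "3912", "VIC", -38.22, 145.17),
)

# Parameterised query: no string quoting pitfalls, and the index is used
rows = con.execute(
    "SELECT gnaf_pid, address, postcode, latitude, longitude "
    "FROM address_principals WHERE locality_name = ? AND state = ?",
    ("SOMERVILLE", "VIC"),
).fetchall()
print(rows)
```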
Related idea - host the latest version of the DB as a static+queryable source. Per https://phiresky.github.io/blog/2021/hosting-sqlite-databases-on-github-pages/
Possibly a Pandas DataFrame saved as a Parquet file? This is a compressed, queryable format.
```python
import pandas as pd

csv_file = "..../extra/db/address_principals.csv"
df = pd.read_csv(csv_file)
df.to_parquet("rows.parquet")  # 484MB

# Read back only the needed columns and rows (predicate pushdown)
columns = ['gnaf_pid', 'address', 'postcode', 'latitude', 'longitude']
filters = [
    ('locality_name', '==', 'BLI BLI'),
    ('state', '==', 'QLD'),
]
df2 = pd.read_parquet("rows.parquet", columns=columns, filters=filters)
print(df2.head())
print(len(df2))
```
emits:

```
2023-09-25 11:37:43,566 INFO start
         gnaf_pid                       address  ...   latitude   longitude
0  GAQLD720868370              19 WILLOW STREET  ...  -26.616855  153.031108
1  GAQLD720861910            21 HILLGROVE COURT  ...  -26.618337  153.030099
2  GAQLD720864676               8 WILLOW STREET  ...  -26.617568  153.030009
3  GAQLD719559927  UNIT 1, 134 KINGFISHER DRIVE  ...  -26.601149  153.015579
4  GAQLD720857697             4 HILLGROVE COURT  ...  -26.617756  153.030490

[5 rows x 5 columns]
4844
2023-09-25 11:37:45,287 INFO elapsed: 0 days 00:00:01.718875
```
We can't use git-lfs on GitHub, because the free tier only provides 1 GB of storage and 1 GB/month of transfer. It's $5/month for 50 GB of storage plus 50 GB of transfer, but if GitHub Actions traffic counts against that quota, it would be used up within a day or two.
Following on from the work in https://github.com/LukePrior/nbn-upgrade-map/issues/255
Ideas for reducing the size of the DB. Perhaps it's time to consider: