Closed: crangelsmith closed this issue 4 years ago
@ld-archer Thanks again for uploading the new OD matrices. Some of the files seem to be corrupted (?), see the errors below:
❯ unzip OD_prob_matrices.zip
Archive: OD_prob_matrices.zip
inflating: F_0to4_prob_matrix_EW.csv
bad CRC a4b9cb8e (should be 963bec50)
inflating: F_16to19_prob_matrix_EW.csv
inflating: F_20to24_prob_matrix_EW.csv
inflating: F_25to34_prob_matrix_EW.csv
inflating: F_35to49_prob_matrix_EW.csv
inflating: F_50to64_prob_matrix_EW.csv
inflating: F_5to15_prob_matrix_EW.csv
inflating: F_65to64_prob_matrix_EW.csv
inflating: F_75plus_prob_matrix_EW.csv
inflating: M_0to4_prob_matrix_EW.csv
error: invalid compressed data to inflate
inflating: M_16to19_prob_matrix_EW.csv
inflating: M_20to24_prob_matrix_EW.csv
inflating: M_25to34_prob_matrix_EW.csv
inflating: M_35to49_prob_matrix_EW.csv
inflating: M_50to64_prob_matrix_EW.csv
inflating: M_5to15_prob_matrix_EW.csv
inflating: M_65to64_prob_matrix_EW.csv
error: invalid compressed data to inflate
inflating: M_75plus_prob_matrix_EW.csv
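For future uploads, it may be worth checking an archive's integrity before sharing it. A minimal sketch using Python's standard-library `zipfile` (the helper name `first_bad_member` is mine, not existing project code):

```python
import zipfile

def first_bad_member(path_or_file):
    """Return the name of the first corrupt member, or None if all CRCs check out."""
    with zipfile.ZipFile(path_or_file) as zf:
        # testzip() reads every member and verifies its stored CRC
        return zf.testzip()
```

`unzip -t OD_prob_matrices.zip` from the shell does the same check.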
Simply by saving the new OD matrices in a sparse representation, we can reduce the size by a factor of ~3, but they are still considerably larger than the previous OD matrices.
# new OD matrix, CSV format
1.0G F_16to19_prob_matrix_EW.csv
# new OD matrix, NPZ format
381M F_16to19_prob_matrix_EW.npz
# old OD matrix, NPZ format
312K F_16to19_OD_matrix_EW.npz
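The saving comes from the COO format storing only the nonzero entries as (row, column, value) triples, so the file size scales with the number of nonzeros rather than the matrix dimensions. A toy illustration (the 1000x1000 shape is arbitrary, not the real OD matrix size):

```python
import numpy as np
from scipy.sparse import coo_matrix

dense = np.zeros((1000, 1000))   # 10^6 float64 cells, ~8 MB in memory
dense[0, 0] = 0.5
dense[999, 999] = 0.5
sparse = coo_matrix(dense)       # stores only the 2 nonzero (row, col, value) triples
print(f"dense: {dense.nbytes} bytes, sparse stored entries: {sparse.nnz}")
```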
@crangelsmith @ld-archer Here is the code to make OD matrices sparse:
# Make OD matrices sparse
import glob
import os

import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix, save_npz

# ---- INPUTS --------------
# read in a weight matrix
# If you have a list of csv files, use wildcards, e.g.:
# "../persistant_data/od_matrices/*.csv"
path2csv = "../persistant_data/od_matrices/*.csv"
# threshold
# all values less than row_max / row_threshold will be set to zero
row_threshold = 10.
# --------------------------

list_of_files = glob.glob(path2csv)
for i, fi_rel in enumerate(list_of_files):
    fi = os.path.abspath(fi_rel)
    print(f"Processing: {fi}")
    od_weights = pd.read_csv(fi).values
    # drop the first (label) column and cast to float
    # (np.float is deprecated; use the builtin float)
    od_val_w = od_weights[:, 1:].astype(float)
    # weights ---> probability distributions for each row
    od_val_w = od_val_w / np.sum(od_val_w, axis=1)[:, None]
    # zero out entries below row_max / row_threshold
    for j in range(od_val_w.shape[0]):
        row_threshold_adjust = np.max(od_val_w[j, :]) / row_threshold
        od_val_w[j, od_val_w[j, :] < row_threshold_adjust] = 0
    od_val_sparse = coo_matrix(od_val_w)
    save_npz(os.path.basename(fi).split(".csv")[0] + ".npz", od_val_sparse)
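One thing to keep in mind when consuming these files: after thresholding, the rows no longer sum to 1, so they need re-normalising before being used as probability distributions. A hedged sketch of the read-back side (`sample_destination` is an illustrative helper, not part of daedalus):

```python
import numpy as np
from scipy.sparse import load_npz

def sample_destination(npz_path, origin_row, rng=None):
    """Draw one destination index for an origin row of a sparse OD probability matrix."""
    rng = rng or np.random.default_rng()
    od = load_npz(npz_path)
    row = od.getrow(origin_row).toarray().ravel()
    row = row / row.sum()  # re-normalise: rows no longer sum to 1 after thresholding
    return rng.choice(len(row), p=row)
```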
If you prefer to have it as a function in daedalus, I can do it later today. Please let me know if I should clarify anything.
Thanks for that @kasra-hosseini. I have made the original probability matrices sparse with your code and uploaded them to the SharePoint: https://thealanturininstitute.sharepoint.com/:u:/r/sites/SPENSER/Shared%20Documents/sparse_OD_prob_matrcies.zip?csf=1&web=1&e=Dg4jo2
That reduced them from 7 GB to around 25 MB!
Great! I will move that piece of code to daedalus then.
@ld-archer we changed the names of these files:
M_65to64_prob_matrix_EW.npz
F_65to64_prob_matrix_EW.npz
to:
M_65to74_prob_matrix_EW.npz
F_65to74_prob_matrix_EW.npz
We were wondering whether this typo (64 instead of 74) was introduced by the above code, or whether it is already present in the IPF output.
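If the typo turns out to be only in the file names (and not in the matrix contents), a one-off rename along these lines would fix it. The helper `fix_age_band_names` and the directory layout are assumptions for illustration, not existing daedalus code:

```python
import glob
import os

def fix_age_band_names(directory):
    """Rename *_65to64_* .npz files to *_65to74_* and return the new basenames."""
    renamed = []
    for old in glob.glob(os.path.join(directory, "*_65to64_*.npz")):
        new = old.replace("65to64", "65to74")
        os.rename(old, new)
        renamed.append(os.path.basename(new))
    return sorted(renamed)
```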
The above code has now been moved to: https://github.com/BenjaminIsaac0111/daedalus/blob/feature/refactoring_pipeline/daedalus/utils.py#L279
Blocked. Waiting for the new matrices.