Closed: crangelsmith closed this issue 4 years ago
@ld-archer Thanks again for uploading the new OD matrices. Some of the files seem to be corrupted (?), see the errors below:
❯ unzip OD_prob_matrices.zip
Archive: OD_prob_matrices.zip
inflating: F_0to4_prob_matrix_EW.csv
bad CRC a4b9cb8e (should be 963bec50)
inflating: F_16to19_prob_matrix_EW.csv
inflating: F_20to24_prob_matrix_EW.csv
inflating: F_25to34_prob_matrix_EW.csv
inflating: F_35to49_prob_matrix_EW.csv
inflating: F_50to64_prob_matrix_EW.csv
inflating: F_5to15_prob_matrix_EW.csv
inflating: F_65to64_prob_matrix_EW.csv
inflating: F_75plus_prob_matrix_EW.csv
inflating: M_0to4_prob_matrix_EW.csv
error: invalid compressed data to inflate
inflating: M_16to19_prob_matrix_EW.csv
inflating: M_20to24_prob_matrix_EW.csv
inflating: M_25to34_prob_matrix_EW.csv
inflating: M_35to49_prob_matrix_EW.csv
inflating: M_50to64_prob_matrix_EW.csv
inflating: M_5to15_prob_matrix_EW.csv
inflating: M_65to64_prob_matrix_EW.csv
error: invalid compressed data to inflate
inflating: M_75plus_prob_matrix_EW.csv
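For future uploads, it may be worth checking an archive's integrity before sharing it. A minimal sketch using Python's standard-library `zipfile` (the helper name `first_bad_member` is mine, not existing project code):

```python
import zipfile

def first_bad_member(path_or_file):
    """Return the name of the first corrupt member, or None if all CRCs check out."""
    with zipfile.ZipFile(path_or_file) as zf:
        # testzip() reads every member and verifies its stored CRC
        return zf.testzip()
```

`unzip -t OD_prob_matrices.zip` from the shell does the same check.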
Simply by saving the new OD matrices in a sparse representation, we can reduce the size by a factor of ~3, but they are still considerably larger than the previous OD matrices.
# new OD matrix, CSV format
1.0G F_16to19_prob_matrix_EW.csv
# new OD matrix, NPZ format
381M F_16to19_prob_matrix_EW.npz
# old OD matrix, NPZ format
312K F_16to19_OD_matrix_EW.npz
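The saving comes from the COO format storing only the nonzero entries as (row, column, value) triples, so the file size scales with the number of nonzeros rather than the matrix dimensions. A toy illustration (the 1000x1000 shape is arbitrary, not the real OD matrix size):

```python
import numpy as np
from scipy.sparse import coo_matrix

dense = np.zeros((1000, 1000))   # 10^6 float64 cells, ~8 MB in memory
dense[0, 0] = 0.5
dense[999, 999] = 0.5
sparse = coo_matrix(dense)       # stores only the 2 nonzero (row, col, value) triples
print(f"dense: {dense.nbytes} bytes, sparse stored entries: {sparse.nnz}")
```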
@crangelsmith @ld-archer Here is the code to make OD matrices sparse:
# Make OD matrices sparse
import glob
import os

import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix, save_npz

# ---- INPUTS --------------
# read in a weight matrix
# If you have a list of csv files, use wildcards, e.g.:
# "../persistant_data/od_matrices/*.csv"
path2csv = "../persistant_data/od_matrices/*.csv"
# threshold
# all values less than row_max / row_threshold will be set to zero
row_threshold = 10.
# --------------------------

list_of_files = glob.glob(path2csv)
for i, fi_rel in enumerate(list_of_files):
    fi = os.path.abspath(fi_rel)
    print(f"Processing: {fi}")
    od_weights = pd.read_csv(fi).values
    # drop the first (label) column and cast to float
    # (np.float is deprecated; use the builtin float)
    od_val_w = od_weights[:, 1:].astype(float)
    # weights ---> probability distributions for each row
    od_val_w = od_val_w / np.sum(od_val_w, axis=1)[:, None]
    # zero out entries below row_max / row_threshold
    for j in range(od_val_w.shape[0]):
        row_threshold_adjust = np.max(od_val_w[j, :]) / row_threshold
        od_val_w[j, od_val_w[j, :] < row_threshold_adjust] = 0
    od_val_sparse = coo_matrix(od_val_w)
    save_npz(os.path.basename(fi).split(".csv")[0] + ".npz", od_val_sparse)
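One thing to keep in mind when consuming these files: after thresholding, the rows no longer sum to 1, so they need re-normalising before being used as probability distributions. A hedged sketch of the read-back side (`sample_destination` is an illustrative helper, not part of daedalus):

```python
import numpy as np
from scipy.sparse import load_npz

def sample_destination(npz_path, origin_row, rng=None):
    """Draw one destination index for an origin row of a sparse OD probability matrix."""
    rng = rng or np.random.default_rng()
    od = load_npz(npz_path)
    row = od.getrow(origin_row).toarray().ravel()
    row = row / row.sum()  # re-normalise: rows no longer sum to 1 after thresholding
    return rng.choice(len(row), p=row)
```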
If you prefer to have it as a function in daedalus, I can do it later today. Please let me know if I should clarify anything.
Thanks for that @kasra-hosseini. I have made the original probability matrices sparse with your code and uploaded them to the SharePoint: https://thealanturininstitute.sharepoint.com/:u:/r/sites/SPENSER/Shared%20Documents/sparse_OD_prob_matrcies.zip?csf=1&web=1&e=Dg4jo2
That reduced them from 7 GB to around 25 MB!
Great! I will move that piece of code to daedalus then.
@ld-archer we changed the names of these files:
M_65to64_prob_matrix_EW.npz
F_65to64_prob_matrix_EW.npz
to:
M_65to74_prob_matrix_EW.npz
F_65to74_prob_matrix_EW.npz
We were wondering whether this typo (64 instead of 74) was introduced by the above code, or whether it is already present in the IPF output.
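If the typo turns out to be only in the file names (and not in the matrix contents), a one-off rename along these lines would fix it. The helper `fix_age_band_names` and the directory layout are assumptions for illustration, not existing daedalus code:

```python
import glob
import os

def fix_age_band_names(directory):
    """Rename *_65to64_* .npz files to *_65to74_* and return the new basenames."""
    renamed = []
    for old in glob.glob(os.path.join(directory, "*_65to64_*.npz")):
        new = old.replace("65to64", "65to74")
        os.rename(old, new)
        renamed.append(os.path.basename(new))
    return sorted(renamed)
```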
The above code has now been moved to: https://github.com/BenjaminIsaac0111/daedalus/blob/feature/refactoring_pipeline/daedalus/utils.py#L279
Blocked. Waiting for the new matrices.