Run TEPs-I

cczhu commented 5 years ago

Run TEPs-I to predict volumes, and retrain TEPs-I to see if we can reproduce the same predictions.

MATLAB should not be required here.

cczhu commented 5 years ago

After acquiring MATLAB 2016r2, and installing Gnuplot 5.2.2 and R 3.6.1 (and RStudio), I can now access the TEPs-I GUI under KCOUNT/codes/App2.mlapp. Not sure if it's supposed to be here. More worryingly, attempting to run the model using the GUI with the default settings (except Working Directory, which is C:\Users\czhu5\Documents\VolumeModel\TEPS-dev\) leads to an error message saying that nansum is missing:

Error using STTC_estimate3 (line 116)
Undefined function 'nansum' for input arguments of type 'double'.

Error in main_DoM_new_2012 (line 30)
        parfor iyear=Start_year:End_year

Error in main_combined_2 (line 50)
                    main_DoM_new_2012(str1,direction{idir},FY.Value,EY.Value,base_year,allyearindex.Value,ishort_krig);

Error in App2/EstimateAADTsButtonPushed (line 322)
                main_combined_2(app.WorkingDirectoryEditField,option, ...

Error in appdesigner.internal.service.AppManagementService/tryCallback (line 207)
                    callback(app, event);

Error in
matlab.apps.AppBase>@(source,event)tryCallback(appdesigner.internal.service.AppManagementService.instance(),app,callback,requiresEventData,event) 
Error using matlab.ui.control.internal.controller.ComponentController/executeUserCallback (line 262)
Error while evaluating Button PrivateButtonPushedFcn

nansum is part of the Statistics and Machine Learning add-on to MATLAB, which costs $1000 USD...
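For reference, what nansum computes is simple enough to sketch in a few lines. The Python below is only an illustration of the NaN-skipping, column-wise sum (my understanding of the toolbox default for a matrix); the function names and sample counts are made up and it is not the TEPs code.

```python
import math


def nansum(values):
    """Sum a sequence of floats, skipping NaN entries."""
    return sum(v for v in values if not math.isnan(v))


def nansum_columns(rows):
    """Column-wise NaN-ignoring sum of a list of equal-length rows."""
    return [nansum(column) for column in zip(*rows)]


# Hypothetical counts with missing (NaN) observations.
counts = [
    [10.0, float("nan"), 3.0],
    [float("nan"), 5.0, 4.0],
]
print(nansum_columns(counts))  # [10.0, 5.0, 7.0]
```

A replacement with the same name but a different convention for which dimension it sums over, or for all-NaN input, would return an array of a different shape, so any drop-in needs checking against the call sites in STTC_estimate3.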

cczhu commented 5 years ago

Initialized a local WSL git repo under TEPS-dev to track any changes I make to the code. Will add various files to the .gitignore as needed.

cczhu commented 5 years ago

Jan Gläscher uploaded a version of nansum in his NaN Suite. Imported this suite into charles/nansuite and appended the path under main_combined_2.m. Result:

Error using STTC_estimate3 (line 125)
Subscripted assignment dimension mismatch.

Error in main_DoM_new_2012 (line 30)
        parfor iyear=Start_year:End_year

Error in main_combined_2 (line 53)
                    main_DoM_new_2012(str1,direction{idir},FY.Value,EY.Value,base_year,allyearindex.Value,ishort_krig);

Error in App2/EstimateAADTsButtonPushed (line 322)
                main_combined_2(app.WorkingDirectoryEditField,option, ...

Error in appdesigner.internal.service.AppManagementService/tryCallback (line 207)
                    callback(app, event);

Error in
matlab.apps.AppBase>@(source,event)tryCallback(appdesigner.internal.service.AppManagementService.instance(),app,callback,requiresEventData,event) 
Error using matlab.ui.control.internal.controller.ComponentController/executeUserCallback (line 262)
Error while evaluating Button PrivateButtonPushedFcn

which suggests we can't just replace whatever version of nansum Arman used with this one.

cczhu commented 5 years ago

Different tack - try running TEPs-I's executable using the MATLAB 2016b (Update 6) runtime downloaded from MathWorks. Created TEPS-exerun to run this, since I don't know what intermediate files the executable will create that could interfere with my testing above. Will run the exe overnight using all default settings (since the exe is in the root directory I don't even have to change the working directory).

cczhu commented 5 years ago

TEPS-exerun successfully returns a series of diagnostic graphs from PRTCS:

but soon after also plotting the KCOUNT diagnostic figures (manual Figs. C-1 - C-3), it suddenly closes all figures and returns this:

Not sure what version of TEPs-I the exe is built from, since the only uncommented msgbox('Error: Model need revision!') in App2.mlapp is under function ARIMAButtonPushed, used for PECOUNT-I.

cczhu commented 5 years ago

Deployed a temporary GitHub repo to house TEPs-I's original source code, mainly as a backup for local files and to communicate with Arman as needed. I'm NOT including it under bdit_teps because:

There are definitely redundant files scattered throughout the repo.
Not all files required to run TEPs are included, including parameter files (e.g. locations of count stations). Including any of these files at all would balloon the repo to well above several hundred megabytes.
TEPs doesn't run on our systems yet, and it is unknown why.

2021-07-13 update - I deleted the temporary repo, to avoid any confusion from future GitHub spelunkers. Copies of TEPS are available on L: drive.

cczhu commented 5 years ago

There's a program dependency report generator! Dumped reports in PDF form into DependencyReports folder.

cczhu commented 5 years ago

TEPs Dependencies

Here are all the functions and objects that could not be identified by MATLAB's dependency reports.

Emission

boxcox - Financial Toolbox; Box-Cox transformation
gregnet2b - included as a .mat
nanmean - Statistics and Machine Learning Toolbox; mean, remove NaN first
nansum - Statistics and Machine Learning Toolbox; sum, remove NaN first
newff - Deep Learning Toolbox; deprecated version of feedforwardnet
plotregression - Deep Learning Toolbox; plot linear regression
regstats - Statistics and Machine Learning Toolbox; regression diagnostics
sim - Deep Learning Toolbox; simulate NN
train - Deep Learning Toolbox; train NN

export_fig

imresize: Image Processing Toolbox; resizes image

KCOUNT

cholcov: Statistics and Machine Learning Toolbox; Cholesky-like covariance decomposition
corr: Statistics and Machine Learning Toolbox; pairwise linear correlation coefficient
dataset: Statistics and Machine Learning Toolbox; construct dataset array
dummyvar: Statistics and Machine Learning Toolbox; create dummy variable
fitlm: Statistics and Machine Learning Toolbox; fit linear regression model to dataset array
gcp: Parallel Computing Toolbox
grpstats: Statistics and Machine Learning Toolbox; summary statistics organized by group
mat2dataset: Statistics and Machine Learning Toolbox; convert matrix to dataset
nansum: Statistics and Machine Learning Toolbox; sum, remove NaN first
nominal: Statistics and Machine Learning Toolbox; discrete, nonnumeric values
predict: Statistics and Machine Learning Toolbox; predict with fitlm
pDist: UNKNOWN! (Though part of find_ids_for_pred, which appears to be unused)
quantile: Statistics and Machine Learning Toolbox; quantiles of a data set
weights: UNKNOWN!

LocalSVR

All good.

OptimStation

corr: Statistics and Machine Learning Toolbox; pairwise linear correlation coefficient
dataset: Statistics and Machine Learning Toolbox; construct dataset array
dummyvar: Statistics and Machine Learning Toolbox; create dummy variable
fitscalingprop: Global Optimization Toolbox; genetic algorithm option
fitlm: Statistics and Machine Learning Toolbox; fit linear regression model to dataset array
fminsearchbnd: UNKNOWN! (Though the documentation says it'll default to fminsearch, which we do have)
gaplotbestf: Global Optimization Toolbox; genetic algorithm best fit plot
grpstats: Statistics and Machine Learning Toolbox; summary statistics organized by group
mat2dataset: Statistics and Machine Learning Toolbox; convert matrix to dataset
nansum: Statistics and Machine Learning Toolbox; sum, remove NaN first
nominal: Statistics and Machine Learning Toolbox; discrete, nonnumeric values
optimoptions: Global Optimization Toolbox; genetic algorithm options
parcluster: Parallel Computing Toolbox
parpool: Parallel Computing Toolbox
predict: Statistics and Machine Learning Toolbox; predict with fitlm
pDist: UNKNOWN!
quantile: Statistics and Machine Learning Toolbox; quantiles of a data set
saveProfile: Parallel Computing Toolbox
selectionstochasticuniform: Global Optimization Toolbox; genetic algorithm option
selectiontournament: Global Optimization Toolbox; genetic algorithm option
selectionuniform: Global Optimization Toolbox; genetic algorithm option

PECOUNT

All good.

PRTCS

cleanUpUrl - UNKNOWN!
corr: Statistics and Machine Learning Toolbox; pairwise linear correlation coefficient
gcp: Parallel Computing Toolbox
nanmean - Statistics and Machine Learning Toolbox; mean, remove NaN first
nansum - Statistics and Machine Learning Toolbox; sum, remove NaN first

From this it looks like there's no way we can run the Emission or OptimStation modules without the Global Optimization, Financial, Deep Learning and Statistics/ML Toolboxes, though we weren't planning on doing that anyway. We cannot run KCOUNT without the Statistics/ML Toolbox, though. We could probably run PRTCS if Arman could dump corr, nanmean and nansum and send them over.

The UNKNOWN!s will need to be investigated further.
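The conclusion above can be double-checked mechanically. The following Python sketch encodes the module-to-toolbox mapping from the lists in this comment and reports which modules could run for a given set of licences. The function and variable names are mine, and the UNKNOWN! functions are ignored, so this is an optimistic estimate rather than a guarantee.

```python
# Toolboxes each module depends on, transcribed from the dependency lists above.
REQUIRED_TOOLBOXES = {
    "Emission": {"Financial", "Statistics and Machine Learning", "Deep Learning"},
    "export_fig": {"Image Processing"},
    "KCOUNT": {"Statistics and Machine Learning", "Parallel Computing"},
    "LocalSVR": set(),
    "OptimStation": {
        "Statistics and Machine Learning",
        "Global Optimization",
        "Parallel Computing",
    },
    "PECOUNT": set(),
    "PRTCS": {"Statistics and Machine Learning", "Parallel Computing"},
}


def runnable_modules(licensed):
    """Return the modules whose required toolboxes are all in `licensed`."""
    licensed = set(licensed)
    return sorted(
        module
        for module, needed in REQUIRED_TOOLBOXES.items()
        if needed <= licensed
    )


print(runnable_modules([]))
print(runnable_modules(["Statistics and Machine Learning", "Parallel Computing"]))
```

With no add-ons only LocalSVR and PECOUNT pass; adding the Statistics/ML and Parallel Computing Toolboxes brings in KCOUNT and PRTCS.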

cczhu commented 5 years ago

| Name | Used In | Cost (USD for Annual License) | Necessary? |
| --- | --- | --- | --- |
| Statistics and Machine Learning Toolbox | Emission, KCOUNT, OptimStation, PRTCS | 400 | Yes |
| Parallel Computing Toolbox | KCOUNT, OptimStation, PRTCS | 400 | Maybe |
| Deep Learning Toolbox | Emission | 500 | No |
| Global Optimization Toolbox | OptimStation | 400 | No |
| Financial Toolbox | Emission | 740 | No (maybe Arman can just send it to us) |
| Image Processing Toolbox | export_fig | 400 | No |

aharpalaniTO commented 5 years ago

Unfortunately for the POA it looks like we're going to need the same process as for C4C (briefing note + division head and PMMD Director signatures). There must be an easier way.

cczhu commented 5 years ago

Going back to running the executable, Arman noted that we need to add R to the PATH environment variable in order to run Rscript (LocalSVR/codes/build_infile_SVR.m). I added this.
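A quick way to confirm the PATH change took effect, without launching TEPs, is to look up Rscript the same way the shell would. This is a small Python sketch using only the standard library; `find_rscript` and its `extra_dirs` parameter are hypothetical helpers, not part of TEPs.

```python
import os
import shutil


def find_rscript(extra_dirs=()):
    """Return the full path to Rscript if it can be found, else None.

    `extra_dirs` lets you test a candidate R bin directory before
    editing the PATH environment variable permanently.
    """
    search_path = os.pathsep.join([os.environ.get("PATH", ""), *extra_dirs])
    return shutil.which("Rscript", path=search_path)


location = find_rscript()
if location is None:
    print("Rscript not found; add R's bin directory to PATH")
else:
    print(f"Rscript found at {location}")
```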

Then tried to run only PRTCS. This was successful:

The "File not found: run KCOUNT for both directiosn and check working path" message comes from Emission/codes/pos_neg_sum.m, and is triggered by not running KCOUNT (there's an equivalent error for not running LocalSVR).

Running both PRTCS and KCOUNT leads to "Error: Model need revision!".

cczhu commented 5 years ago

Moving back to the MATLAB 2016 environment, now with all the relevant add-ons listed above, we can successfully run PRTCS:

Running KCOUNT afterward leads to:

Error using calcVarMat (line 19)
Index exceeds matrix dimensions.

Error in main_2_2012_min (line 377)
varmat = calcVarMat(kDist, residv);

Error in main_combined_2 (line 60)
                    main_2_2012_min(str2,direction{idir},strcat(path.Value,'KCOUNT\RMsma_2km_neg\'),base_year,lam,bins);

Error in App2/EstimateAADTsButtonPushed (line 322)
                main_combined_2(app.WorkingDirectoryEditField,option, ...

Error in appdesigner.internal.service.AppManagementService/tryCallback (line 207)
                    callback(app, event);

Error in matlab.apps.AppBase>@(source,event)tryCallback(appdesigner.internal.service.AppManagementService.instance(),app,callback,requiresEventData,event)

Error using matlab.ui.control.internal.controller.ComponentController/executeUserCallback (line 262)
Error while evaluating Button PrivateButtonPushedFcn

This error comes from a mismatch between the shape of kDist, 5104 x 5104, and residv, 4982 x 1. From reading main_2_2012_min.m the former is loaded by the line in preProcDist_new.m

kIDs1 = csvread([path name '_obs_' num2str(base_year) '.txt'])

while the latter is populated from kMat2, which is read from

kMat2=csvread(strcat(path,'data_for_fit',num2str(base_year),'.txt'),1);

(At least for 2011) the data_for_fit files that feed residv are updated (PRTCS/output_for_kriging2011negative/data_for_fit2011.txt, PRTCS/output_for_kriging2011positive/data_for_fit2011.txt and their copies under KCOUNT/RMsma_2km_neg and KCOUNT/RMsma_2km_pos, respectively). The ID files behind kDist (nominally PRTCS/output_for_kriging2011negative/ids_obs_2011.txt and PRTCS/output_for_kriging2011positive/ids_obs_2011.txt) are not. Looks like we have to manually trigger ishort_krig.Value, which is set by the app.ShortestpathreanalysisCheckBox boolean that the app passes to main_combined_2.
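This kind of staleness could be caught before KCOUNT runs by comparing row counts in the two input files. The Python sketch below assumes one row per station in each file, no header in the ids_obs file, and one header line in data_for_fit (matching the `csvread(..., 1)` call above). The helper names, column names, and toy contents are all made up for illustration.

```python
import csv
import io


def count_rows(text, skip_header=0):
    """Count non-empty CSV rows in `text`, excluding `skip_header` leading rows."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    return len(rows) - skip_header


def check_kriging_inputs(ids_text, fit_text):
    """Compare station counts in an ids_obs file and a data_for_fit file.

    Returns (consistent, n_ids, n_fit). A mismatch would reproduce the
    kDist vs. residv shape disagreement described above.
    """
    n_ids = count_rows(ids_text)
    n_fit = count_rows(fit_text, skip_header=1)
    return n_ids == n_fit, n_ids, n_fit


# Toy stand-ins for ids_obs_2011.txt and data_for_fit2011.txt.
ids_text = "101\n102\n103\n"
fit_text = "id,aadt\n101,5000\n102,6200\n"
print(check_kriging_inputs(ids_text, fit_text))  # (False, 3, 2)
```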

cczhu commented 5 years ago

TEPs-I raw volume data:

SELECT * FROM prj_volume.uoft_centreline_volumes_output
LIMIT 100

cczhu commented 5 years ago

Can now run PRTCS, KCOUNT and LocalSVR in sequence! See GUI panel for settings.

cczhu commented 5 years ago

Currently can also run PECOUNT-I and PECOUNT-II, but uncertain of which stations Arman flags as Aggregate Stations, and which medium term counts he selects as downstream stations for each aggregate one.

We do know (but haven't run) that the CSV dump of prj_volume.uoft_centreline_volumes_output is parsed into individual station counts using the UNIX script Arman included in the TEPs Manual appendix. This should be fairly straightforward to reproduce using Python (or even Postgres).
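As a rough idea of what the Python version might look like, the sketch below groups the rows of a CSV dump by station. The column names (`centreline_id`, `count_time`, `volume`) and the sample rows are assumptions for illustration only; the real dump and Arman's shell script may differ in layout, so treat this as a starting point rather than a reproduction.

```python
import csv
import io
from collections import defaultdict


def split_by_station(csv_text, id_column="centreline_id"):
    """Group the rows of a CSV dump by station ID.

    Returns a dict mapping each ID (as a string) to its list of row dicts,
    preserving the order in which rows appear in the dump.
    """
    stations = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        stations[row[id_column]].append(row)
    return dict(stations)


# Hypothetical three-row dump covering two centreline segments.
sample = """centreline_id,count_time,volume
1001,2011-05-02 08:00,120
1002,2011-05-02 08:00,45
1001,2011-05-02 08:15,133
"""
by_station = split_by_station(sample)
for station_id, rows in sorted(by_station.items()):
    print(station_id, len(rows))
```

Each group could then be written to its own text file to mimic the per-station files inside the 15 minute count archives.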

cczhu commented 5 years ago

~~Close examination of the output of PRTCS suggests a bug: for output_PRTCS_2011_negative (and positive), permanent station AADTs are found in Perm_AADT_2011.txt and temporary ones in Temp_AADT_2011.txt. Station ID is the centreline ID extracted from uoft_centreline_volumes_output. IDs in Perm_AADT_2011.txt are found in Temp_AADT_2011.txt. Permanent and temporary stations are distinguished in lines 161-178 of PRTCS/codes/main_DoM_new_2012.m - all stations are under Ms_abs, while temporary and permanent stations should be separated into DoM_PTC and DoM_STTC, respectively. However, the number of unique centreline IDs in Ms_abs(:,4) is identical to the number under DoM_STTC(:,5) (or MSE(:,2), which is copied from DoM_STTC(:,5)).

Since this bug was only discovered by a close inspection of the code, but a close inspection was also required to fully understand the columns being output by each TEPs module, this suggests we should be converting each module in sequence (starting with PRTCS) rather than using a top-down approach of building the entire test suite out first before beginning the conversion process. This way we can read the code in detail and find any bugs before moving onto the next step. We can still create canned inputs using either Arman's original code or my revised one to feed into another module of TEPs to create test outputs.~~

Update: this is probably just the way Arman names things. As far as I can tell whether a station is permanent is checked by its centreline ID and year within the code, and in some cases permanent count stations are deliberately duplicated across multiple files or variables so they can be used for validation or error estimates.

In any case I've found far worse issues with PRTCS.

cczhu commented 5 years ago

Annoying find - PRTCS/codes/data_prep_kridging.m includes land use data, station counts, and possibly has a different definition of which stations have AADTs (line 161). I'm not certain why it's even part of PRTCS, since data_prep_kridging.m doesn't appear to ingest any data from PRTCS/output_PRTCS_<YEAR>_<DIRECTION>. Perhaps there's an undocumented intermediate step that creates shortest_path.zip which also incorporates the station count data?

Update: this isn't quite true - the land use data was determined separately, but the AADT estimates do come from PRTCS. See the flow chart and description in the wiki.

cczhu commented 5 years ago

Following discussion with Jesse and Aakash:

Check that we're able to compile and run PECOUNT-I and PECOUNT-II on a Linux system using a Python control scheme.
Set up a meeting with Arman within the next two weeks. We should get him to answer the questions in #3.
Goals for the handoff should be:
- Reproducing two years of aadt_output_files/final_aadt_<YEAR>.csv to acceptable accuracy, starting from the PRTCS/<POS_OR_NEG>/15min_counts<YEAR>.zip files.
- Determining how to edit files to incorporate new permanent and temporary count stations or to revise land use data, etc. This should answer how to generate some of the inputs to later-stage modules like LSVR.
- Determining how to produce PECOUNT-I inputs.
Create a revised work plan following the meeting with Arman.

cczhu commented 5 years ago

With Arman's assistance in person, discovered that I was missing some files under KCOUNT\RMsma_2km_pos (they are on OneDrive, just not downloaded locally). Some notes from our meeting:

Because Arman's computer required connecting to OneDrive to download certain files on his version of TEPs, we were unable to make a complete hard copy of the code. He's offered to meet me again at UofT campus to transfer a properly synced copy of TEPs if we find a significant number of missing files.
PECOUNT is NOT run to create the PRTCS/<POS_OR_NEG>/15min_counts<YEAR>.zip folders - these are created simply by dividing up our raw data. Arman's tested the module's predictive accuracy (see his TRC_2018_1199 paper), but hasn't tried running the remainder of TEPs with augmented data.
15min_counts<YEAR>.zip files have a dummy number in their filenames because Arman was, for another project, working with multiple sensors per centreline segment. Likewise the first column of each txt file within each zip is a dummy column from the Unix shell script.
re<CENTRELINE>_<DUMMY>.txt inside 15 min count data comes from HW401. Except for the difference in filename they're formatted in the same way as the other files.
max_counts and min_counts, set in lines 113 - 296 of KCOUNT/codes/main_2_2012_min.m, are not used by KCOUNT, and so can be ignored. They were originally limiters on the number of counts per road type.
The 41 road types do not come from City data.
Arman will provide in-line comments for data_prep_kriging.m over e-mail.
Arman will provide data dictionaries and instructions on how to create any additional intermediate files I'm confused by over e-mail. I will send him a file of filenames for him to explain.
Arman recommends against re-creating shortest_path.zip, but is willing to help us do this.
He recommends the following run to check that TEPs can run end-to-end:

Action items for me:

E-mail Arman a revised set of questions.
Monitor a TEPs run with the above settings, and e-mail Arman if anything fails.

cczhu commented 5 years ago

The above run has successfully completed, and we've generated a new set of final_aadt_<YEAR>.csv files. They're not identical to the ones from Arman's OneDrive (in some cases even exceeding the 95% CI estimates provided). I've contacted Arman about this and will update this issue once he responds.
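The comparison can be made systematic by flagging every segment whose rerun AADT falls outside the reference 95% CI. The Python sketch below assumes the reference values have already been loaded as ID -> (aadt, lower, upper) and the rerun as ID -> aadt. That layout, the function name, and the numbers are hypothetical, since I'm not reproducing the actual final_aadt_<YEAR>.csv columns here.

```python
def outside_interval(reference, rerun):
    """Return IDs whose rerun AADT lies outside the reference interval.

    reference: dict of id -> (aadt, ci_lower, ci_upper)
    rerun:     dict of id -> aadt
    IDs missing from the rerun are skipped rather than flagged.
    """
    flagged = []
    for station_id, (_, lower, upper) in sorted(reference.items()):
        new_value = rerun.get(station_id)
        if new_value is None:
            continue
        if not (lower <= new_value <= upper):
            flagged.append(station_id)
    return flagged


# Made-up numbers purely to exercise the function.
reference = {
    1001: (5000.0, 4500.0, 5500.0),
    1002: (800.0, 700.0, 900.0),
    1003: (12000.0, 11000.0, 13000.0),
}
rerun = {1001: 5100.0, 1002: 950.0}
print(outside_interval(reference, rerun))  # [1002]
```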

cczhu commented 4 years ago

Odd issue I discovered while running TEPS-dev - some of the PTC IDs in validation_2010.txt files produced by TEPS-exerun are not in TEPS-dev. The missing counts are not included in test_id_negative and test_id_positive in STTC_estimate3.m, so weren't eliminated by data preprocessing. I can't tell if this is an issue with the executable in TEPS-exerun being different than the source code, but at this stage of our development it's probably not fruitful to investigate.

cczhu commented 4 years ago

Further investigation reveals that this is because TEPS-dev was running on 2006-2013 data, and TEPS-exerun was running on 2006-2016. The additional PTCs from 2013-2016 lead to outlier points on the Observed-Predicted plot:

This lends further credence to the discussion in #14 on validation issues.
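The effect of the different year ranges on the set of PTC IDs can be illustrated with a simple set difference. In this Python sketch the (ID, year) records and the function name are invented; it only shows why a 2006-2016 run contains IDs that a 2006-2013 run never sees.

```python
def ids_in_range(records, first_year, last_year):
    """Return the set of station IDs with at least one record in the year range."""
    return {
        station_id
        for station_id, year in records
        if first_year <= year <= last_year
    }


# Hypothetical (station ID, year) pairs for permanent count stations.
records = [(11, 2006), (11, 2012), (12, 2010), (13, 2014), (14, 2016)]

dev_ids = ids_in_range(records, 2006, 2013)      # range used by TEPS-dev
exerun_ids = ids_in_range(records, 2006, 2016)   # range used by TEPS-exerun

# IDs that show up only when the later years are included.
print(sorted(exerun_ids - dev_ids))  # [13, 14]
```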

cczhu commented 4 years ago

A couple of notes on how to run TEPs learned while completing #41:

"Year of analysis" is the year for output AADTs - selecting "All years" produces outputs for all years where input data is available.
"Estimate AADTs" and "Estimate Vehicle speed" actually call the same main_combined_2 code in the backend. "Estimate AADTs" explicitly unchecks the "Vehicle speed" box, while "Estimate Vehicle speed" unchecks the "PRTCS", "KCOUNT", and "LocalSVR" boxes.
"Zip files for TEPS-II" indiscriminately copies Emission/inputs/*.csv and Emission/outputs/*.csv into EED/inputs/ and EED/outputs/ folders.
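The button behaviour described above can be summarized as a small function from button name to effective options. This Python sketch only models the notes in this comment; the dictionary layout and the function name are mine and do not mirror App2.mlapp's internals.

```python
def options_for_button(button, checkboxes):
    """Return the effective module options after a button press.

    Both buttons end up calling the same backend; they differ only in
    which checkboxes they force off first.
    """
    options = dict(checkboxes)  # copy so the GUI state is not mutated
    if button == "Estimate AADTs":
        options["Vehicle speed"] = False
    elif button == "Estimate Vehicle speed":
        for name in ("PRTCS", "KCOUNT", "LocalSVR"):
            options[name] = False
    else:
        raise ValueError(f"unknown button: {button}")
    return options


# Hypothetical GUI state with every box ticked.
all_checked = {
    "PRTCS": True,
    "KCOUNT": True,
    "LocalSVR": True,
    "Vehicle speed": True,
}
print(options_for_button("Estimate AADTs", all_checked))
print(options_for_button("Estimate Vehicle speed", all_checked))
```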

CityofToronto / bdit_traffic_prophet

Run TEPs-I #4
