e-mission / em-public-dashboard

A simple and stupid public dashboard prototype.
BSD 3-Clause "New" or "Revised" License

Paper Reproducibility Changes #102

Open Abby-Wheelis opened 9 months ago

Abby-Wheelis commented 9 months ago

As I am going through the charts in the paper to polish them up, I am also taking the time to organize, document, and check in the code used to produce those results. This maintains transparency for future researchers who might want to reproduce our results.

Added a DataFiltering Notebook:

Analysis Notebooks: planning for one with non-spatial data and one to work with spatial data; will update as I make these changes, as the plan may change depending on the data formats.

Another note: much of this code is coming from a previous researcher who worked on the paper, Cemal Akcicek. My work is focused on organizing and polishing.

Abby-Wheelis commented 9 months ago

This is ending up being VERY tricky and confusing. The goal is to have the results and charts we show in the paper be 100% reproducible from the TSDC data -- open source data and script to allow for full transparency and reproducibility. This will not only benefit the credibility of this paper, but will hopefully lay the groundwork to make analysis of other, future, OpenPATH programs archived in the TSDC easy and accessible.

However, the data originally used to generate the paper is not the same as what the TSDC will be providing. The column names in the csvs are almost all different, and the TSDC is redacting a fair bit of information. The problematic columns being redacted (so far) include Age (used for some analysis of the effect of age on e-bike usage) and trip timestamps (a key datapoint in calculating when someone's first e-bike trip was, and then cleaning out all data before that point).

I've been spinning my wheels for a little more than a week now, trying to reverse engineer a way to get the data from the TSDC cleaned and filtered in such a way that it matches the process we outline in the paper and the numbers it yielded. I'm hoping it will help to write out my thought processes a little more; I've been doing this in a somewhat scattered manner but should really include it here.

If I'm able to get the data to "match", the next hurdle is that a fair number of the charts rely on the data being loaded into the database, which I personally can't do because my computer can't process that much data at once without completely wigging out. Even if I could (there are some parts of the dataset that I won't need for this highish-level analysis, so I could clean them out and try loading a smaller subset of data), that's not the format the TSDC is providing. So I'll need to either A) reconcile the chart-generating tactics with the different data sourcing, or B) wrestle the "matched" data into a zipped file compatible with the database loading.

I'll keep updating here with my thoughts and progress as I have / continue to try different things

Abby-Wheelis commented 9 months ago

One issue I've encountered is the lack of Mode_confirm, etc columns in the TSDC data. @shankari pointed out that these columns come from the dictionaries in viz_scripts/auxillary_files and are leveraged by viz_scripts/scaffolding.py. I examined the code in scaffolding.py and ended up scraping the following lines from that file in order to use them in the data cleaning process I'm trying to develop:

```python
# first, add the cleaned mode
data['Mode_confirm'] = data['data_user_input_mode_confirm'].map(dic_re)

# second, add the cleaned replaced mode (ASSUMES PROGRAM)
data['Replaced_mode'] = data['data_user_input_replaced_mode'].map(dic_re)

# third, add the cleaned purpose
data['Trip_purpose'] = data['data_user_input_purpose_confirm'].map(dic_pur)
```

I did not just use the functions in scaffolding.py because they assume the database paradigm, and I am working from csvs.
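To illustrate the csv-side approach, here is a minimal, self-contained sketch of applying those mappings to a csv-backed dataframe. The dictionary contents and column values below are made-up stand-ins; the real `dic_re` / `dic_pur` live in `viz_scripts/auxillary_files`:

```python
import pandas as pd

# Hypothetical stand-ins for the dictionaries loaded from
# viz_scripts/auxillary_files; the real keys/values live there.
dic_re = {"pilot_ebike": "E-bike", "drove_alone": "Gas Car, drove alone"}
dic_pur = {"work": "Work", "shopping": "Shopping"}

# In practice this frame would come from pd.read_csv on a TSDC export
data = pd.DataFrame({
    "data_user_input_mode_confirm": ["pilot_ebike", "drove_alone"],
    "data_user_input_replaced_mode": ["drove_alone", "pilot_ebike"],
    "data_user_input_purpose_confirm": ["work", "shopping"],
})

# Same mapping logic as the scaffolding.py lines above, applied to csvs
data["Mode_confirm"] = data["data_user_input_mode_confirm"].map(dic_re)
data["Replaced_mode"] = data["data_user_input_replaced_mode"].map(dic_re)
data["Trip_purpose"] = data["data_user_input_purpose_confirm"].map(dic_pur)
```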

shankari commented 9 months ago

I am fine with this for now, but we should revisit this whole mapping when we re-do the energy/emissions work.

In general, we should implement scaffolding as an abstract class with two concrete subclasses. We already use something similar for the abstract timeseries, and since we use the base image anyway, we can also switch to using that standard approach and adding a new implementation instead of having 5 different implementations for data access.

Abstractions FTW!
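One possible shape for that abstraction, as a sketch only (class and method names here are invented, not the actual implementation):

```python
from abc import ABC, abstractmethod

import pandas as pd


class ScaffoldingBase(ABC):
    """One interface for trip access, regardless of the backing store."""

    @abstractmethod
    def load_confirmed_trips(self, program: str) -> pd.DataFrame:
        ...


class DatabaseScaffolding(ScaffoldingBase):
    def load_confirmed_trips(self, program):
        # would query the timeseries database, as scaffolding.py does today
        raise NotImplementedError


class CsvScaffolding(ScaffoldingBase):
    def __init__(self, base_dir):
        self.base_dir = base_dir

    def load_confirmed_trips(self, program):
        # hypothetical file layout for the TSDC csv exports
        return pd.read_csv(f"{self.base_dir}/{program}_confirmed_trips.csv")
```

Chart code would then depend only on `ScaffoldingBase`, so the database and csv paths stay interchangeable.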

Abby-Wheelis commented 9 months ago

Another issue is the numbers not matching up. When I read in the TSDC data, I read it in program by program, matching the confirmed trips with the sociodemographic data, then concatenating the merged data into the dataframe so that I end up with all of the merged data together. But the number of users in this dataset is less than I would expect.

As I accumulate the data, the programs have 13, 47, 29, 14, 14, and 9 users, for a total of 126. That seems a little low for having only merged the sociodemographic data and done no other cleaning, but it is still > 122, so OK. However, after all the data is put together there are only 112 unique IDs. This is a problem.

I'm worried that some of the users in different programs ended up with the same random ID as a result of the data cleaning process. I'm going to work on a way to prevent this - maybe appending the program name to the beginning of the id before I add that program to all the other programs.
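A minimal sketch of that prefixing idea (column and program names here are hypothetical):

```python
import pandas as pd

def tag_user_ids(df, program):
    # prefix the program name so per-program random ids can't collide
    # when the programs are concatenated together
    df = df.copy()
    df["user_id"] = program + "_" + df["user_id"].astype(str)
    return df

# two programs that happened to assign the same random id
vail = pd.DataFrame({"user_id": ["abc123"], "distance": [1.2]})
sc = pd.DataFrame({"user_id": ["abc123"], "distance": [3.4]})

all_trips = pd.concat([tag_user_ids(vail, "vail"), tag_user_ids(sc, "sc")])
# two distinct users even though the raw ids collided
```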

Abby-Wheelis commented 9 months ago

I'm worried that some of the users in different programs ended up with the same random ID

Yep, I think this is exactly what happened. After adding the program name to the id, the number of unique users after compiling the programs went from 112 to the accurate 126. One step closer to finding the right data cleaning process!

BUT after just part of the filtering we're down to 118 users ... maybe the socio merging is off just a little somehow?

Abby-Wheelis commented 9 months ago

(screenshot: debug print output)

Adding more print statements revealed that there were 15 unique ids in the surveys, 14 unique ids in the trips, but then only 13 once they were merged. That continues for the other programs, dropping just 1-2 users per program. I wonder how that could be happening?

1. ghost users - people that logged in and filled a survey but never took any trips
2. errors in the merging code ... I feel like if there were id-matching issues the gap would be more than the 5ish total users

Best guess would be ghost users, but keeping this in mind.
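One way to distinguish the two cases, sketched with made-up ids: an outer merge with `indicator=True` shows exactly which side each user falls out of, instead of an inner merge silently dropping them.

```python
import pandas as pd

trips = pd.DataFrame({"user_id": ["u1", "u2", "u3"]})
surveys = pd.DataFrame({"user_id": ["u1", "u2", "u4"]})  # u4: survey, no trips

# indicator=True adds a _merge column saying which input(s) each row came from
check = trips.drop_duplicates("user_id").merge(
    surveys.drop_duplicates("user_id"),
    on="user_id", how="outer", indicator=True)

# "right_only" = survey but no trips (the ghost-user hypothesis)
ghost_users = check.loc[check["_merge"] == "right_only", "user_id"]
# "left_only" = trips but no survey (the more worrying case)
missing_surveys = check.loc[check["_merge"] == "left_only", "user_id"]
```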

Abby-Wheelis commented 9 months ago

Wait, ok, so the bigger concern is the dip between the number of users that have trips and the number of users that have trips and a survey -- you can't enter the app without filling out a survey, even if you say "wish not to say" for all of the responses, we still have a survey record for you.

@shankari is this correct? If so, then we might have a bigger problem, because in some places, just reading in the csvs, there are fewer entries in the surveys csv than there are unique users in the trip dataset. I don't feel like that should be possible, should it?

(screenshot: survey vs. trip user counts)

This is saying there were 12 unique users in Vail and 11 entries in the survey list, down to 9 after deduplication, so the number of Vail users dropped from 12 to 9 because 3 did not have a survey entry.

Abby-Wheelis commented 9 months ago

Yesterday I discovered that (at least part of) the problem was leading/trailing whitespace on some of the userids; fixing that got rid of the problem where I was randomly dropping users in the trip-survey merging process. I'm still working with the TSDC and my data cleaning scripts to verify that the process for preparing the TSDC data is equivalent to the one we used in the paper.
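The whitespace fix can be as simple as normalizing ids on both sides before merging (column names below are hypothetical):

```python
import pandas as pd

# " u2" and "u1 " would silently fail to match "u2" and "u1" in a merge
trips = pd.DataFrame({"user_id": ["u1 ", " u2"], "miles": [1.0, 2.0]})
surveys = pd.DataFrame({"user_id": ["u1", "u2"], "income": ["a", "b"]})

# strip leading/trailing whitespace from the join key on both sides
for df in (trips, surveys):
    df["user_id"] = df["user_id"].str.strip()

merged = trips.merge(surveys, on="user_id", how="inner")
```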

Abby-Wheelis commented 8 months ago

New status update heading into the holidays:

My TSDC data work file takes in the files that the TSDC will provide and outputs similar numbers (same number of users, off by around 1,000 trips). The TSDC data does still have some issues that I can see:

My Analysis script has almost all of the charts generated now:

Left to address:

Abby-Wheelis commented 8 months ago

Update on the TSDC data: after some back and forth, the TSDC has been able to provide me with a dataset which, after following the same filtering and cleaning procedure we used for the paper, produces ALMOST the exact same dataset. There are the exact same number of users, but 3 extra trips. Other things of note:

Other than the spatial charts, two time charts that are messed up, and the extra income brackets (and the ramifications of that data in other charts), things are lining up almost exactly. I'd say we're up to at least 85% reproducible with the code I have checked in.

As far as structure goes, I think we need to make a decision before we commit this - I currently have "Abby" and "Cemal" folders, since much of what I ended up doing was copying Cemal's work out of his various files and into my own files to do things like contain data cleaning to its own file, and then I have other files for the energy, spatial, and general analysis code -- so there's really 4 files that are needed for reproducing the paper from the TSDC data. I'd lean towards only keeping the cleaned-up versions of the files we need for merge, but I wanted to open it for discussion, since there are parts of Cemal's files that did not make it into mine.

Abby-Wheelis commented 8 months ago

For the timeline charts, I was able to resolve the axis labels this morning, but the values still seem wrong: the total mileage appears much higher than what was put in the paper, as does the e-bike proportion. I'm unsure what's happening, but I am getting some warnings and will look into those next.

To demonstrate: blue is what I have now, red is in the paper


Update: fixed the warnings by narrowing down the columns before grouping and summing, instead of grouping, summing, and then narrowing (the warning was about summing non-numerical columns). However, this has not fixed the values.
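The warning fix looks roughly like this sketch (column names invented): select the grouping keys plus only the numeric column of interest before aggregating, so `sum()` never touches string columns.

```python
import pandas as pd

trips = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "date": ["2021-06-01", "2021-06-01", "2021-06-01"],
    "mode": ["E-bike", "Walk", "E-bike"],   # non-numeric column
    "distance_miles": [2.0, 1.0, 4.0],
})

# narrow to the needed columns first, *then* group and sum
daily = (trips[["user_id", "date", "distance_miles"]]
         .groupby(["user_id", "date"])
         .sum()
         .reset_index())
```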

My next hunch is that maybe we used the full data for these charts last time instead of just the labeled data? I could see where there might be a bias to label longer trips... or especially where people labeled e-bike trips a lot so the elevated e-bike proportion I'm seeing could be a response bias?

Using all the trips did not help; moving away from my filtered trip csv meant that outliers like long trips were once again included, yielding a chart like this: (screenshot)

Abby-Wheelis commented 8 months ago

Another remaining mystery is the Denver part of the spatial data: I'm only using data from Smart Commute, but the chart is still evaluating more areas (with little car travel) than the paper does.

(charts: TSDC Data Notebook vs. Paper)

But the good news is that spatial notebook runs with the TSDC data, and the rest of the programs are accurate!

Update: I thought I was only using Smart Commute data, but I was wrong. I have now updated the script to use only Smart Commute data for Denver, and things are much closer; see the updated charts above. The 0.5% difference is confusing, but doesn't feel significant enough to get wrapped up in chasing.

Abby-Wheelis commented 8 months ago

This is getting close to wrapping, remaining tasks:

Smaller details that we could look into if we have time:

Abby-Wheelis commented 8 months ago

the values still seem wrong, the total mileage appears much higher than what was put in the paper, as does the e-bike proportions, I'm unsure what's happening, but am getting some warnings, will look into those next.

Ah, ok, so I think I finally figured this out. I had a hunch that maybe we weren't counting "0" days for users, and that turned out to be exactly it: we were dropping all user-date combos where there were no trips (labeled trips, that is). When I modified the code to enforce using ALL combinations of users and dates, I reproduced the chart! Will update with the e-bike chart if I get that one working as well.
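A minimal sketch of enforcing all user/date combinations (not the actual notebook code; names are made up): build the full cross product of users and dates, and fill the missing days with 0 so no-travel days count toward the average instead of vanishing.

```python
import pandas as pd

# per-user daily totals, with no-travel days simply absent
daily = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "date": pd.to_datetime(["2021-06-01", "2021-06-02"]),
    "miles": [5.0, 3.0],
})

users = daily["user_id"].unique()
dates = pd.date_range("2021-06-01", "2021-06-02")

# reindex onto every (user, date) pair, filling missing days with 0 miles
full = (daily.set_index(["user_id", "date"])
        .reindex(pd.MultiIndex.from_product([users, dates],
                                            names=["user_id", "date"]),
                 fill_value=0.0)
        .reset_index())
```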

Abby-Wheelis commented 8 months ago

Trying the enforcement of all user and date combinations with e-bike trips is not the solution: what I'm getting is lower than it should be, and seems largely to just be the shape of the other chart ... so I'm not sure where I'm going wrong there.

shown here from top to bottom: with enforcing all combinations, without, and the copy used in the paper


That middle chart looks like the right shape ... but why is it so elevated?

Abby-Wheelis commented 8 months ago

Finally able to reproduce this one! As it turns out, we do need to count all combinations, but then drop the rows with no travel that day to get the correct averages! I'm guessing 0 e-bike miles / 0 total miles was counted as 1, which would explain the artificial elevation!
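Sketched out (with invented column names), the fix amounts to dropping zero-travel rows before averaging the share, since a 0/0 day is "no travel", not a 100% (or 0%) e-bike day:

```python
import pandas as pd

# per-user-per-day totals, including no-travel days kept as zeros
daily = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2"],
    "ebike_miles": [2.0, 0.0, 0.0, 4.0],
    "total_miles": [4.0, 0.0, 0.0, 4.0],
})

# drop the no-travel rows so 0/0 never contaminates the ratio
traveled = daily[daily["total_miles"] > 0].copy()
traveled["ebike_share"] = traveled["ebike_miles"] / traveled["total_miles"]
mean_share = traveled["ebike_share"].mean()
```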


Abby-Wheelis commented 8 months ago

Now reproducing all of the charts! There's a few quirks (see 2nd part of this comment) but overall in a good place with this.

Ready for review, particularly hoping for feedback wrt:

Abby-Wheelis commented 8 months ago

I have one more file that I want to update before we publish this, to get the more absolute paths out of my spatial analysis, but I can't get past import errors with geopandas that have popped up. I consistently get `ImportError: libtiff.so.5: cannot open shared object file: No such file or directory` coming from pyproj, which is a dependency of geopandas, and I have not been able to resolve it. I'm not sure what got updated recently, but it seems like that has to be what created a major compatibility issue, since I was able to use this notebook just a couple of weeks ago.

shankari commented 8 months ago

This looks like an FAQ with a couple of potential solutions https://github.com/search?q=+libtiff.so.5%3A+cannot+open+shared+object+file&type=code

Not sure why it suddenly broke unless you re-pulled from upstream and got the security upgrade to a more recent version of linux.

Abby-Wheelis commented 8 months ago

This looks like an FAQ with a couple of potential solutions https://github.com/search?q=+libtiff.so.5%3A+cannot+open+shared+object+file&type=code

Many of these suggest that I install libtiff5 manually with apt install libtiff5, but I don't have apt on my mac, and it seems the appropriate mac equivalent would be brew which I can't install without admin privileges.

I tried conda install anaconda::libtiff, which seems to be the baseline for installing libtiff with conda, but to no avail.

conda list shows that I now have geopandas 0.14.2, libtiff 4.6.0, and pyproj 3.6.1.

Reading the release notes for libtiff, it seems the most recent versions are dropping support for lots of features, so maybe I need to try an older libtiff version? I'll try different versions of libtiff, and potentially roll back my local version of this repo, until I can find a point where geopandas works, to identify what changed.

shankari commented 8 months ago

Many of these suggest that I install libtiff5 manually with apt install libtiff5, but I don't have apt on my mac, and it seems the appropriate mac equivalent would be brew which I can't install without admin privileges.

if you are running em-public-dashboard, you are running the scripts in a linux container. You should be able to use apt as root within the container.

Abby-Wheelis commented 8 months ago

if you are running em-public-dashboard, you are running the scripts in a linux container. You should be able to use apt as root within the container.

I had not realized that, it worked! Now over the geopandas hurdle, added a note in the notebook about the error and how to resolve, should be able to wrap up polishing this script now.

Abby-Wheelis commented 8 months ago

I was able to add a note about resolving that error, make the "paths as variables" update that I wanted to, and run the file without additional hiccups, should be good to go now!

shankari commented 7 months ago

@Abby-Wheelis I have a couple of comments to make this easier for me to review.

I've been holding this off since it is complex, and I hope to be able to finally merge it in the upcoming weeks

Abby-Wheelis commented 7 months ago

@shankari

Abby-Wheelis commented 7 months ago

I have a branch established now, with just my changes before I started working with the TSDC data: branch here

Let me know if it would be easiest for you if I submitted the "pre TSDC" branch as its own PR, and then this one could be treated as a follow-on!

Abby-Wheelis commented 6 months ago

@shankari do you want to check Cemal's code (what I based my code on) into the repository or should I remove it? I'll carry whatever we decide across both PRs to keep it consistent.

Abby-Wheelis commented 5 months ago

Large chunk of refactoring now done on this branch, based on commentary from @iantei on #118, this PR is now much smaller, with the elimination of older code and significant reduction in duplicate code. Changes fall in three folders: