e-mission / em-public-dashboard

A simple and stupid public dashboard prototype.
BSD 3-Clause "New" or "Revised" License

Paper Reproducibility Changes #102

Open Abby-Wheelis opened 9 months ago

Abby-Wheelis commented 9 months ago

As I am going through the charts in the paper to polish them up, I am also taking the time to organize, document, and check in the code used to produce those results. This maintains transparency for future researchers who might want to reproduce our results.

Added a DataFiltering Notebook:

Analysis Notebooks: planning for one with non-spatial data and one to work with spatial data; will update as I make these changes, as the plan may change depending on the data formats.

Another note: much of this code is coming from a previous researcher who worked on the paper, Cemal Akcicek. My work is focused on organizing and polishing.

Abby-Wheelis commented 9 months ago

This is ending up being VERY tricky and confusing. The goal is to have the results and charts we show in the paper be 100% reproducible from the TSDC data -- open source data and script to allow for full transparency and reproducibility. This will not only benefit the credibility of this paper, but will hopefully lay the groundwork to make analysis of other, future, OpenPATH programs archived in the TSDC easy and accessible.

However, the data originally used to generate the paper is not the same as what the TSDC will be providing. The column names in the csvs are almost all different, and the TSDC is redacting a fair bit of information. The problematic columns being redacted (so far) include Age (used for some analysis of the effect of age on e-bike usage) and trip timestamps (a key datapoint in calculating when someone's first e-bike trip was, and then cleaning out all data before that point).

I've been spinning my wheels for a little more than a week now, trying to reverse engineer a way to get the data from the TSDC cleaned and filtered in such a way that it matches the process we outline in the paper and the numbers it yielded. I'm hoping it will help to write out my thought processes a little more; I've been doing this in a somewhat scattered manner but should really include it here.

If I'm able to get the data to "match", the next hurdle is that a fair number of the charts rely on the data being loaded into the database, which I personally can't do because my computer can't process that much data at once without completely wigging out. Even if I could (there are some parts of the dataset that I won't need for this highish-level analysis, so I could clean them out and try loading a smaller subset of data), that's not the format the TSDC is providing. So I'll need to either A) reconcile the chart-generating tactics with the different data sourcing, or B) wrestle the "matched" data into a zipped file compatible with the database loading.

I'll keep updating here with my thoughts and progress as I have / continue to try different things

Abby-Wheelis commented 9 months ago

One issue I've encountered is the lack of Mode_confirm, etc columns in the TSDC data. @shankari pointed out that these columns come from the dictionaries in viz_scripts/auxillary_files and are leveraged by viz_scripts/scaffolding.py. I examined the code in scaffolding.py and ended up scraping the following lines from that file in order to use them in the data cleaning process I'm trying to develop:

```python
# first, add the cleaned mode
data['Mode_confirm'] = data['data_user_input_mode_confirm'].map(dic_re)

# second, add the cleaned replaced mode (ASSUMES PROGRAM)
data['Replaced_mode'] = data['data_user_input_replaced_mode'].map(dic_re)

# third, add the cleaned purpose
data['Trip_purpose'] = data['data_user_input_purpose_confirm'].map(dic_pur)
```

I did not just use the functions in scaffolding.py because they assume the database paradigm, and I am working from csvs.
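To illustrate the csv-side approach, here is a minimal, self-contained sketch of applying those mappings to a csv-backed dataframe. The dictionary contents and column values below are made-up stand-ins; the real `dic_re` / `dic_pur` live in `viz_scripts/auxillary_files`:

```python
import pandas as pd

# Hypothetical stand-ins for the dictionaries loaded from
# viz_scripts/auxillary_files; the real keys/values live there.
dic_re = {"pilot_ebike": "E-bike", "drove_alone": "Gas Car, drove alone"}
dic_pur = {"work": "Work", "shopping": "Shopping"}

# In practice this frame would come from pd.read_csv on a TSDC export
data = pd.DataFrame({
    "data_user_input_mode_confirm": ["pilot_ebike", "drove_alone"],
    "data_user_input_replaced_mode": ["drove_alone", "pilot_ebike"],
    "data_user_input_purpose_confirm": ["work", "shopping"],
})

# Same mapping logic as the scaffolding.py lines above, applied to csvs
data["Mode_confirm"] = data["data_user_input_mode_confirm"].map(dic_re)
data["Replaced_mode"] = data["data_user_input_replaced_mode"].map(dic_re)
data["Trip_purpose"] = data["data_user_input_purpose_confirm"].map(dic_pur)
```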

shankari commented 9 months ago

I am fine with this for now, but we should revisit this whole mapping when we re-do the energy/emissions work.

In general, we should implement scaffolding as an abstract class with two concrete subclasses. We already use something similar for the abstract timeseries, and since we use the base image anyway, we can also switch to using that standard approach and adding a new implementation instead of having 5 different implementations for data access.

Abstractions FTW!
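One possible shape for that abstraction, as a sketch only (class and method names here are invented, not the actual implementation):

```python
from abc import ABC, abstractmethod

import pandas as pd


class ScaffoldingBase(ABC):
    """One interface for trip access, regardless of the backing store."""

    @abstractmethod
    def load_confirmed_trips(self, program: str) -> pd.DataFrame:
        ...


class DatabaseScaffolding(ScaffoldingBase):
    def load_confirmed_trips(self, program):
        # would query the timeseries database, as scaffolding.py does today
        raise NotImplementedError


class CsvScaffolding(ScaffoldingBase):
    def __init__(self, base_dir):
        self.base_dir = base_dir

    def load_confirmed_trips(self, program):
        # hypothetical file layout for the TSDC csv exports
        return pd.read_csv(f"{self.base_dir}/{program}_confirmed_trips.csv")
```

Chart code would then depend only on `ScaffoldingBase`, so the database and csv paths stay interchangeable.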

Abby-Wheelis commented 9 months ago

Another issue is the numbers not matching up. When I read in the TSDC data, I read it in program by program, matching the confirmed trips with the sociodemographic data, then concatenating the merged data into the dataframe so that I end up with all of the merged data together. But the number of users in this dataset is less than I would expect.

As I accumulate the data, the programs have 13, 47, 29, 14, 14, and 9 users, for a total of 126. That seems a little low for having only merged the sociodemographic data and done no other cleaning, but it is still > 122, so OK. However, after all the data is put together there are only 112 unique IDs. This is a problem.

I'm worried that some of the users in different programs ended up with the same random ID as a result of the data cleaning process. I'm going to work on a way to prevent this - maybe appending the program name to the beginning of the id before I add that program to all the other programs.
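A minimal sketch of that prefixing idea (column and program names here are hypothetical):

```python
import pandas as pd

def tag_user_ids(df, program):
    # prefix the program name so per-program random ids can't collide
    # when the programs are concatenated together
    df = df.copy()
    df["user_id"] = program + "_" + df["user_id"].astype(str)
    return df

# two programs that happened to assign the same random id
vail = pd.DataFrame({"user_id": ["abc123"], "distance": [1.2]})
sc = pd.DataFrame({"user_id": ["abc123"], "distance": [3.4]})

all_trips = pd.concat([tag_user_ids(vail, "vail"), tag_user_ids(sc, "sc")])
# two distinct users even though the raw ids collided
```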

Abby-Wheelis commented 9 months ago

I'm worried that some of the users in different programs ended up with the same random ID

Yep, I think this is exactly what happened. After adding the program name to the id, the number of unique users after compiling the programs went from 112 to the accurate 126. One step closer to finding the right data cleaning process!

BUT after just part of the filtering we're down to 118 users ... maybe the socio merging is off just a little somehow?

Abby-Wheelis commented 9 months ago

(screenshot: debug print output)

Adding more print statements revealed that there were 15 unique ids in the surveys, 14 unique ids in the trips, but then only 13 once they were merged. That continues for the other programs, dropping just 1-2 users per program. I wonder how that could be happening?

1. ghost users - people that logged in and filled a survey but never took any trips
2. errors in the merging code ... I feel like if there were id-matching issues the gap would be more than the 5ish total users

Best guess would be ghost users, but keeping this in mind.
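One way to distinguish the two cases, sketched with made-up ids: an outer merge with `indicator=True` shows exactly which side each user falls out of, instead of an inner merge silently dropping them.

```python
import pandas as pd

trips = pd.DataFrame({"user_id": ["u1", "u2", "u3"]})
surveys = pd.DataFrame({"user_id": ["u1", "u2", "u4"]})  # u4: survey, no trips

# indicator=True adds a _merge column saying which input(s) each row came from
check = trips.drop_duplicates("user_id").merge(
    surveys.drop_duplicates("user_id"),
    on="user_id", how="outer", indicator=True)

# "right_only" = survey but no trips (the ghost-user hypothesis)
ghost_users = check.loc[check["_merge"] == "right_only", "user_id"]
# "left_only" = trips but no survey (the more worrying case)
missing_surveys = check.loc[check["_merge"] == "left_only", "user_id"]
```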

Abby-Wheelis commented 9 months ago

Wait, ok, so the bigger concern is the dip between the number of users that have trips and the number of users that have trips and a survey -- you can't enter the app without filling out a survey, even if you say "wish not to say" for all of the responses, we still have a survey record for you.

@shankari is this correct? If so, then we might have a bigger problem, because in some places, just reading in the csvs, there are fewer entries in the surveys csv than there are unique users in the trip dataset. I don't feel like that should be possible, should it?

(screenshot: survey vs. trip user counts)

This is saying there were 12 unique users in Vail and 11 entries in the survey list, down to 9 after deduplication, so the number of Vail users dropped from 12 to 9 because 3 did not have a survey entry.

Abby-Wheelis commented 9 months ago

Yesterday I discovered that (at least part of) the problem was leading/trailing whitespace on some of the userids; fixing that got rid of the problem where I was randomly dropping users in the trip-survey merging process. I'm still working with the TSDC and my data cleaning scripts to verify that the process for preparing the TSDC data is equivalent to the one we used in the paper.
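The whitespace fix can be as simple as normalizing ids on both sides before merging (column names below are hypothetical):

```python
import pandas as pd

# " u2" and "u1 " would silently fail to match "u2" and "u1" in a merge
trips = pd.DataFrame({"user_id": ["u1 ", " u2"], "miles": [1.0, 2.0]})
surveys = pd.DataFrame({"user_id": ["u1", "u2"], "income": ["a", "b"]})

# strip leading/trailing whitespace from the join key on both sides
for df in (trips, surveys):
    df["user_id"] = df["user_id"].str.strip()

merged = trips.merge(surveys, on="user_id", how="inner")
```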

Abby-Wheelis commented 8 months ago

New status update heading into the holidays:

My TSDC data work file takes in the files that the TSDC will provide and outputs similar numbers (same number of users, off by around 1,000 trips). The TSDC data does still have some issues that I can see:

My Analysis script has almost all of the charts generated now:

Left to address:

Abby-Wheelis commented 8 months ago

Update on the TSDC data: after some back and forth, the TSDC has been able to provide me with a dataset which, after following the same filtering and cleaning procedure we used for the paper, produces ALMOST the exact same dataset. There are the exact same number of users, but 3 extra trips. Other things of note:

Other than the spatial charts, two time charts that are messed up, and the extra income brackets (and the ramifications of that data in other charts), things are lining up almost exactly. I'd say we're up to at least 85% reproducible with the code I have checked in.

As far as structure goes, I think we need to make a decision before we commit this - I currently have "Abby" and "Cemal" folders, since much of what I ended up doing was copying Cemal's work out of his various files and into my own files to do things like contain data cleaning to its own file, and then I have other files for the energy, spatial, and general analysis code -- so there's really 4 files that are needed for reproducing the paper from the TSDC data. I'd lean towards only keeping the cleaned-up versions of the files we need for merge, but I wanted to open it for discussion, since there are parts of Cemal's files that did not make it into mine.

Abby-Wheelis commented 8 months ago

For the timeline charts, I was able to resolve the axis labels this morning, but the values still seem wrong: the total mileage appears much higher than what was put in the paper, as does the e-bike proportion. I'm unsure what's happening, but I am getting some warnings and will look into those next.

To demonstrate: blue is what I have now, red is in the paper


Update: fixed the warnings by narrowing down the columns before grouping and summing, instead of grouping, summing, and then narrowing (the warning was about summing non-numerical columns). However, this has not fixed the values.
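The warning fix looks roughly like this sketch (column names invented): select the grouping keys plus only the numeric column of interest before aggregating, so `sum()` never touches string columns.

```python
import pandas as pd

trips = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "date": ["2021-06-01", "2021-06-01", "2021-06-01"],
    "mode": ["E-bike", "Walk", "E-bike"],   # non-numeric column
    "distance_miles": [2.0, 1.0, 4.0],
})

# narrow to the needed columns first, *then* group and sum
daily = (trips[["user_id", "date", "distance_miles"]]
         .groupby(["user_id", "date"])
         .sum()
         .reset_index())
```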

My next hunch is that maybe we used the full data for these charts last time instead of just the labeled data? I could see where there might be a bias to label longer trips... or especially where people labeled e-bike trips a lot so the elevated e-bike proportion I'm seeing could be a response bias?

Using all the trips did not help; moving away from my filtered trip csv meant that outliers like long trips were once again included, yielding a chart like this: (screenshot)

Abby-Wheelis commented 8 months ago

Another remaining mystery is the Denver part of the spatial data: I'm only using data from Smart Commute, but the chart is still evaluating more areas (with little car travel) than the paper does.

(charts: TSDC Data Notebook vs. Paper)

But the good news is that spatial notebook runs with the TSDC data, and the rest of the programs are accurate!

Update: I thought I was only using Smart Commute data, but I was wrong. I have now updated the script to use only Smart Commute data for Denver, and things are much closer; see the updated charts above. The 0.5% difference is confusing, but doesn't feel significant enough to get wrapped up in chasing.

Abby-Wheelis commented 8 months ago

This is getting close to wrapping, remaining tasks:

Smaller details that we could look into if we have time:

Abby-Wheelis commented 8 months ago

the values still seem wrong, the total mileage appears much higher than what was put in the paper, as does the e-bike proportions, I'm unsure what's happening, but am getting some warnings, will look into those next.

Ah, ok, so I think I finally figured this out. I had a hunch that maybe we weren't counting "0" days for users, and that turned out to be exactly it: we were dropping all user-date combos where there were no trips (labeled trips, that is). When I modified the code to enforce using ALL combinations of users and dates, I reproduced the chart! Will update with the e-bike chart if I get that one working as well.
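A minimal sketch of enforcing all user/date combinations (not the actual notebook code; names are made up): build the full cross product of users and dates, and fill the missing days with 0 so no-travel days count toward the average instead of vanishing.

```python
import pandas as pd

# per-user daily totals, with no-travel days simply absent
daily = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "date": pd.to_datetime(["2021-06-01", "2021-06-02"]),
    "miles": [5.0, 3.0],
})

users = daily["user_id"].unique()
dates = pd.date_range("2021-06-01", "2021-06-02")

# reindex onto every (user, date) pair, filling missing days with 0 miles
full = (daily.set_index(["user_id", "date"])
        .reindex(pd.MultiIndex.from_product([users, dates],
                                            names=["user_id", "date"]),
                 fill_value=0.0)
        .reset_index())
```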

Abby-Wheelis commented 8 months ago

Trying the enforcement of all user and date combinations with e-bike trips is not the solution: what I'm getting is lower than it should be, and seems largely to just be the shape of the other chart ... so I'm not sure where I'm going wrong there.

shown here from top to bottom: with enforcing all combinations, without, and the copy used in the paper


That middle chart looks like the right shape ... but why is it so elevated?

Abby-Wheelis commented 8 months ago

Finally able to reproduce this one! As it turns out, we do need to count all combinations, but then drop the rows with no travel that day to get the correct averages! I'm guessing 0 e-bike miles / 0 total miles was counted as 1, which would explain the artificial elevation!
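Sketched out (with invented column names), the fix amounts to dropping zero-travel rows before averaging the share, since a 0/0 day is "no travel", not a 100% (or 0%) e-bike day:

```python
import pandas as pd

# per-user-per-day totals, including no-travel days kept as zeros
daily = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2"],
    "ebike_miles": [2.0, 0.0, 0.0, 4.0],
    "total_miles": [4.0, 0.0, 0.0, 4.0],
})

# drop the no-travel rows so 0/0 never contaminates the ratio
traveled = daily[daily["total_miles"] > 0].copy()
traveled["ebike_share"] = traveled["ebike_miles"] / traveled["total_miles"]
mean_share = traveled["ebike_share"].mean()
```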


Abby-Wheelis commented 8 months ago

Now reproducing all of the charts! There's a few quirks (see 2nd part of this comment) but overall in a good place with this.

Ready for review, particularly hoping for feedback wrt:

Abby-Wheelis commented 8 months ago

I have one more file that I want to update before we publish this, to get the more absolute paths out of my spatial analysis, but I can't get past import errors with geopandas that have popped up. I consistently get `ImportError: libtiff.so.5: cannot open shared object file: No such file or directory` coming from pyproj, which is a dependency of geopandas, and I have not been able to resolve it. I'm not sure what got updated recently, but it seems like that has to be what created a major compatibility issue, since I was able to use this notebook just a couple of weeks ago.

shankari commented 8 months ago

This looks like an FAQ with a couple of potential solutions https://github.com/search?q=+libtiff.so.5%3A+cannot+open+shared+object+file&type=code

Not sure why it suddenly broke unless you re-pulled from upstream and got the security upgrade to a more recent version of linux.

Abby-Wheelis commented 8 months ago

This looks like an FAQ with a couple of potential solutions https://github.com/search?q=+libtiff.so.5%3A+cannot+open+shared+object+file&type=code

Many of these suggest that I install libtiff5 manually with apt install libtiff5, but I don't have apt on my mac, and it seems the appropriate mac equivalent would be brew which I can't install without admin privileges.

I tried conda install anaconda::libtiff, which seems to be the baseline for installing libtiff with conda, but to no avail.

conda list shows that I now have geopandas 0.14.2, libtiff 4.6.0, and pyproj 3.6.1.

Reading the release notes for libtiff, it seems the most recent versions are dropping support for lots of features, so maybe I need to try an older libtiff version? I'll try different versions of libtiff, and potentially roll back my local version of this repo, until I can find a point where geopandas works, to identify what changed.

shankari commented 8 months ago

Many of these suggest that I install libtiff5 manually with apt install libtiff5, but I don't have apt on my mac, and it seems the appropriate mac equivalent would be brew which I can't install without admin privileges.

if you are running em-public-dashboard, you are running the scripts in a linux container. You should be able to use apt as root within the container.

Abby-Wheelis commented 8 months ago

if you are running em-public-dashboard, you are running the scripts in a linux container. You should be able to use apt as root within the container.

I had not realized that, it worked! Now over the geopandas hurdle, added a note in the notebook about the error and how to resolve, should be able to wrap up polishing this script now.

Abby-Wheelis commented 8 months ago

I was able to add a note about resolving that error, make the "paths as variables" update that I wanted to, and run the file without additional hiccups, should be good to go now!

shankari commented 7 months ago

@Abby-Wheelis I have a couple of comments to make this easier for me to review.

I've been holding this off since it is complex, and I hope to be able to finally merge it in the upcoming weeks

Abby-Wheelis commented 7 months ago

@shankari

Abby-Wheelis commented 7 months ago

I have a branch established now, with just my changes before I started working with the TSDC data: branch here

Let me know if it would be easiest for you if I submitted the "pre TSDC" branch as its own PR, and then this one could be treated as a follow-on!

Abby-Wheelis commented 6 months ago

@shankari do you want to check Cemal's code (what I based my code on) into the repository or should I remove it? I'll carry whatever we decide across both PRs to keep it consistent.

Abby-Wheelis commented 5 months ago

Large chunk of refactoring now done on this branch, based on commentary from @iantei on #118, this PR is now much smaller, with the elimination of older code and significant reduction in duplicate code. Changes fall in three folders: