chihacknight / chn-ghost-buses

"Ghost buses" analysis project through Chi Hack Night
https://github.com/chihacknight/breakout-groups/issues/217
MIT License
19 stars 14 forks source link

Refactor and add incremental workflow functionality. #78

Open haileyplusplus opened 5 months ago

haileyplusplus commented 5 months ago

Description

Add the ability to incrementally update data, refactoring the processing workflow to make this feasible.

This refactors the existing logic into multiple new classes and splits realtime and schedule processing as much as possible. Included are a few other features:

Resolves # [issue]

Type of change

How has this been tested?

Manually

haileyplusplus commented 5 months ago

Some documentation polish is still needed.

lauriemerrell commented 5 months ago

Thank you so much for taking this on, Hailey! Excited to talk about it in more detail tomorrow -- I skimmed today but definitely think an overview would be helpful since there's so much here.

Just wanted to respond to one note in the description for context:

In addition, this demonstrates but does not fix an existing bug in the processing logic that causes some days on schedule version boundaries to be dropped from output. See the new utils/show_missing_days.py.

This was actually intended behavior because we decided we weren't sure which schedule should be used on boundary dates. We can just pick one if we want, but this was intentional.

haileyplusplus commented 5 months ago

Thank you so much for taking this on, Hailey! Excited to talk about it in more detail tomorrow -- I skimmed today but definitely think an overview would be helpful since there's so much here.

Just wanted to respond to one note in the description for context:

In addition, this demonstrates but does not fix an existing bug in the processing logic that causes some days on schedule version boundaries to be dropped from output. See the new utils/show_missing_days.py.

This was actually intended behavior because we decided we weren't sure which schedule should be used on boundary dates. We can just pick one if we want, but this was intentional.

Thanks, I hadn't realized that. Good to know.

lauriemerrell commented 5 months ago

Reviewer TODO to myself: check how the date boundary looks with the change in logic

haileyplusplus commented 5 months ago

The download cache part of this is now in PR #80

haileyplusplus commented 4 months ago

Thank you so much for taking this on, Hailey! Excited to talk about it in more detail tomorrow -- I skimmed today but definitely think an overview would be helpful since there's so much here. Just wanted to respond to one note in the description for context:

In addition, this demonstrates but does not fix an existing bug in the processing logic that causes some days on schedule version boundaries to be dropped from output. See the new utils/show_missing_days.py.

This was actually intended behavior because we decided we weren't sure which schedule should be used on boundary dates. We can just pick one if we want, but this was intentional.

Thanks, I hadn't realized that. Good to know.

I decided to keep the existing boundary behavior here. I updated the PR so that the new code that pulls scraped schedules from S3 behaves consistently.

haileyplusplus commented 4 months ago

Ready for review now! Comments welcome.

This is now basically a superset of PR #80, so if desired you could just review this.

haileyplusplus commented 4 months ago

For easier reviewing, try git diff origin/main --color-moved-ws=allow-indentation-change --color-moved=blocks

haileyplusplus commented 3 months ago

Thank you again for tackling! A few questions.... Was mostly testing by trying to run things locally. Is there an order that I need to follow?

Specifically, when I try to run update_data, I get the following error:

...

Cannot tell if I was supposed to run something to initialize whatever is causing the KeyError?

update_data without arguments is broken for me too. I will work on fixing it. In the meantime, you can test the incremental workflow by pointing to the existing frontend status file with something like this:

python3 -m update_data --update ../ghost-buses-frontend/src/Routes/schedule_vs_realtime_all_day_types_routes.json

haileyplusplus commented 3 months ago

Thank you again for tackling! A few questions.... Was mostly testing by trying to run things locally. Is there an order that I need to follow? Specifically, when I try to run update_data, I get the following error:

...

Cannot tell if I was supposed to run something to initialize whatever is causing the KeyError?

update_data without arguments is broken for me too. I will work on fixing it. In the meantime, you can test the incremental workflow by pointing to the existing frontend status file with something like this:

python3 -m update_data --update ../ghost-buses-frontend/src/Routes/schedule_vs_realtime_all_day_types_routes.json

It was a simple fix for the non-arguments version, which I just committed. It should work now.

haileyplusplus commented 3 months ago

Ok, I've updated the readme and addressed open comments. I think this is ready for another round of testing and viewing. Please let me know what you think about deferring the transitfeeds changes.