cal-itp / data-analyses

A place for sharing quick reports and works in progress
https://analysis.calitp.org

Epic - HQTA v3 (rewrite for more dask usage) #497

Closed — tiffanychu90 closed this issue 1 year ago

tiffanychu90 commented 1 year ago

After receiving a research request, use this template to plan and track your work. Be sure to also add the appropriate project-level label to this issue (eg gtfs-rt, DLA).

Epic Information - HQTA v3

Summary

Research required:

Notes, misc:

Reviewers [Stakeholders]

1.

Issues

Deliverables

tiffanychu90 commented 1 year ago

@ian-r-rose: Can I pick your brain on how to use dask more? This is a pretty meaty rewrite of the workflow. Broadly speaking, I think the entire workflow still needs to undergo the following iterations:

  1. Incorporate parallelization / partitioning into a particular stage of the data wrangling (clipping line geometries). This stage is covered in 1-2 scripts. Is this a place where dask futures can be used?
  2. Zooming out to the workflow as a whole: how can I distribute work across workers, run the scripts sequentially, and balance worker loads?
  3. Distribute across workers, and maybe have some scripts run in parallel? Some data downloading steps can be done in parallel, while other scripts are more sequential in nature and depend on earlier steps. Can I separate these and incorporate the scheduler, delayed, persist, and futures concepts?

ian-r-rose commented 1 year ago

:wave: Hi @tiffanychu90! I would avoid explicit manipulation of distributed.Futures to start. If you are using a distributed cluster (it sounds like you are?) then they will be involved, but I'd suggest starting with higher level APIs like dask.dataframe or dask.delayed.

If things are already in a dask-geopandas dataframe, then I'd suggest keeping things that way. You can do custom processing of individual partitions using df.map_partitions(some_custom_clipping_fn). If things are not already in a dataframe, and you just want to run some pre-processing functions, then I'd suggest wrapping your functions with dask.delayed to execute them on the cluster.

In general, it's not your responsibility to decide how to distribute work across the workers. You decide what tasks to run, and what dependencies they have, and then the scheduler decides how to execute them. If they are embarrassingly parallel, then it should evenly distribute them. If some tasks depend on some others, it will be a bit more complex, but it still tries to make sure that work is well-distributed.
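This division of labor can be illustrated with a small `dask.delayed` task graph. The function names are hypothetical stand-ins for the download/process steps in the HQTA workflow; note that the code only declares the tasks and their dependencies, and the scheduler decides where and when each runs:

```python
import dask

@dask.delayed
def download(i: int) -> list:
    # Stand-in for an independent data download
    return list(range(i))

@dask.delayed
def process(data: list) -> int:
    # Depends on exactly one download
    return sum(data)

@dask.delayed
def combine(totals: list) -> int:
    # Depends on all process tasks
    return sum(totals)

downloads = [download(i) for i in range(1, 4)]  # embarrassingly parallel
totals = [process(d) for d in downloads]        # one dependency each
result = combine(totals)                        # fan-in step

total = result.compute()  # scheduler distributes the whole graph
```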

persist certainly can be useful, but mostly if a large number of downstream tasks all depend on some pre-processing step, and you want to make sure that that step isn't cleaned up.

I could probably get a bit more specific if you have some pseudo-code to share.

tiffanychu90 commented 1 year ago

@ian-r-rose: I'm going to digest this and tackle incorporating dask.delayed and df.map_partitions and getting it to run successfully on the distributed cluster. Thanks for the very helpful tips!

tiffanychu90 commented 1 year ago

@ian-r-rose: (feel free to tackle this during the work week!) Do you have feedback on how to combine map_partitions with loops? I haven't implemented dask.delayed yet, but I want to try to figure out how to parallelize some of the loops first.

I tried 2 methods in this notebook.

tiffanychu90 commented 1 year ago

[comparison attachments: hqta_v2, hqta_v3]