cal-itp / data-analyses

A place for sharing quick reports and works in progress
https://analysis.calitp.org

Epic - HQTA v3 (rewrite for more dask usage) #497

Closed — tiffanychu90 closed this issue 1 year ago

tiffanychu90 commented 1 year ago

After receiving a research request, use this template to plan and track your work. Be sure to also add the appropriate project-level label to this issue (eg gtfs-rt, DLA).

Epic Information - HQTA v3

Summary

Research required:

Notes, misc:

Reviewers [Stakeholders]

1.

Issues

Deliverables

tiffanychu90 commented 1 year ago

@ian-r-rose: Can I pick your brain on how to use dask more? This is a pretty meaty rewrite of the workflow. Broadly speaking, I think the entire workflow still needs to undergo the following iterations:

  1. Incorporate parallelization / partitioning into a particular stage of the data wrangling (clipping line geometries). This stage is covered in 1-2 scripts. Is this a place where dask futures can be used?
  2. Zooming out to the workflow as a whole: how can I distribute work across workers, run the scripts sequentially, and balance worker loads?
  3. Distribute across workers, and maybe have some scripts run in parallel? Some data downloading steps can be done in parallel, while other scripts are more sequential in nature and depend on earlier steps. Can I separate these and incorporate the scheduler, delayed, persist, and futures concepts?

ian-r-rose commented 1 year ago

:wave: Hi @tiffanychu90! I would avoid explicit manipulation of distributed.Futures to start. If you are using a distributed cluster (it sounds like you are?) then they will be involved, but I'd suggest starting with higher level APIs like dask.dataframe or dask.delayed.

If things are already in a dask-geopandas dataframe, then I'd suggest keeping things that way. You can do custom processing of individual partitions using df.map_partitions(some_custom_clipping_fn). If things are not already in a dataframe, and you just want to run some pre-processing functions, then I'd suggest wrapping your functions with dask.delayed to execute them on the cluster.

In general, it's not your responsibility to decide how to distribute work across the workers. You decide what tasks to run, and what dependencies they have, and then the scheduler decides how to execute them. If they are embarrassingly parallel, then it should evenly distribute them. If some tasks depend on some others, it will be a bit more complex, but it still tries to make sure that work is well-distributed.
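This division of labor can be illustrated with a small `dask.delayed` task graph. The function names are hypothetical stand-ins for the download/process steps in the HQTA workflow; note that the code only declares the tasks and their dependencies, and the scheduler decides where and when each runs:

```python
import dask

@dask.delayed
def download(i: int) -> list:
    # Stand-in for an independent data download
    return list(range(i))

@dask.delayed
def process(data: list) -> int:
    # Depends on exactly one download
    return sum(data)

@dask.delayed
def combine(totals: list) -> int:
    # Depends on all process tasks
    return sum(totals)

downloads = [download(i) for i in range(1, 4)]  # embarrassingly parallel
totals = [process(d) for d in downloads]        # one dependency each
result = combine(totals)                        # fan-in step

total = result.compute()  # scheduler distributes the whole graph
```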

persist certainly can be useful, but mostly if a large number of downstream tasks all depend on some pre-processing step, and you want to make sure that that step isn't cleaned up.

I could probably get a bit more specific if you have some pseudo-code to share.

tiffanychu90 commented 1 year ago

@ian-r-rose: I'm going to digest this and tackle incorporating dask.delayed and df.map_partitions and getting it to run successfully on the distributed cluster. Thanks for the very helpful tips!

tiffanychu90 commented 1 year ago

@ian-r-rose: (feel free to tackle this during the work week!) Do you have feedback on how to combine map_partitions with loops? I haven't implemented dask.delayed yet, but I want to try to figure out how to parallelize some of the loops first.

I tried 2 methods in this notebook.

tiffanychu90 commented 1 year ago

[comparison attachments: hqta_v2, hqta_v3]