fornax-navo / fornax-demo-notebooks

Demo notebooks for the Fornax project
https://fornax-navo.github.io/fornax-demo-notebooks/

GSoC2024: Dask Starter Problem #240

troyraen opened this issue 3 months ago (status: Open)

troyraen commented 3 months ago

Dask Starter Problem

This is a toy problem meant as a pre-application exercise for the GSoC project "Enable Dask execution of NASA time-domain analysis".

Estimated completion time: a few hours

Please complete all required tasks and whichever optional task(s) allow you to convey your thought process most easily. The goal is to see what your starting point is, not what you could do with several days of research and optimization.

Overview

The gsoc/2024/dask-toy-problem branch contains a directory called gsoc-dask-toy-problem/ with one file, archives.py. The file contains four public functions, get_*_lightcurves. These functions vary in their resource usage and runtimes. They take at least one argument, num_sample, which may vary between about 5 and 500,000.

The task is to parallelize these functions using Dask so that they run efficiently for large sample sizes. A basic script to run the functions serially looks like this:

import archives  # be sure to import this **first**
import pandas as pd

num_sample = 100  # may vary between about 5 and 500,000

gaia_df = archives.get_gaia_lightcurves(num_sample)
heasarc_df = archives.get_heasarc_lightcurves(num_sample)
wise_df = archives.get_wise_lightcurves(num_sample)
ztf_df = archives.get_ztf_lightcurves(num_sample)

lightcurves_df = pd.concat([gaia_df, heasarc_df, wise_df, ztf_df])

Instructions

  1. Clone this repo and check out the branch gsoc/2024/dask-toy-problem.
  2. Write code.
    • Required: Start a Dask cluster and stop it when finished. A local cluster is sufficient.
    • Optional: Execute the four archives functions and concatenate the results into a single Pandas DataFrame. Use any Dask method(s) you want; a sketch of one possible shape follows this list. Your code should complete at least as fast as the serial script above. You can look at the code in archives.py, but do not alter it.
  3. Optional: Write text. 300 words max, included as a '.md' file.
    • For any code that you did not write but would if you had more time, write down what you would do.
    • Write down any questions you have that, if answered, might help guide your design choices.
    • Test your code to determine how it scales as num_sample increases. Write down the results, your interpretation of the results, and/or what you would try next to see if it improves your code.
  4. Required: Open a PR with your code and (optional) writeup to merge into the gsoc/2024/dask-toy-problem branch. Include GSoC2024 in the title of your PR.
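
For reference, here is a minimal sketch of one possible shape for a solution, assuming dask.distributed's LocalCluster and dask.delayed. The task structure and variable names are illustrative only; any other Dask method is equally valid:

import archives  # be sure to import this **first**
import dask
import pandas as pd
from dask.distributed import Client, LocalCluster

num_sample = 100  # may vary between about 5 and 500,000

# Required: start a local cluster (and stop it when finished).
cluster = LocalCluster()
client = Client(cluster)

try:
    # Optional: wrap the four archive calls as lazy tasks, then run them in parallel.
    tasks = [
        dask.delayed(func)(num_sample)
        for func in (
            archives.get_gaia_lightcurves,
            archives.get_heasarc_lightcurves,
            archives.get_wise_lightcurves,
            archives.get_ztf_lightcurves,
        )
    ]
    gaia_df, heasarc_df, wise_df, ztf_df = dask.compute(*tasks)
    lightcurves_df = pd.concat([gaia_df, heasarc_df, wise_df, ztf_df])
finally:
    # Required: stop the cluster when finished.
    client.close()
    cluster.close()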

If you have a question, please ask in a comment on this issue.

troyraen commented 3 months ago

Clarifying the anticipated tasks for this GSoC project

Please reread the project description. Development of the entire Science Console is in its early stages, so we don't have much more information to share right now. I will add the few details here that I can. Implementation of a Dask cluster within the Console is being worked on in parallel by a different team. Major tasks for this GSoC project are anticipated to be:

  1. Learn how Dask is expected to be implemented within the Science Console.
  2. Determine an efficient method for running the light curve collection code at scale using the Console's planned Dask implementation.
  3. Write code implementing the solution from step 2.
  4. Test the code.
  5. Iterate steps 2-4, if needed.
  6. Begin work on a different use case, time permitting. (Repeat steps 2-5 on code other than the light curve collection.)

My guess is that the existing light curve collection code will not need to be altered, but that's a guess based on incomplete information. The most relevant detail of the collection code that was not included in the toy problem is this:

One of the functions takes much longer than the others (for large sample sizes). When running the code using Dask, this will have to be addressed in some way.

Here is what we have done to address this so far: to speed up that function, we have added parallelization internally (to that function only) using Python's multiprocessing. This is separate from the parallelization of the full code mentioned in the project description, and the two cannot be used together, which limits the effectiveness of the method demonstrated in the notebook. I have recently written a bash script that executes the same underlying Python code and outperforms the notebook because it allows that long-running function to use its internal parallelization.
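
For context: if the conflict is the usual one (Dask worker processes are daemonic by default, and daemonic processes cannot spawn multiprocessing children), one known workaround is to start the workers as non-daemonic processes. This is a guess at one way forward, not a settled design for the Console:

import dask
from dask.distributed import Client, LocalCluster

# Allow worker processes to spawn their own multiprocessing children by
# starting them as non-daemonic processes. This must be set before the
# cluster is created.
dask.config.set({"distributed.worker.daemon": False})

cluster = LocalCluster()
client = Client(cluster)

Whether something like this fits the Console's planned Dask implementation is one of the open questions for this project.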