fornax-navo / fornax-demo-notebooks

Demo notebooks for the Fornax project
https://fornax-navo.github.io/fornax-demo-notebooks/
BSD 3-Clause "New" or "Revised" License

GSOC toy problem for Dask project #239

Closed troyraen closed 3 months ago

troyraen commented 4 months ago

I do not intend to actually merge this to main, I am just using the PR to get feedback. Participants will be instructed to use the gsoc/2024/dask-toy-problem branch directly.

My plan is to open an issue with the following text describing the problem. Please review the text and/or the code in the new file archives.py.

Dask Toy Problem

This is a toy problem meant as a pre-application exercise for the GSOC project Enable Dask execution of NASA time-domain analysis.

Estimated completion time: a few hours

Please complete all required tasks and whichever optional task(s) allow you to convey your thought process most easily. The goal is to see what your starting points are, not what you could produce with several days of research and optimization.

Overview

The gsoc/2024/dask-toy-problem branch contains a directory called gsoc-dask-toy-problem/ with one file, archives.py. The file contains four public functions, get_*_lightcurves. These functions vary in their resource usage and runtimes. They take at least one argument, num_sample, which may vary between about 5 and 500,000. The task is to parallelize these functions using Dask so that they run efficiently for large sample sizes.

A basic script to run the functions serially looks like this:

import archives  # be sure to import this **first**
import pandas as pd

num_sample = 100  # may vary between about 5 and 500,000

gaia_df = archives.get_gaia_lightcurves(num_sample)
heasarc_df = archives.get_heasarc_lightcurves(num_sample)
wise_df = archives.get_wise_lightcurves(num_sample)
ztf_df = archives.get_ztf_lightcurves(num_sample)

lightcurves_df = pd.concat([gaia_df, heasarc_df, wise_df, ztf_df])
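For orientation, the serial calls above could be dispatched concurrently with dask.delayed. This is only a minimal sketch: the get_*_lightcurves functions below are hypothetical stand-ins that return trivial DataFrames, since the real implementations live in gsoc-dask-toy-problem/archives.py and are not reproduced here.

```python
import dask
import pandas as pd

# Hypothetical stand-ins for archives.get_*_lightcurves; the real functions
# vary in resource usage and runtime, which these placeholders do not model.
def get_gaia_lightcurves(num_sample):
    return pd.DataFrame({"archive": ["gaia"] * num_sample})

def get_heasarc_lightcurves(num_sample):
    return pd.DataFrame({"archive": ["heasarc"] * num_sample})

def get_wise_lightcurves(num_sample):
    return pd.DataFrame({"archive": ["wise"] * num_sample})

def get_ztf_lightcurves(num_sample):
    return pd.DataFrame({"archive": ["ztf"] * num_sample})

num_sample = 100  # may vary between about 5 and 500,000

# Wrap each call in dask.delayed so the four fetches build a task graph
# instead of running eagerly, then compute them concurrently.
delayed_dfs = [
    dask.delayed(func)(num_sample)
    for func in (
        get_gaia_lightcurves,
        get_heasarc_lightcurves,
        get_wise_lightcurves,
        get_ztf_lightcurves,
    )
]
dfs = dask.compute(*delayed_dfs)  # runs on Dask's default scheduler

lightcurves_df = pd.concat(dfs)
```

Note that dask.compute on its own uses a default local scheduler; connecting it to an explicit cluster is part of the exercise below.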

Instructions

  1. Clone this repo and check out the branch gsoc/2024/dask-toy-problem.
  2. Write code.
    • Required: Start a Dask cluster and stop it when finished. A local cluster is sufficient.
    • Optional: Execute the four archives functions and concatenate the results into a single Pandas DataFrame. Use any Dask method(s) you want. Your code should complete at least as fast as the serial script above. You can look at the code in archives.py, but do not alter it.
  3. Optional: Write text (300 words max), included as a '.md' file.
    • For any code that you did not write but would if you had more time, write down what you would do.
    • Write down any questions you have that, if answered, might help guide your design choices.
    • Test your code to determine how it scales as num_sample increases. Write down the results, your interpretation of the results, and/or what you would try next to see if it improves your code.
  4. Required: Open a PR to merge your code and (optional) writeup to the gsoc/2024/dask-toy-problem branch. Include GSoC2024 in the title of your PR.
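The required cluster lifecycle in step 2 can be sketched as follows, assuming dask.distributed is installed. The context managers guarantee the client and the local cluster are both shut down when the block exits, even on error.

```python
from dask.distributed import Client, LocalCluster

# Start a small local cluster; context managers stop the client and the
# cluster automatically when the block exits.
with LocalCluster(n_workers=2, threads_per_worker=1) as cluster:
    with Client(cluster) as client:
        # Submit one trivial task to confirm the workers are reachable.
        result = client.submit(sum, [1, 2, 3]).result()
```

Worker and thread counts here are arbitrary; tuning them for the mix of CPU-bound and I/O-bound archive calls is part of the design space the exercise asks about.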

If you have a question, please ask in a comment on this issue.

jkrick commented 3 months ago
  1. Will other people be able to see the PRs that get submitted? I might not know how to start/close a dask cluster, but I do know how to look at other PRs and copy/paste.
  2. Should we say somewhere up front that we don't expect this to take a long time, so please don't spend a long time on it? The goal is not to know what they could do with 3 days of research and optimization, but to see what their starting points are.
troyraen commented 3 months ago
> Will other people be able to see the PRs that get submitted? I might not know how to start/close a dask cluster, but I do know how to look at other PRs and copy/paste.

Yes, everyone will be able to see all PRs. I think this isn't too much of a problem since folks could also just ask a chat bot to write the code.

> Should we say somewhere up front that we don't expect this to take a long time, so please don't spend a long time on it? The goal is not to know what they could do with 3 days of research and optimization, but to see what their starting points are.

Yes, especially since this is a pre-application exercise. I'll add something to that effect. I think I'll also reduce the word limit to 300 max.