Closed by troyraen 3 months ago
- Will other people be able to see the PRs that get submitted? I might not know how to start/close a dask cluster, but I do know how to look at other PRs and copy/paste.
Yes, everyone will be able to see all PRs. I think this isn't too much of a problem since folks could also just ask a chat bot to write the code.
- Should we say somewhere up front that we don't expect this to take a long time, so please don't spend a long time on it? The goal is not to know what they could do with 3 days of research and optimization, but to see what their starting points are.
Yes, especially since this is a pre-application exercise. I'll add something to that effect. I think I'll also reduce the word limit to 300 max.
I do not intend to actually merge this to main; I am just using the PR to get feedback. Participants will be instructed to use the gsoc/2024/dask-toy-problem branch directly.
My plan is to open an issue with the following text describing the problem. Please review the text and/or the code in the new file archives.py.
Dask Toy Problem
This is a toy problem meant as a pre-application exercise for the GSoC project Enable Dask execution of NASA time-domain analysis.
Estimated completion time: a few hours
Please complete all required tasks and whichever optional task(s) allows you to convey your thought process most easily. The goal is to see what your starting points are, not what you could do with several days of research and optimization.
Overview
The gsoc/2024/dask-toy-problem branch contains a directory called gsoc-dask-toy-problem/ with one file, archives.py. The file contains four public functions, get_*_lightcurves. These functions vary in their resource usage and runtimes. They take at least one argument, num_sample, which may vary between about 5 and 500,000. The task is to parallelize these functions using Dask so that they run efficiently for large sample sizes.
A basic script to run the functions serially looks like this:
Instructions
- [...] gsoc/2024/dask-toy-problem.
- [...] archives functions and concatenate the results into a single Pandas DataFrame. Use any Dask method(s) you want. Your code should complete at least as fast as the serial script above. You can look at the code in archives.py, but do not alter it.
- [...] num_sample increases. Write down the results, your interpretation of the results, and/or what you would try next to see if it improves your code.
- [...] gsoc/2024/dask-toy-problem branch. Have GSoC2024 in the title of your PR.
If you have a question, please ask in a comment on this issue.
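For reviewers who want a concrete picture of the kind of submission the instructions above invite, here is a minimal sketch of one possible approach. It is not the reference solution. The contents of archives.py are not shown here, so the four functions are replaced by hypothetical stand-ins built by `_make_stand_in`; the column names, the sleep delay, and the helper names `run_serial` and `run_with_dask` are all assumptions made for illustration. It uses `dask.delayed` with the local threaded scheduler, so no cluster has to be started or closed; a submission could just as well use a distributed cluster or another Dask method.

```python
import time

import dask
import pandas as pd


def _make_stand_in(archive_name, delay=0.05):
    """Build a hypothetical stand-in for one get_*_lightcurves function.

    The real functions live in archives.py and are NOT reproduced here;
    this one only sleeps briefly and returns a small fake DataFrame.
    """

    def get_lightcurves(num_sample):
        time.sleep(delay)  # pretend to wait on a remote archive
        return pd.DataFrame(
            {
                "objectid": list(range(num_sample)),
                "archive": [archive_name] * num_sample,
            }
        )

    return get_lightcurves


# Four stand-ins, mirroring the four public functions described in the overview.
STAND_INS = [_make_stand_in(f"archive_{i}") for i in range(4)]


def run_serial(num_sample):
    """Call each function in turn and concatenate the results."""
    frames = [func(num_sample) for func in STAND_INS]
    return pd.concat(frames, ignore_index=True)


def run_with_dask(num_sample):
    """Wrap each call in dask.delayed, compute them together, then concatenate."""
    tasks = [dask.delayed(func)(num_sample) for func in STAND_INS]
    # dask.compute returns results in the same order as the tasks passed in.
    frames = dask.compute(*tasks, scheduler="threads")
    return pd.concat(frames, ignore_index=True)
```

Calling `run_with_dask(num_sample)` for a few values of `num_sample` and timing it against `run_serial` (for example with `time.perf_counter`) is one simple way to start on the scaling question. Threads suit stand-ins that mostly wait; if the real functions are CPU-bound, a process-based or distributed scheduler may behave quite differently.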