troyraen opened 3 months ago
Please reread the project description. Development of the entire Science Console is in its early stages, so we don't have much more information to share right now. I will add the few details here that I can. Implementation of a Dask cluster within the Console is being worked on in parallel by a different team. Major tasks for this GSoC project are anticipated to be:
My guess is that the existing light curve collection code will not need to be altered, but that's a guess based on incomplete information. The most relevant detail of the collection code that was not included in the toy problem is this:
One of the functions takes much longer than the others (for large sample sizes). When running the code using Dask, this will have to be addressed in some way.
Here is what we have done to address this so far:
To speed up that function, we added parallelization internally (to that function only) using Python's `multiprocessing`. This is separate from the parallelization of the full code mentioned in the project description, and the two cannot be used together, which limits the effectiveness of the method demonstrated in the notebook. I recently wrote a bash script that executes the same underlying Python code and outperforms the notebook because it allows the long-running function to use its internal parallelization.
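As an illustration only (this is not the actual `archives.py` code), internal parallelization of a single slow function with `multiprocessing` might look like the sketch below. The function and worker names are hypothetical stand-ins:

```python
import multiprocessing as mp

def _fetch_one(target_id):
    # hypothetical per-target work; the real function would query an archive
    return {"id": target_id, "flux": float(target_id) * 0.5}

def get_slow_lightcurves(num_sample, nworkers=4):
    # hypothetical stand-in for the long-running get_*_lightcurves function,
    # parallelized internally with a process pool as described above
    with mp.Pool(processes=nworkers) as pool:
        return pool.map(_fetch_one, range(num_sample))
```

One reason such an internal pool may not combine cleanly with process-based Dask parallelism is that worker processes are often daemonic, and daemonic processes cannot spawn child processes of their own.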
Dask Starter Problem
This is a toy problem meant as a pre-application exercise for the GSoC project Enable Dask execution of NASA time-domain analysis.
Estimated completion time: a few hours
Please complete all required tasks and whichever optional task(s) allows you to convey your thought process most easily. The goal is to see what your starting points are, not what you could do with several days of research and optimization.
Overview
The `gsoc/2024/dask-toy-problem` branch contains a directory called `gsoc-dask-toy-problem/` with one file, `archives.py`. The file contains four public functions, `get_*_lightcurves`. These functions vary in their resource usage and runtimes. They take at least one argument, `num_sample`, which may vary between about 5 and 500,000.

The task is to parallelize these functions using Dask so that they run efficiently for large sample sizes. A basic script to run the functions serially looks like this:
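The serial script itself did not survive in this copy of the issue. A minimal sketch of such a script, with hypothetical stub functions standing in for the real `get_*_lightcurves`, might look like:

```python
import pandas as pd

# Hypothetical stand-ins for the four public functions in archives.py;
# the real ones fetch light curves from different archives.
def get_fast_lightcurves(num_sample):
    return pd.DataFrame({"archive": "fast", "target": range(num_sample)})

def get_slow_lightcurves(num_sample):
    return pd.DataFrame({"archive": "slow", "target": range(num_sample)})

def main(num_sample=100):
    # call each function in turn and concatenate the results
    functions = [get_fast_lightcurves, get_slow_lightcurves]
    dfs = [func(num_sample) for func in functions]
    return pd.concat(dfs, ignore_index=True)
```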
Instructions

1. Check out the `gsoc/2024/dask-toy-problem` branch.
2. Write code that executes the `archives` functions and concatenates the results into a single Pandas DataFrame. Use any Dask method(s) you want. Your code should complete at least as fast as the serial script above. You can look at the code in `archives.py`, but do not alter it.
3. Observe how your code performs as `num_sample` increases. Write down the results, your interpretation of the results, and/or what you would try next to see if it improves your code.
4. Submit a pull request against the `gsoc/2024/dask-toy-problem` branch. Have GSoC2024 in the title of your PR.

If you have a question, please ask in a comment on this issue.
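One simple way to approach the parallelization task is with `dask.delayed`. The sketch below uses hypothetical stub functions in place of the real `archives` module:

```python
import dask
import pandas as pd

# Hypothetical stubs for the archives.get_*_lightcurves functions.
def get_fast_lightcurves(num_sample):
    return pd.DataFrame({"archive": "fast", "target": range(num_sample)})

def get_slow_lightcurves(num_sample):
    return pd.DataFrame({"archive": "slow", "target": range(num_sample)})

def run_parallel(num_sample):
    # wrap each call in a delayed task so Dask can schedule the functions
    # concurrently, then compute them all and concatenate the results
    tasks = [dask.delayed(func)(num_sample)
             for func in (get_fast_lightcurves, get_slow_lightcurves)]
    dfs = dask.compute(*tasks)
    return pd.concat(dfs, ignore_index=True)
```

`dask.delayed` is only one option; `dask.dataframe` or the `distributed` futures API could also fit, depending on how the functions behave as `num_sample` grows.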