MetOffice / XBTs_classification

Project for the classification of eXpendable Bathy Thermographs
BSD 3-Clause "New" or "Revised" License

Investigate use of dask for improved performance #2

Open stevehadd opened 4 years ago

stevehadd commented 4 years ago

Loading of data can be quite slow for what is not a very big dataset. Running the code on Azure Pangeo, we should be able to make use of dask to improve performance. This might require use of Dask-ML: https://dask-ml.readthedocs.io/en/latest/

stevehadd commented 4 years ago

A quick test showed that the dask dataframe read_csv function was struggling to read the CSV files as I have currently created them, so we might need to do further preprocessing before we can try out dask performance.

stevehadd commented 4 years ago

Trying to run the data exploration notebook on the MO scitools environment shows that we can't load it all into memory on a VDI using pandas, so we may need to revisit loading with dask to be able to work with the whole dataset more easily.

Update: This is less urgent now, as SPICE interactive sessions can be used to get enough memory to hold the whole dataset.

stevehadd commented 4 years ago

Further investigation shows that it is the depth profile and temperature profile columns that are causing problems for the dask.dataframe.read_csv function. If I read in the CSV using pandas, remove those columns from the pandas dataframe, write a new CSV file and then attempt to read it using dask, the read is OK. So to fix this we will need to reformat those profile columns when recreating the CSV files.

stevehadd commented 4 years ago

Still getting some issues having recreated the CSV files with better formatting of the depth and temperature profiles, so further investigation is required to see where the problems remain.

stevehadd commented 4 years ago

Dask-ML documentation: https://ml.dask.org/

Some docs for using dask on SPICE via the SLURM cluster functionality:
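For reference, launching dask workers as SLURM jobs usually goes through the dask-jobqueue package; a rough sketch is below. The resource values are placeholders, not SPICE's real queue settings, and this only runs on a machine with a SLURM scheduler available:

```python
# Hypothetical SLURM cluster setup via dask-jobqueue; cores, memory and
# walltime are placeholders, not SPICE's real settings.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=4,            # cores per SLURM job
    memory="16GB",      # memory per SLURM job
    walltime="01:00:00",
)
cluster.scale(jobs=2)   # submit two worker jobs to SLURM
client = Client(cluster)
# ... dask computations now run on the SLURM-managed workers ...
```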

Hyperparameter tuning using dask: https://ml.dask.org/hyper-parameter-search.html