stevehadd opened this issue 4 years ago
A quick test showed that the dask dataframe read_csv function was struggling to read the CSV files as I have currently created them, so we might need to do further preprocessing before we can evaluate dask's performance.
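For reference, a minimal sketch of the kind of workaround that sometimes helps when dask's dtype inference chokes on awkward columns (the file glob here is hypothetical):

```python
import dask.dataframe as dd

# Hypothetical path; dask infers dtypes from a sample of each file,
# so widening the sample or forcing everything to string can avoid
# mismatched-dtype errors on awkward columns.
df = dd.read_csv(
    "profiles-*.csv",
    dtype=str,          # parse everything as string, convert later
    sample=1_000_000,   # bytes used for dtype/structure inference
)
print(df.head())
```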
Trying to run the data exploration notebook on the MO scitools environment shows that we can't load it all in memory on a VDI using pandas, so we may need to revisit loading using dask to be able to work with the whole dataset more easily.
Update: This is less urgent now, as SPICE interactive sessions can be used to get enough memory to load the whole dataset.
Further investigation shows that it is the depth profile and temperature profile columns that are causing problems for the dask.dataframe.read_csv function. If I read in the CSV using pandas, remove those columns from the pandas dataframe, write a new CSV file and then attempt to read using dask, the read is OK. So to fix this we will need to reformat those profile columns when creating the CSV files.
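A sketch of that diagnostic round trip, with hypothetical file and column names standing in for the real ones:

```python
import pandas as pd
import dask.dataframe as dd

# Load the full CSV with pandas (works, but needs a lot of memory).
df = pd.read_csv("dataset.csv")

# Drop the columns that trip up dask's parser; the names here are
# hypothetical stand-ins for the actual profile columns.
df = df.drop(columns=["depth_profile", "temperature_profile"])

# Write a cleaned copy and confirm dask can read it back.
df.to_csv("dataset_noprofiles.csv", index=False)
ddf = dd.read_csv("dataset_noprofiles.csv")
print(ddf.head())
```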
Still getting some issues having recreated the CSV files with better formatting of the depth and temperature profiles, so further investigation is required to see where the problems remain.
Some docs for using dask on SPICE via the SLURM cluster functionality:
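For what it's worth, the usual route for this is dask-jobqueue (https://jobqueue.dask.org/), which provides a SLURMCluster class. A minimal sketch; the queue name and resource sizes are placeholders that would need to match SPICE's SLURM configuration:

```python
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Resource settings are placeholders; match them to SPICE's SLURM setup.
cluster = SLURMCluster(
    queue="normal",       # hypothetical partition name
    cores=8,
    memory="32GB",
    walltime="01:00:00",
)
cluster.scale(jobs=4)     # request four SLURM jobs' worth of workers
client = Client(cluster)
print(client.dashboard_link)
```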
Hyperparameter tuning using dask: https://ml.dask.org/hyper-parameter-search.html
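dask-ml offers drop-in replacements for scikit-learn's search classes that spread the tuning work across a dask cluster. A minimal sketch; the estimator and parameter grid are purely illustrative:

```python
from dask.distributed import Client
from dask_ml.model_selection import GridSearchCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

client = Client()  # or Client(cluster) for a SLURM-backed cluster

# Toy data so the sketch is self-contained.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Illustrative search space; the real grid would depend on the model.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```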
Loading of data can be quite slow for what is not a very big dataset. Running the code on Azure Pangeo, we should be able to make use of dask to improve performance. This might require use of dask-ml: https://dask-ml.readthedocs.io/en/latest/
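A sketch of what parallel loading might look like on Pangeo, assuming a dask Client is available and using hypothetical file paths:

```python
import dask.dataframe as dd
from dask.distributed import Client

# On Pangeo the cluster is usually provisioned via dask-gateway or
# similar; Client() here is a stand-in for however workers are obtained.
client = Client()

# Reading a glob of CSVs gives one partition per file (or per block),
# so the parse and any downstream work are spread across the workers.
ddf = dd.read_csv("data/part-*.csv")
ddf = ddf.persist()   # keep the parsed data in distributed memory
print(len(ddf))
```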