boom-lab / argo2parquet-public


Use of argopy for comparisons #2

Open · gmaze opened 3 days ago

gmaze commented 3 days ago

Hi @enrico-mi, very nice notebooks! I noticed in https://github.com/boom-lab/argo2parquet-public/blob/main/notebooks/Example_1_Map.ipynb that you were comparing against fetching data from the Ifremer Erddap server using argopy. Very nice use case!

As you mentioned in the notebook, your request hits a timeout from the Erddap server, indeed because the selection is very large and takes more than 1 min for the server to prepare.

A few remarks:

1. The argopy trick for handling large requests is to use the parallel=True argument, so that the Erddap request is chunked and the server handles each chunk separately (see the sketch after this list). More details in the argopy documentation here: https://argopy.readthedocs.io/en/latest/performances.html#parallel-data-fetching

2. The screenshot below shows that the Erddap server is able to handle the large request (argopy is just the intermediary here). [Screenshot 2024-09-16 at 13 57 37: Erddap server handling the request]

3. ... but it takes about 12 min, so the local+parquet solution is much faster than remote+erddap+netcdf.
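For reference, a minimal sketch of such a parallel fetch with argopy; the region box below is a made-up example, not the notebook's actual query:

```python
from argopy import DataFetcher

# parallel=True chunks the Erddap request so the server prepares
# several smaller responses instead of one huge one.
# Box format: [lon_min, lon_max, lat_min, lat_max,
#              pres_min, pres_max, date_start, date_end]
fetcher = DataFetcher(src="erddap", parallel=True).region(
    [-75, -45, 20, 30, 0, 1000, "2021-01", "2021-06"]
)
ds = fetcher.to_xarray()
```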

So I'm very curious to see how remote+parquet behaves!

enrico-mi commented 3 days ago

Hi @gmaze, thanks a lot for your feedback. I suspected my argopy fetching was not optimized, and I have included your changes in the notebook. It now takes only 7 minutes on my end.

I tried my cloud solution, and it takes less than 3 minutes for the same task: about 1 min to set up the cloud environment and a bit less than 2 minutes to retrieve the data with the pyarrow+pandas approach (a sketch of that pattern follows below). I'll try with dask in the coming days and hopefully upload a notebook with both before Friday's meeting.
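For context, a minimal sketch of what the pyarrow+pandas pattern mentioned above can look like; the dataset path, column names, and filter are placeholders, not the actual argo2parquet layout:

```python
import pyarrow.dataset as ds

# Hypothetical bucket/prefix, for illustration only.
dataset = ds.dataset("s3://example-bucket/argo-parquet/", format="parquet")

# Column projection and predicate pushdown: only the requested columns
# and the row groups matching the filter are actually read.
table = dataset.to_table(
    columns=["PLATFORM_NUMBER", "JULD", "LATITUDE", "LONGITUDE", "TEMP"],
    filter=(ds.field("LATITUDE") >= 20) & (ds.field("LATITUDE") <= 30),
)
df = table.to_pandas()
```

A dask variant would read the same files lazily with dask.dataframe.read_parquet and defer the compute.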

Note also that my parquet database uses the profile files (those ending in '_Sprof.nc' for the BGC floats), so if argopy defaults to fetching each individual profile, it is probably sorting through more dispersed data.

gmaze commented 1 day ago

Thanks for updating the notebook @enrico-mi! I noticed that the text still reports an Exception being thrown, which I guess should no longer be the case with the parallel=True option.

Your performance numbers are very promising! I would love to add a data source fetcher for this format in argopy, in order to compare it with other solutions in the argopy data processing chain. Is there any chance the experimental parquet files will be available on AWS S3?
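If the files do end up on S3, consuming them could look like the sketch below; the bucket URL and the anonymous-access option are pure assumptions for illustration (and reading them requires s3fs alongside pandas):

```python
import pandas as pd

# Hypothetical bucket/prefix; anonymous access is assumed here and
# would depend on how the bucket is actually configured.
df = pd.read_parquet(
    "s3://example-bucket/argo-parquet/bgc/",
    storage_options={"anon": True},
    columns=["PLATFORM_NUMBER", "JULD", "DOXY"],
)
```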

Also note that with the Ifremer erddap as a data source, argopy fetches data from '_prof.nc' files, or '_Sprof.nc' for the BGC floats, so we're accessing individual profile files.

enrico-mi commented 1 day ago

Thank you for catching that @gmaze, I knew I shouldn't have pushed at the end of the day!

About access to the AWS S3 files: we need to discuss that internally, as there are costs associated with it. I'll get back to you about this, but I'd be happy if it were possible. I'm also happy to be involved in developing argopy access to the parquet data if that helps.

Great news on the data sources: I am also converting the '_prof.nc' and '_Sprof.nc' files to parquet, so we are accessing the same data. Is that also the case for the Core data with argopy-erddap?

gmaze commented 1 day ago

my sentence was not very clear:

enrico-mi commented 1 day ago

Thanks, then yes, we are fetching the same data.