gmaze opened 3 days ago
Hi @gmaze, thanks a lot for your feedback. I suspected that my argopy fetching was not optimized, and I included your changes in the notebook. It took only 7 minutes on my end.
I tried my cloud solution, and it takes less than 3 minutes for the same task: about 1 minute to set up the cloud environment and a bit less than 2 minutes to retrieve the data with the pyarrow+pandas approach. I'll try with dask in the coming days and hopefully upload a notebook with both before Friday's meeting.
Note also that my parquet database uses the profile files (those ending in '_Sprof.nc' for the BGC floats), so if argopy defaults to fetching each individual profile, it's probably sorting through more dispersed data.
thanks for updating the notebook @enrico-mi !
I noticed that the text still reports an Exception being thrown, which I guess should not be the case anymore with the parallel=True option
Your performance numbers are very promising! I would love to add a data source fetcher for this format in argopy, in order to compare with other solutions in the argopy data processing chain. Is there any chance the experimental parquet files would be available on AWS S3?
Also note that with the Ifremer erddap as a data source, argopy fetches data from '_prof.nc' or '_Sprof.nc' for the BGC floats, so we're accessing individual profile files.
Thank you for catching that @gmaze, I knew I shouldn't have pushed at the end of the day!
About access to the AWS S3 files: we need to discuss that internally, as there are costs associated with it -- I'll get back to you about this, but I'd be happy if it were possible. I'm also happy to be involved in the development of argopy access to the parquet data if it helps.
Great for the data sources: I am also converting the '_prof.nc' and '_Sprof.nc' files to parquet, so we are accessing the same data. Is that the case also for the Core data for argopy-erddap?
My sentence was not very clear:
- phy: we fetch data for the core+deep missions from _prof.nc files served by the Erddap
- bgc: we fetch data for the BGC missions from _Sprof.nc files served by the Erddap

Thanks, then yes, we are fetching the same data.
Hi @enrico-mi, very nice notebooks! I noticed in https://github.com/boom-lab/argo2parquet-public/blob/main/notebooks/Example_1_Map.ipynb that you were trying to compare with fetching data from the Ifremer Erddap server using argopy. Very nice use case!
As you mentioned in the notebook, your request hits a timeout on the Erddap server, indeed because the selection is very large and takes more than 1 min to prepare.
A few remarks:
1. you could use the parallel=True argument, so that the Erddap request is chunked and the server handles each chunk separately. More details in the argopy documentation here: https://argopy.readthedocs.io/en/latest/performances.html#parallel-data-fetching
2. the snapshot below shows that the Erddap server is able to handle the large request (argopy is just the intermediary here)
So I'm very curious to see how the remote+parquet behaves !