euroargodev / argopy

A python library for Argo data beginners and experts
https://argopy.readthedocs.io
European Union Public License 1.2

Inconsistent behavior across data sources and floats without conductivity sensors #288

Closed dhruvbalwada closed 1 year ago

dhruvbalwada commented 1 year ago

@andrewfagerheim and I have noticed that not all floats have the same core data variables. In particular, some floats do not have conductivity sensors. One example is float 29029, but there are many more. This causes a complete failure if you try to load floats over a region that contains any of these floats (from any data source apart from erddap).

This reopens issue #228, now that we have had time to dig deeper into the problem. Originally @gmaze told us that he could not reproduce our problem because he was using erddap, and we now partially understand why (as explained below). erddap is a great way to access small parts of the data, but it is not very useful for a global analysis, as it runs into problems when the datasets get too large (e.g. #287). Some of these problems can be mitigated through the tricks explained here, but ideally we would be able to use a local gdac for global or large data analyses, since no other method can match the speed of that option.

We would really appreciate some help on this matter from the devs, who understand Argo data access much better than we do.

MCVE Code Sample

from argopy import DataFetcher as ArgoDataFetcher

local_gdac = ArgoDataFetcher(src='gdac', ftp="/swot/SUM05/dbalwada/Argo_sync")  # this line won't work if you don't have a local gdac copy at this path
erddap = ArgoDataFetcher(src='erddap')
ftp_ifremer_gdac = ArgoDataFetcher(src='gdac',ftp="ftp://ftp.ifremer.fr/ifremer/argo")
ftp_usgodae_gdac = ArgoDataFetcher(src='gdac',ftp="ftp://usgodae.org/pub/outgoing/argo")
argovis = ArgoDataFetcher(src='argovis')

ds_float_local = local_gdac.float(29029).load()
ds_float_local.to_xarray()

ds_float_ftp_i = ftp_ifremer_gdac.float(29029).load()
ds_float_ftp_i.to_xarray()

ds_float_ftp_u = ftp_usgodae_gdac.float(29029).load()
ds_float_ftp_u.to_xarray()

ds_float_erddap = erddap.float(29029).load()
ds_float_erddap.to_xarray()

ds_float_argovis = argovis.float(29029).load()
ds_float_argovis.to_xarray()
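To make the comparison between sources easier to read, the variable names returned by each fetcher can be compared directly. The helper below is only a sketch: `compare_variables` is a hypothetical name (not part of argopy), and the source names and variable lists in the usage lines are made up for illustration, not actual results for float 29029.

```python
def compare_variables(var_sets):
    """Given {source_name: iterable of variable names}, list what each source lacks.

    A variable counts as "missing" for a source if at least one other
    source provides it.
    """
    all_vars = set().union(*var_sets.values())
    return {src: sorted(all_vars - set(names)) for src, names in var_sets.items()}


# Illustrative input only (hypothetical sources and variables).
# With argopy, each entry could be built from a loaded fetcher, e.g.
#   set(ds_float_erddap.to_xarray().data_vars)
example = {
    "source_a": ["PRES", "TEMP", "PSAL"],
    "source_b": ["PRES", "TEMP"],
}
missing = compare_variables(example)
print(missing)  # {'source_a': [], 'source_b': ['PSAL']}
```

A source that raises an error while loading would need to be handled separately (for example with a try/except around the load call) before its variables can be compared.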

Expected Output

The expected output is that all the ds_float_* objects show the same data, which looks something like:

[Screenshot: expected output]

Problem Description

However, the behavior is the following:

Versions

Argopy version 0.1.12

More details.

dhruvbalwada commented 1 year ago

As a follow-up, a second thing we noticed is a difference in response between the ftp gdac and the local gdac (the local gdac was created this morning using the rsync instructions available in the documentation).

This can be checked by comparing the output of the following 4 calls:

It is interesting that the output of the last 2 calls differs, suggesting that the data fetcher behaves differently for local and remote ftp sources. Regardless, the xarray dataset returned from the ftp (when we don't get an error) is not acceptable: we would prefer to get salinity from floats for which salinity data exists, and get back no data from floats that have no salinity sensor (or get back a variable populated with missing values and an appropriate QC flag).

dhruvbalwada commented 1 year ago

Updating to the latest version seems to have solved this. Closing for now.

gmaze commented 1 year ago

Hi @dhruvbalwada. Indeed, depending on the data source, one or more chunks of data eventually have to be merged by argopy. Until now, the default internal argopy policy has been to merge "down", i.e. to drop variables that are not available in all chunks. This is a simple way to avoid building large datasets that are not always necessary.

Regardless, the returned xarray dataset from the ftp (when we don't get error) is not acceptable as we would prefer to get salinity from floats that the salinity data exists and get back no data from floats that have no salinity sensor

This is not the preference for everyone in all situations, especially when dealing with BGC variables.

That being said, with the last 0.14rc2 release, argopy lets users modify this behaviour for the BGC dataset using the erddap data source. We plan on making this choice available to all data sources and datasets in the very near future.
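The two merge policies discussed in this thread can be sketched in a few lines. This is only an illustration under stated assumptions, not argopy's implementation: plain dictionaries stand in for the xarray datasets, `merge_down` and `merge_up` are hypothetical names, and the values are made up.

```python
import math


def merge_down(chunks):
    """Keep only the variables present in every chunk, then concatenate values."""
    if not chunks:
        return {}
    common = set(chunks[0])
    for chunk in chunks[1:]:
        common &= set(chunk)
    return {
        name: [v for chunk in chunks for v in chunk[name]]
        for name in sorted(common)
    }


def merge_up(chunks, length_key="PRES"):
    """Keep every variable; pad chunks that lack one with NaN.

    Assumes every chunk carries `length_key`, used to size the padding.
    """
    all_vars = sorted(set().union(*chunks))
    merged = {name: [] for name in all_vars}
    for chunk in chunks:
        n = len(chunk[length_key])  # number of points in this chunk
        for name in all_vars:
            merged[name].extend(chunk.get(name, [math.nan] * n))
    return merged


# Toy chunks (made-up values): the second one has no salinity (PSAL).
chunk_with_psal = {"PRES": [5.0, 10.0], "TEMP": [20.1, 19.8], "PSAL": [35.1, 35.2]}
chunk_without_psal = {"PRES": [5.0, 10.0], "TEMP": [18.3, 18.0]}
chunks = [chunk_with_psal, chunk_without_psal]

down = merge_down(chunks)
up = merge_up(chunks)
print(sorted(down))  # ['PRES', 'TEMP']
print(up["PSAL"])    # [35.1, 35.2, nan, nan]
```

Merging "down" silently drops PSAL as soon as one chunk lacks it, which is the behaviour reported above; merging "up" keeps it at the cost of a larger, partly empty dataset.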