euroargodev / argopy

A python library for Argo data beginners and experts
https://argopy.readthedocs.io
European Union Public License 1.2

New functions and methods dedicated to ARGO-BGC, from work at IMEV #278

Closed gmaze closed 1 year ago

gmaze commented 1 year ago

Temporary set-up

New features developed

Post trip todo list before merge

More things to look at

gmaze commented 1 year ago

search BGC index by parameter data mode

from argopy.stores import indexstore_pd as ArgoIndex  # Make sure to work with the pandas index store (as in Raphaelle's code)
idx = ArgoIndex(index_file='argo_bio-profile_index.txt', cache=True).load()
idx.search_parameter_data_mode({'BBP700': 'D'})
idx.search_parameter_data_mode({'BBP700': 'D', 'DOXY': 'D'}, logical='or')
idx.search_parameter_data_mode({'DOXY': ['R', 'A']})
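To make the matching rule behind these calls explicit, here is a minimal pure-Python sketch. It assumes that each index record carries a space-separated `parameters` string and a `parameter_data_mode` string with one letter per parameter, in the same order. The helper names and the three records are made up for illustration; this is not the argopy implementation.

```python
def data_mode_of(record, param):
    """Return the data mode letter of `param` in one index record, or '' if absent."""
    variables = record["parameters"].split()
    if param not in variables:
        return ""
    return record["parameter_data_mode"][variables.index(param)]


def search_parameter_data_mode(records, criteria, logical="and"):
    """Select records whose parameters have one of the requested data modes.

    `criteria` maps a parameter name to one data mode or to a list of data modes.
    """
    combine = all if logical == "and" else any
    selected = []
    for record in records:
        tests = []
        for param, modes in criteria.items():
            allowed = [modes] if isinstance(modes, str) else list(modes)
            tests.append(data_mode_of(record, param) in allowed)
        if combine(tests):
            selected.append(record)
    return selected


# Three made-up index records (not real Argo data)
records = [
    {"file": "a.nc", "parameters": "PRES TEMP PSAL DOXY", "parameter_data_mode": "RRRD"},
    {"file": "b.nc", "parameters": "PRES TEMP PSAL DOXY BBP700", "parameter_data_mode": "DDDAD"},
    {"file": "c.nc", "parameters": "PRES TEMP PSAL BBP700", "parameter_data_mode": "RRRR"},
]

both_d = search_parameter_data_mode(records, {"BBP700": "D", "DOXY": "D"})
either_d = search_parameter_data_mode(records, {"BBP700": "D", "DOXY": "D"}, logical="or")
doxy_ra = search_parameter_data_mode(records, {"DOXY": ["R", "A"]})
```

With these records, the `and` search matches nothing, the `or` search matches `a.nc` and `b.nc`, and the list form matches only `b.nc`.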

todo:

gmaze commented 1 year ago

Check search_parameter_data_mode output:

idx = ArgoIndex(index_file='argo_bio-profile_index.txt').load()
<argoindex.pandas>
Host: https://data-argo.ifremer.fr/
Index: argo_bio-profile_index.txt
Convention: argo_bio-profile_index (Bio-Profile directory file of the Argo GDAC)
Loaded: True (286046 records)
Searched: False

import numpy as np  # needed for the np.sum checks below

# param = 'BBP470'
param = 'DOXY'
n = []
for dm in ['R', 'A', 'D', ' ', '']: 
    # Blank string is where no data mode is found
    # Empty string will count profiles without the parameter
    n.append(idx.search_parameter_data_mode({param: dm}).N_MATCH)
    print("mode='%s', N=%i" % (dm, n[-1]))
mode='R', N=23822
mode='A', N=74357
mode='D', N=174362
mode=' ', N=468
mode='', N=13063

# Number of profiles with this PARAMETER (and decomposition by data mode):
idx.search_params(param).N_MATCH, np.sum(n[0:-1])
(273009, 273009)

# Check that we found all expected data modes:
idx.N_RECORDS, np.sum(n)
(286072, 286072)
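The check above relies on the data modes forming a partition of the index: every record falls in exactly one of 'R', 'A', 'D', ' ' (parameter present, no data mode) or '' (parameter absent). The sketch below reproduces that census on a few made-up records, under the same assumption about the `parameters` and `parameter_data_mode` columns. The `data_mode_census` helper is hypothetical.

```python
from collections import Counter


def data_mode_census(records, param):
    """Count records per data mode of `param`; '' counts records without the parameter."""
    counts = Counter()
    for record in records:
        variables = record["parameters"].split()
        if param in variables:
            counts[record["parameter_data_mode"][variables.index(param)]] += 1
        else:
            counts[""] += 1
    return counts


# Made-up records; the third one has a blank data mode for DOXY
records = [
    {"parameters": "PRES TEMP PSAL DOXY", "parameter_data_mode": "RRRD"},
    {"parameters": "PRES TEMP PSAL DOXY", "parameter_data_mode": "DDDA"},
    {"parameters": "PRES TEMP PSAL DOXY", "parameter_data_mode": "RRR "},
    {"parameters": "PRES TEMP PSAL BBP700", "parameter_data_mode": "RRRR"},
]

census = data_mode_census(records, "DOXY")
n_with_param = sum(n for mode, n in census.items() if mode != "")  # records holding DOXY
n_total = sum(census.values())  # must equal the number of records
```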

gmaze commented 1 year ago

Scatter maps with one BGC variable data mode

# Load the index of synthetic profiles (this works with the B index as well):
idx = ArgoIndex(index_file='argo_synthetic-profile_index.txt').load()

# Define a parameter to work with:
# param = 'BBP470'
param = 'DOXY'
# param = 'BBP700'

# Search parameter profiles:
idx.search_params(param)

# Then add a search in time (just to make a readable map):
idx.index = idx.search  # Trick to be able to chain multiple search methods with a single idx instance
idx.search_tim([-180,180,-90,90,'2023-01','2023-07'])

# Export the index dataframe:
df = idx.to_dataframe()

# To make the data mode plot, we need the data mode in a single column,
# so we add a new column with the DATA_MODE of the PARAMETER:
df["variables"] = df["parameters"].apply(lambda x: x.split())
df["%s_DM" % param] = df.apply(lambda x: x['parameter_data_mode'][x['variables'].index(param)] if param in x['variables'] else '', axis=1)

# Finally plot the map:
from argopy.plot import scatter_map
scatter_map(df,
            hue="%s_DM" % param,
            cmap="data_mode",
            figsize=(10,6),
            markersize=2,
            markeredgecolor=None,
            traj=False,  # Because some floats do weird things around 180/-180
            set_global=False,
            legend_title='%s data mode' % param)

[Screenshot, 2023-06-08: scatter map of profiles colored by the data mode of the selected parameter]

gmaze commented 1 year ago

DataFetcher new arguments

To handle the large number of BGC variables (120!), we must find a way to be specific about what we want and need. So in this PR, we experiment with the following:

The params argument. Use it to specify which variables will be returned, whatever their values or availability in the floats returned by the access point.

DataFetcher(ds='bgc')  # All variables found in the access point will be returned
DataFetcher(ds='bgc', params='all')  # All variables found in the access point will be returned
DataFetcher(ds='bgc', params='DOXY') # Only the DOXY variable will be returned
DataFetcher(ds='bgc', params=['DOXY', 'BBP700']) # Only DOXY and BBP700 will be returned
DataFetcher(ds='bgc', params=['*DOXY*', '*BBP*']) # All variables with DOXY or BBP in their name will be returned

The core parameters PRES, TEMP and PSAL will always be returned.
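As an illustration of how a params value could be turned into a concrete variable list, here is a small sketch using shell-style wildcards from the standard library `fnmatch` module. The `resolve_params` helper and the `available` list are assumptions made for this example and do not describe how argopy does it internally.

```python
from fnmatch import fnmatchcase

CORE = ["PRES", "TEMP", "PSAL"]  # always returned


def resolve_params(params, available):
    """Expand a params value ('all', a name, or a list with wildcards) into variable names."""
    if params is None or params == "all":
        wanted = list(available)
    else:
        patterns = [params] if isinstance(params, str) else list(params)
        wanted = [v for v in available if any(fnmatchcase(v, p) for p in patterns)]
    # Core parameters come first and are never duplicated
    return CORE + [v for v in wanted if v not in CORE]


# Made-up list of variables found in an access point
available = ["PRES", "TEMP", "PSAL", "DOXY", "DOXY2", "BBP700", "BBP470", "CHLA"]

everything = resolve_params("all", available)
only_doxy = resolve_params("DOXY", available)
wildcards = resolve_params(["*DOXY*", "*BBP*"], available)
```

In this sketch, `'DOXY'` selects the exact name only, while `'*DOXY*'` also picks up `DOXY2`.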

The measured argument. Use it to specify which variables cannot be NaN and must return values. This is very useful to reduce a dataset to points where all variables are available.

DataFetcher(ds='bgc', measured='all')  # No variable found in the access point can be NaN
DataFetcher(ds='bgc', measured='DOXY') # Only DOXY cannot be NaN
DataFetcher(ds='bgc', measured=['DOXY', 'BBP700']) # Only DOXY and BBP700 cannot be NaN
DataFetcher(ds='bgc', measured=None)  # No constraint: all variables are allowed to have NaNs
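The effect of measured on a dataset can be pictured as a row filter. The sketch below applies it to a few made-up measurement points stored as plain dictionaries. The `keep_measured` helper is hypothetical and only mimics the intended behaviour; in the PR the constraint is applied on the server side.

```python
import math


def is_nan(value):
    """True for a float NaN, False for anything else."""
    return isinstance(value, float) and math.isnan(value)


def keep_measured(points, measured=None):
    """Keep only the points where every variable listed in `measured` is not NaN."""
    if measured is None:
        return list(points)  # no constraint at all
    if measured == "all":
        required = {name for point in points for name in point}
    elif isinstance(measured, str):
        required = {measured}
    else:
        required = set(measured)
    # A variable missing from a point is treated like a NaN
    return [p for p in points
            if not any(is_nan(p.get(name, math.nan)) for name in required)]


nan = math.nan
# Made-up measurement points
points = [
    {"PRES": 5.0, "TEMP": 12.1, "PSAL": 35.0, "DOXY": 210.0, "BBP700": 0.001},
    {"PRES": 10.0, "TEMP": 12.0, "PSAL": 35.0, "DOXY": nan, "BBP700": 0.002},
    {"PRES": 15.0, "TEMP": 11.8, "PSAL": 35.1, "DOXY": 205.0, "BBP700": nan},
]

largest = keep_measured(points, None)
with_doxy = keep_measured(points, "DOXY")
with_both = keep_measured(points, ["DOXY", "BBP700"])
smallest = keep_measured(points, "all")
```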

and of course, we can combine them:

DataFetcher(ds='bgc', params='all', measured=None)  # Return the largest possible dataset
DataFetcher(ds='bgc', params='all', measured='all')  # Return the smallest possible dataset
# or
DataFetcher(ds='bgc', params='all', measured=['DOXY', 'BBP700'])  # Return all possible variables for points where DOXY and BBP700 are not NaNs

Erddap fetcher implementation

On the erddap fetcher side, I thus added an internal ArgoIndex instance.

Every time we access the fetcher uri property, the erddap fetcher internally calls the _minimal_vlist property. This property is generated on the fly and returns the list of variables to retrieve from the erddap. This list used to be hard-coded, but that is not possible for BGC.

When the fetcher is instantiated with the params='all' argument, we use the ArgoIndex to get the exact list of Argo parameters to retrieve, which is given by all variables found in the parameters column of the index file (the index is of course searched using the fetcher access point information: a region, floats or profiles).

When the fetcher is instantiated with the measured argument, we add more constraints to the erddap url request: the BGC+core parameters must not be NaNs. The list of parameters under such a constraint is based on the user input, or on the ArgoIndex census if the keyword 'all' was used (it is overwritten by params values if necessary).
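The mechanism described above can be sketched with a toy class: a variable list computed on the fly from an index census, and a tabledap-style URL where each measured variable adds a `!=NaN` constraint. The class name, server URL and dataset id are hypothetical, the census is passed in by hand instead of coming from an ArgoIndex search, and the real fetcher handles many more details (variable naming, URL encoding, access point constraints).

```python
class ToyErddapFetcher:
    """Toy stand-in for an erddap data fetcher (hypothetical, not argopy's class)."""

    CORE = ["PRES", "TEMP", "PSAL"]

    def __init__(self, server, dataset_id, census, params="all", measured=None):
        self.server = server.rstrip("/")
        self.dataset_id = dataset_id
        self.census = list(census)  # variables an index search reported for the access point
        self.params = params
        self.measured = measured

    def _as_list(self, spec):
        """Turn None, 'all', a name or a list of names into a list of names."""
        if spec is None:
            return []
        if spec == "all":
            return list(self.census)
        return [spec] if isinstance(spec, str) else list(spec)

    @property
    def _minimal_vlist(self):
        """Variables to request, rebuilt at each access: core first, then params."""
        wanted = self._as_list(self.params)
        return self.CORE + [v for v in wanted if v not in self.CORE]

    @property
    def uri(self):
        """Request URL: variable list followed by one not-NaN constraint per measured variable."""
        query = ",".join(self._minimal_vlist)
        for name in self._as_list(self.measured):
            query += "&%s!=NaN" % name
        return "%s/tabledap/%s.nc?%s" % (self.server, self.dataset_id, query)


fetcher = ToyErddapFetcher(
    "https://erddap.example.org/erddap",  # hypothetical server
    "toy-bgc-dataset",                    # hypothetical dataset id
    census=["PRES", "TEMP", "PSAL", "DOXY", "BBP700"],
    params="all",
    measured=["DOXY"],
)
url = fetcher.uri
```

Keeping the variable list as a property rather than a stored attribute means that a change in the census or in params is reflected in the next URL without any extra bookkeeping.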

gmaze commented 1 year ago

Just a word to give an update about this PR: