gmaze commented 1 year ago

Temporary set-up

Clone this repo with git
Checkout this branch imev-bgc

In your notebook or at the beginning of a script, import argopy in dev mode for this branch:

# Importing argopy in dev mode:
import sys
sys.path.insert(0, "/Users/gmaze/git/github/euroargodev/argopy")  # Update with your own local path
import argopy

New features developed

[x] ArgoIndex.search_parameter_data_mode method
[x] Improved ArgoIndex.search_params: parameters are now matched exactly, not merely as a substring occurrence. This is slower than before, because the strings in the parameters column must be split and each parameter searched for in the resulting list.
[x] Add wmo and cycle number in the output of ArgoIndex.to_dataframe
[x] Update the list of float models and institutions populating the ArgoIndex.to_dataframe output using NVS tables instead of fixed pickle file assets. Update the load_dict utility function to fetch data from the Argo Reference Table server (NVS) https://github.com/euroargodev/argopy/pull/278/commits/0b4c817d625d8f590956852851c19b62fb30981f
[x] Erddap fetcher. At this point, we must request a list of parameters, but we can't know which parameters will be empty, so we can't make a minimal request restricted to points having all requested variables. How can we optimise this to reduce the list of requested parameters to those available in each profile? Solved by combining information retrieved from the ArgoIndex of S files. See below.
[x] Erddap fetcher. We can't get the PARAMETER_DATA_MODE from the erddap, so we must combine erddap results with info from the S-profiles ArgoIndex. This works but is very slow; we need to find an optimized solution. Solution in https://github.com/euroargodev/argopy/pull/278/commits/c1cb2e78bb9dbef7114bfa0b2498f87f5ee154a6
[x] Determine the DATA_MODE/QC_FLAG workflow for each BGC variable to be used in standard mode, see #280
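
To illustrate the search_params change listed above, here is a minimal sketch of the difference between a substring search and an exact search on split parameter names. The rows and the two helper functions are hypothetical and only mimic the space-separated parameters column; this is not the argopy implementation.

```python
# Hypothetical rows mimicking the space-separated "parameters" column of the index
rows = [
    "PRES TEMP PSAL DOXY",
    "PRES TEMP PSAL DOXY2",
    "PRES TEMP PSAL BBP700",
]


def substring_match(rows, param):
    """Substring search: also hits names that merely contain `param`."""
    return [r for r in rows if param in r]


def token_match(rows, param):
    """Exact search: split each string, then look for `param` in the list."""
    return [r for r in rows if param in r.split()]


substring_hits = substring_match(rows, "DOXY")
token_hits = token_match(rows, "DOXY")
print(len(substring_hits), len(token_hits))
```

The substring version also selects the row that only holds DOXY2, which is why the exact version has to pay for the split.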

Post trip todo list before merge

[x] Implement ArgoIndex.search_parameter_data_mode method in the pyarrow backend
[x] Update ArgoIndex.search_params method for the pyarrow backend
[x] Add CI tests for new methods (partial support here, full in another PR)
[x] Update the documentation (simply update the readme here, full update in #285)

More things to look at

[ ] Add the possibility to chain multiple Index search methods, for instance: ArgoIndex().search_params().search_tim()
[ ] Expose in utilities a function to easily parallelize a diagnostic to be executed on each profile. The piece of code is already in groupby_remap and linear_interpolation_remap
[ ] Add new methods to the xarray argo accessor to compute additional BGC variables
[ ] Fetch parameter data mode from the EA webAPI instead of the ArgoIndex: is it faster?
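
For the chaining item in the list above, one possible design is to have each search method narrow the current selection and return the instance itself. The sketch below uses a hypothetical ToyIndex class with made-up records and a made-up search_year method; it only shows the pattern and says nothing about how ArgoIndex will actually do it.

```python
class ToyIndex:
    """Hypothetical stand-in for an index store (not the argopy class)."""

    def __init__(self, records):
        self.records = list(records)
        self.selection = list(records)  # current search result

    def search_params(self, param):
        # Narrow the current selection, then return self to allow chaining
        self.selection = [
            r for r in self.selection if param in r["parameters"].split()
        ]
        return self

    def search_year(self, year):
        # Hypothetical time search: keep records dated in the given year
        self.selection = [
            r for r in self.selection if r["date"].startswith(str(year))
        ]
        return self

    @property
    def N_MATCH(self):
        return len(self.selection)


records = [
    {"parameters": "PRES DOXY", "date": "2023-02-01"},
    {"parameters": "PRES DOXY", "date": "2022-02-01"},
    {"parameters": "PRES BBP700", "date": "2023-03-01"},
]
n = ToyIndex(records).search_params("DOXY").search_year(2023).N_MATCH
print(n)
```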

gmaze commented 1 year ago

search BGC index by parameter data mode

from argopy.stores import indexstore_pd as ArgoIndex  # make sure to work with Pandas index store (like in Raphaelle code)
idx = ArgoIndex(index_file='argo_bio-profile_index.txt', cache=True).load()
idx.search_parameter_data_mode({'BBP700': 'D'})
idx.search_parameter_data_mode({'BBP700': 'D', 'DOXY': 'D'}, logical='or')
idx.search_parameter_data_mode({'DOXY': ['R', 'A']})
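
To make the semantics of these calls explicit, here is a minimal pure-Python sketch of the matching logic. It assumes that the parameter_data_mode string is aligned character by character with the split parameters column, and that the default for logical is 'and'. The helper names and records are hypothetical; this is not the method's actual code.

```python
def parameter_data_mode(record, param):
    """Data mode of `param` in one index record, '' if the parameter is absent."""
    variables = record["parameters"].split()
    if param not in variables:
        return ""
    return record["parameter_data_mode"][variables.index(param)]


def matches(record, criteria, logical="and"):
    """True if the record satisfies the {param: mode(s)} criteria."""
    tests = []
    for param, modes in criteria.items():
        allowed = [modes] if isinstance(modes, str) else list(modes)
        tests.append(parameter_data_mode(record, param) in allowed)
    return all(tests) if logical == "and" else any(tests)


# Hypothetical index records
records = [
    {"parameters": "PRES TEMP PSAL DOXY BBP700", "parameter_data_mode": "RRRDA"},
    {"parameters": "PRES TEMP PSAL DOXY", "parameter_data_mode": "RRRA"},
    {"parameters": "PRES TEMP PSAL BBP700", "parameter_data_mode": "DDDD"},
]

both = {"BBP700": "D", "DOXY": "D"}
n_and = sum(matches(r, both) for r in records)
n_or = sum(matches(r, both, logical="or") for r in records)
n_ra = sum(matches(r, {"DOXY": ["R", "A"]}) for r in records)
print(n_and, n_or, n_ra)
```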

todo:

[x] add similar code to pyarrow backend

gmaze commented 1 year ago

Check search_parameter_data_mode output:

idx = ArgoIndex(index_file='argo_bio-profile_index.txt').load()
<argoindex.pandas>
Host: https://data-argo.ifremer.fr/
Index: argo_bio-profile_index.txt
Convention: argo_bio-profile_index (Bio-Profile directory file of the Argo GDAC)
Loaded: True (286046 records)
Searched: False

# param = 'BBP470'
param = 'DOXY'
n = []
for dm in ['R', 'A', 'D', ' ', '']: 
    # Blank string is where no data mode is found
    # Empty string will count profiles without the parameter
    n.append(idx.search_parameter_data_mode({param: dm}).N_MATCH)
    print("mode='%s', N=%i" % (dm, n[-1]))
mode='R', N=23822
mode='A', N=74357
mode='D', N=174362
mode=' ', N=468
mode='', N=13063

# Number of profiles with this PARAMETER (and decomposition by data mode):
import numpy as np  # needed for np.sum below
idx.search_params(param).N_MATCH, np.sum(n[0:-1])
(273009, 273009)

# Check that we found all expected data modes:
idx.N_RECORDS, np.sum(n)
(286072, 286072)

gmaze commented 1 year ago

Scatter maps with one BGC variable data mode

# Load the index of synthetic profiles (works with the B index as well):
idx = ArgoIndex(index_file='argo_synthetic-profile_index.txt').load()

# Define a parameter to work with:
# param = 'BBP470'
param = 'DOXY'
# param = 'BBP700'

# Search parameter profiles:
idx.search_params(param)

# Then add a search in time (just to make a readable map):
idx.index = idx.search  # Trick to be able to chain multiple search methods with a single idx instance
idx.search_tim([-180,180,-90,90,'2023-01','2023-07'])

# Export the index dataframe:
df = idx.to_dataframe()

# To make the data mode plot, we need the data mode in a single column,
# so we add a new column with the DATA_MODE of the PARAMETER:
df["variables"] = df["parameters"].apply(lambda x: x.split())
df["%s_DM" % param] = df.apply(lambda x: x['parameter_data_mode'][x['variables'].index(param)] if param in x['variables'] else '', axis=1)

# Finally plot the map:
from argopy.plot import scatter_map
scatter_map(df,
            hue="%s_DM" % param,
            cmap="data_mode",
            figsize=(10,6),
            markersize=2,
            markeredgecolor=None,
            traj=False,  # Because some floats do weird things around 180/-180
            set_global=False,
            legend_title='%s data mode' % param)

[Screenshot, 2023-06-08: scatter map of profiles colored by parameter data mode]

gmaze commented 1 year ago

DataFetcher new arguments

In order to handle the large number of BGC variables (120!), we must find a way to be specific about what we want and need. So in this PR, we experiment with the following:

The params argument. Use it to specify which variables will be returned, regardless of their values or availability in some of the floats returned for the access point.

DataFetcher(ds='bgc')  # All variables found in the access point will be returned
DataFetcher(ds='bgc', params='all')  # All variables found in the access point will be returned
DataFetcher(ds='bgc', params='DOXY') # Only the DOXY variable will be returned
DataFetcher(ds='bgc', params=['DOXY', 'BBP700']) # Only DOXY and BBP700 will be returned
DataFetcher(ds='bgc', params=['*DOXY*', '*BBP*']) # All variables with DOXY or BBP in their name will be returned

The core parameters PRES, TEMP and PSAL will always be returned
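
As an illustration of how a params value could be resolved against the variables found in the access point, here is a minimal sketch based on shell-style wildcard matching from the standard library. The resolve_params helper and the list of available variables are hypothetical, and the fetcher may resolve wildcards differently.

```python
from fnmatch import fnmatchcase

CORE = ["PRES", "TEMP", "PSAL"]  # always returned


def resolve_params(params, available):
    """Hypothetical helper: expand `params` into an explicit variable list."""
    if params is None or params == "all":
        selected = list(available)
    else:
        patterns = [params] if isinstance(params, str) else list(params)
        selected = [
            v for v in available if any(fnmatchcase(v, p) for p in patterns)
        ]
    # Core parameters come first, without duplicates
    return CORE + [v for v in selected if v not in CORE]


# Hypothetical list of variables found in the access point
available = ["PRES", "TEMP", "PSAL", "DOXY", "DOXY2", "TEMP_DOXY",
             "BBP700", "BBP470", "CHLA"]

print(resolve_params("DOXY", available))
print(resolve_params(["*DOXY*", "*BBP*"], available))
```

Without wildcards a pattern only matches the exact name, so 'DOXY' does not pull in DOXY2 or TEMP_DOXY.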

The measured argument. Use it to specify which variables cannot be NaN and must have values. This is very useful to reduce a dataset to points where all variables are available.

DataFetcher(ds='bgc', measured='all')  # All variables found in the access point cannot be NaNs
DataFetcher(ds='bgc', measured='DOXY') # Only the DOXY cannot be NaNs
DataFetcher(ds='bgc', measured=['DOXY', 'BBP700']) # Only DOXY and BBP700 cannot be NaNs
DataFetcher(ds='bgc', measured=None)  # No variable is required to be non-NaN, i.e. all variables are allowed to have NaNs
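
The effect of measured on a set of points can be sketched as a simple NaN filter. The filter_measured helper and the points below are hypothetical and work on plain dictionaries; in the fetcher, the constraint is applied to the data request itself.

```python
import math


def is_nan(value):
    return isinstance(value, float) and math.isnan(value)


def filter_measured(points, measured):
    """Hypothetical helper: keep points where all `measured` variables have values."""
    if measured is None:
        return list(points)  # NaNs allowed everywhere
    if measured == "all":
        names = sorted({key for p in points for key in p})
    else:
        names = [measured] if isinstance(measured, str) else list(measured)
    # A variable missing from a point is treated like a NaN
    return [
        p for p in points
        if not any(is_nan(p.get(name, math.nan)) for name in names)
    ]


nan = math.nan
points = [
    {"PRES": 5.0, "DOXY": 210.0, "BBP700": 0.001},
    {"PRES": 10.0, "DOXY": nan, "BBP700": 0.002},
    {"PRES": 15.0, "DOXY": 205.0, "BBP700": nan},
]
for measured in [None, "DOXY", ["DOXY", "BBP700"], "all"]:
    print(measured, len(filter_measured(points, measured)))
```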

and of course, we can combine them:

DataFetcher(ds='bgc', params='all', measured=None)  # Return the largest possible dataset
DataFetcher(ds='bgc', params='all', measured='all')  # Return the smallest possible dataset
# or
DataFetcher(ds='bgc', params='all', measured=['DOXY', 'BBP700'])  # Return all possible variables at points where DOXY and BBP700 are not NaNs

Erddap fetcher implementation

In the erddap fetcher, I therefore added an internal ArgoIndex instance.

Every time we call the fetcher uri property, the erddap fetcher internally calls the _minimal_vlist property. This property is generated on the fly and returns the list of variables to retrieve from the erddap. This list used to be hard-coded, but that is not possible for BGC.

When the fetcher is instantiated with the params='all' argument, we use the ArgoIndex to get the exact list of Argo parameters to retrieve, which is given by all variables found in the parameters column of the index file (obviously, the index is searched using the fetcher access point information, for a region, floats, profiles).

When the fetcher is instantiated with the measured argument, we add more constraints on the erddap url request. These new constraints are that the BGC+core parameters must not be NaNs. The list of parameters under such a constraint is based on the user input or the ArgoIndex census if the keyword all was used (it is overwritten by params values if necessary).
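
The logic of the last two paragraphs can be sketched as follows. This is a simplified illustration and not the fetcher code: the census and build_request helpers are hypothetical, the core parameters are not forced into the list, constraints are only built for names that are also requested, and the lowercase "name!=NaN" constraint form is an assumption about how the erddap request is written.

```python
def census(index_rows):
    """Union of parameter names found in the index rows, in first-seen order."""
    seen = []
    for row in index_rows:
        for name in row.split():
            if name not in seen:
                seen.append(name)
    return seen


def build_request(index_rows, params="all", measured=None):
    """Hypothetical helper: variables to request and not-NaN constraints."""
    available = census(index_rows)
    if params == "all":
        variables = list(available)
    else:
        wanted = [params] if isinstance(params, str) else list(params)
        variables = [v for v in available if v in wanted]
    if measured is None:
        constrained = []
    elif measured == "all":
        constrained = list(variables)
    else:
        wanted = [measured] if isinstance(measured, str) else list(measured)
        constrained = [v for v in variables if v in wanted]
    # Assumed constraint syntax, one entry per constrained variable
    constraints = ["%s!=NaN" % v.lower() for v in constrained]
    return {"variables": variables, "constraints": constraints}


# Hypothetical "parameters" strings from an index search on the access point
index_rows = ["PRES TEMP PSAL DOXY", "PRES TEMP PSAL DOXY BBP700"]
req = build_request(index_rows, params="all", measured=["DOXY", "BBP700"])
print(req["variables"])
print(req["constraints"])
```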

gmaze commented 1 year ago

Just a word to give an update about this PR:

This is taking a while to move forward because I faced a lot of issues when moving this to prod.
Right now, I'm working on optimizing data fetching and variable management.
I had to develop quite a few methods with a larger-than-BGC scope to face BGC challenges (hello, custom thread pool dashboard).

euroargodev / argopy

New functions and methods dedicated to ARGO-BGC, from work at IMEV #278
