Closed gmaze closed 1 year ago
import numpy as np
from argopy.stores import indexstore_pd as ArgoIndex  # Make sure to work with the Pandas index store (like in Raphaelle's code)
idx = ArgoIndex(index_file='argo_bio-profile_index.txt', cache=True).load()
idx.search_parameter_data_mode({'BBP700': 'D'})
idx.search_parameter_data_mode({'BBP700': 'D', 'DOXY': 'D'}, logical='or')
idx.search_parameter_data_mode({'DOXY': ['R', 'A']})
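The three calls above differ only in how the per-parameter tests are combined. The sketch below illustrates that logic on toy records; it is not the argopy implementation. The helper names data_mode_of and search_data_mode are hypothetical, the default logical='and' is an assumption, and the records only mimic the parameters and parameter_data_mode columns of the index.

```python
# Toy index records: 'parameters' is space-separated, and the i-th character
# of 'parameter_data_mode' is the data mode of the i-th parameter.
records = [
    {"file": "a.nc", "parameters": "PRES DOXY BBP700", "parameter_data_mode": "RDD"},
    {"file": "b.nc", "parameters": "PRES DOXY", "parameter_data_mode": "RA"},
    {"file": "c.nc", "parameters": "PRES BBP700", "parameter_data_mode": "RD"},
    {"file": "d.nc", "parameters": "PRES DOXY BBP700", "parameter_data_mode": "RDR"},
]


def data_mode_of(record, param):
    """Return the data mode of `param` in a record, or '' if it is absent."""
    names = record["parameters"].split()
    if param not in names:
        return ""
    return record["parameter_data_mode"][names.index(param)]


def search_data_mode(records, criteria, logical="and"):
    """Hypothetical helper: keep records matching {param: mode or [modes]}."""
    combine = all if logical == "and" else any
    selected = []
    for rec in records:
        checks = []
        for param, modes in criteria.items():
            # Accept a single mode ('D') or a list of modes (['R', 'A'])
            modes = [modes] if isinstance(modes, str) else list(modes)
            checks.append(data_mode_of(rec, param) in modes)
        if combine(checks):
            selected.append(rec["file"])
    return selected


both = search_data_mode(records, {"BBP700": "D", "DOXY": "D"})
either = search_data_mode(records, {"BBP700": "D", "DOXY": "D"}, logical="or")
real_or_adjusted = search_data_mode(records, {"DOXY": ["R", "A"]})
```

With these toy records, both keeps only a.nc, either keeps a.nc, c.nc and d.nc, and real_or_adjusted keeps only b.nc.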
idx = ArgoIndex(index_file='argo_bio-profile_index.txt').load()
<argoindex.pandas>
Host: https://data-argo.ifremer.fr/
Index: argo_bio-profile_index.txt
Convention: argo_bio-profile_index (Bio-Profile directory file of the Argo GDAC)
Loaded: True (286046 records)
Searched: False
# param = 'BBP470'
param = 'DOXY'
n = []
for dm in ['R', 'A', 'D', ' ', '']:
    # A blank string (' ') matches profiles where no data mode is found
    # An empty string ('') counts profiles without the parameter
    n.append(idx.search_parameter_data_mode({param: dm}).N_MATCH)
    print("mode='%s', N=%i" % (dm, n[-1]))
mode='R', N=23822
mode='A', N=74357
mode='D', N=174362
mode=' ', N=468
mode='', N=13063
# Number of profiles with this PARAMETER (and decomposition by data mode):
idx.search_params(param).N_MATCH, np.sum(n[0:-1])
(273009, 273009)
# Check that we found all expected data modes:
idx.N_RECORDS, np.sum(n)
(286072, 286072)
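The two consistency checks above can be reproduced offline on a toy index. This is only a sketch: the tuples stand in for the parameters and parameter_data_mode columns, and the helper mode_for is hypothetical.

```python
from collections import Counter

# (parameters, parameter_data_mode) pairs standing in for index rows
records = [
    ("PRES DOXY", "RD"),
    ("PRES DOXY", "RA"),
    ("PRES DOXY BBP700", "RR "),
    ("PRES BBP700", "RD"),
    ("PRES DOXY", "R "),
]
param = "DOXY"


def mode_for(parameters, modes, param):
    """Data mode of `param` for one row, or '' when the row lacks it."""
    names = parameters.split()
    return modes[names.index(param)] if param in names else ""


# Count rows per data mode; the '' key counts rows without the parameter
census = Counter(mode_for(p, m, param) for p, m in records)

# Rows holding the parameter are all rows except those counted under ''
n_with_param = sum(n for mode, n in census.items() if mode != "")
n_total = sum(census.values())
```

Here n_with_param plays the role of search_params(param).N_MATCH and n_total the role of N_RECORDS: every row falls in exactly one data-mode bin, so the bins must add up to the total.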
# Load the index of synthetic profiles (work with B index as well):
idx = ArgoIndex(index_file='argo_synthetic-profile_index.txt').load()
# Define a parameter to work with:
# param = 'BBP470'
param = 'DOXY'
# param = 'BBP700'
# Search parameter profiles:
idx.search_params(param)
# Then add a search in time (just to make a readable map):
idx.index = idx.search # Trick to be able to chain multiple search methods with a single idx instance
idx.search_tim([-180,180,-90,90,'2023-01','2023-07'])
# Export the index dataframe:
df = idx.to_dataframe()
# To make the data mode plot, we need the data mode in a single column,
# so we add a new column holding the DATA_MODE of the PARAMETER
df["variables"] = df["parameters"].apply(lambda x: x.split())
df["%s_DM" % param] = df.apply(lambda x: x['parameter_data_mode'][x['variables'].index(param)] if param in x['variables'] else '', axis=1)
# Finally plot the map:
from argopy.plot import scatter_map
scatter_map(df,
            hue="%s_DM" % param,
            cmap="data_mode",
            figsize=(10,6),
            markersize=2,
            markeredgecolor=None,
            traj=False, # Because some floats do weird things around 180/-180
            set_global=False,
            legend_title='%s data mode' % param)
To handle the large number of BGC variables (120!), we must find a way to be specific about what we want and need. So in this PR, we experiment with the following:
The params argument. Use it to specify which variables will be returned, whatever their values or availability in some of the floats returned by the access point.
DataFetcher(ds='bgc') # All variables found in the access point will be returned
DataFetcher(ds='bgc', params='all') # All variables found in the access point will be returned
DataFetcher(ds='bgc', params='DOXY') # Only the DOXY variable will be returned
DataFetcher(ds='bgc', params=['DOXY', 'BBP700']) # Only DOXY and BBP700 will be returned
DataFetcher(ds='bgc', params=['*DOXY*', '*BBP*']) # All variables with DOXY or BBP in their name will be returned
The core parameters PRES, TEMP and PSAL will always be returned.
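As an illustration of how a params value could be resolved into a variable list, here is a minimal sketch using shell-style wildcards from the standard library. The function resolve_params and the available list are hypothetical, and argopy's own matching rules may differ.

```python
from fnmatch import fnmatchcase

# Core parameters are always returned, whatever the user asks for
CORE = ["PRES", "TEMP", "PSAL"]


def resolve_params(params, available):
    """Hypothetical helper: turn a `params` value into a list of variables."""
    if params is None or params == "all":
        wanted = list(available)
    else:
        # Accept a single name/pattern or a list of them
        patterns = [params] if isinstance(params, str) else list(params)
        wanted = [v for v in available
                  if any(fnmatchcase(v, p) for p in patterns)]
    # Core parameters first, then the selection in its original order
    return CORE + [v for v in wanted if v not in CORE]


available = ["PRES", "TEMP", "PSAL", "DOXY", "DOXY2", "BBP700", "BBP470", "CHLA"]

only_doxy = resolve_params("DOXY", available)
wildcards = resolve_params(["*DOXY*", "*BBP*"], available)
everything = resolve_params("all", available)
```

Note that 'DOXY' without wildcards selects DOXY only, while '*DOXY*' also picks up DOXY2.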
The measured argument. Use it to specify which variables cannot be NaN and must return values. This is very useful to reduce a dataset to the points where all the listed variables are available.
DataFetcher(ds='bgc', measured='all') # All variables found in the access point cannot be NaNs
DataFetcher(ds='bgc', measured='DOXY') # Only DOXY cannot be NaN
DataFetcher(ds='bgc', measured=['DOXY', 'BBP700']) # Only DOXY and BBP700 cannot be NaN
DataFetcher(ds='bgc', measured=None) # No variable is required to be non-NaN, i.e. all variables are allowed to have NaNs
and of course, we can combine them:
DataFetcher(ds='bgc', params='all', measured=None) # Return the largest possible dataset
DataFetcher(ds='bgc', params='all', measured='all') # Return the smallest possible dataset
# or
DataFetcher(ds='bgc', params='all', measured=['DOXY', 'BBP700']) # Return all possible variables, for points where DOXY and BBP700 are not NaNs
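The effect of measured on the number of returned points can be sketched on a handful of toy measurements. The function filter_measured is hypothetical and works on plain dictionaries; it only mimics the intended semantics, not the way the fetcher applies them.

```python
import math

# Toy measurement points; NaN marks a missing value
points = [
    {"PRES": 5.0, "TEMP": 12.1, "DOXY": 210.0, "BBP700": 0.002},
    {"PRES": 10.0, "TEMP": 12.0, "DOXY": float("nan"), "BBP700": 0.002},
    {"PRES": 15.0, "TEMP": 11.8, "DOXY": 205.0, "BBP700": float("nan")},
    {"PRES": 20.0, "TEMP": float("nan"), "DOXY": 200.0, "BBP700": 0.001},
]


def filter_measured(points, measured=None):
    """Hypothetical helper: keep points where the `measured` variables are not NaN."""
    if measured is None:
        return list(points)  # no constraint: NaNs allowed everywhere
    if measured == "all":
        required = list(points[0].keys()) if points else []
    else:
        required = [measured] if isinstance(measured, str) else list(measured)
    return [pt for pt in points
            if not any(math.isnan(pt[v]) for v in required)]


n_none = len(filter_measured(points))                       # largest dataset
n_doxy = len(filter_measured(points, "DOXY"))
n_both = len(filter_measured(points, ["DOXY", "BBP700"]))
n_all = len(filter_measured(points, "all"))                 # smallest dataset
```

On this toy sample the dataset shrinks from 4 points (measured=None) to 3, 2 and finally 1 point (measured='all') as more variables are required.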
On the erddap fetcher, I thus added one internal instance of the ArgoIndex fetcher.
Every time we call the fetcher uri property, the erddap fetcher internally calls the _minimal_vlist property.
This property is generated on the fly and returns the list of variables to retrieve from the erddap. This list used to be hard-coded, but this is not possible for BGC.
When the fetcher is instantiated with the params='all' argument, we use the ArgoIndex to get the exact list of Argo parameters to retrieve, which is given by all variables found in the parameters column of the index file (obviously, the index is searched using the fetcher access point information, for a region, floats or profiles).
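The census of parameters described above amounts to taking the union of all names found in the parameters column. A minimal sketch, assuming the column has already been extracted as a list of strings for the searched profiles (the function parameter_census is hypothetical):

```python
def parameter_census(parameter_strings):
    """Hypothetical helper: unique parameter names, in order of first appearance."""
    seen = {}  # dicts preserve insertion order, so this acts as an ordered set
    for entry in parameter_strings:
        for name in entry.split():
            seen.setdefault(name, None)
    return list(seen)


# 'parameters' column values for the profiles matching the access point
index_parameters = [
    "PRES TEMP PSAL DOXY",
    "PRES TEMP PSAL BBP700 CHLA",
    "PRES TEMP PSAL DOXY CHLA",
]

vlist = parameter_census(index_parameters)
```

Here vlist contains each parameter once, even though no single profile carries all of them.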
When the fetcher is instantiated with the measured argument, we add more constraints on the erddap url request. These new constraints are that the BGC+core parameters must not be NaNs. The list of parameters under such a constraint is based on the user input, or on the ArgoIndex census if the keyword all was used (it is overwritten by params values if necessary).
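To make the extra url constraints concrete, here is a sketch of how they could be appended to a request. It assumes the ERDDAP tabledap `!=NaN` test and lower-case variable names on the server side; the function measured_constraints is hypothetical and is not the fetcher's actual code.

```python
def measured_constraints(measured, census):
    """Hypothetical helper: build the 'must not be NaN' part of an erddap query."""
    if measured is None:
        return ""  # no extra constraint
    if measured == "all":
        required = list(census)  # fall back on the ArgoIndex census
    else:
        required = [measured] if isinstance(measured, str) else list(measured)
    # Assumption: erddap variable names are the lower-case Argo parameter names
    return "".join("&%s!=NaN" % name.lower() for name in required)


census = ["PRES", "TEMP", "PSAL", "DOXY", "BBP700"]

user_defined = measured_constraints(["DOXY", "BBP700"], census)
from_census = measured_constraints("all", census)
unconstrained = measured_constraints(None, census)
```

The resulting string would be appended to the other constraints (region, time, float or profile selection) already present in the request url.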
Just a word to give an update about this PR:

Temporary set-up
imev-bgc

New features developed
ArgoIndex.search_parameter_data_mode method
ArgoIndex.search_params: parameters are now properly searched, not only by substring occurrence. This is slower than before, because strings in the parameters column now have to be split and the params searched in the resulting list.
ArgoIndex.to_dataframe
ArgoIndex.to_dataframe output using NVS tables instead of fixed pickle file assets. Update the load_dict utility function to fetch data from the Argo Reference Table server (NVS): https://github.com/euroargodev/argopy/pull/278/commits/0b4c817d625d8f590956852851c19b62fb30981f
standard mode, check #280

Post trip todo list before merge
ArgoIndex.search_parameter_data_mode method in the pyarrow backend
ArgoIndex.search_params method for the pyarrow backend

More things to look at
ArgoIndex().search_params().search_tim()
groupby_remap and linear_interpolation_remap
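The difference between a substring search and a proper parameter search, mentioned in the update on search_params, can be illustrated on a toy parameters column. Both functions below are hypothetical and return row positions; they only show why splitting the strings is needed, and why it costs more.

```python
# Toy 'parameters' column of an index file
rows = [
    "PRES TEMP PSAL DOXY",
    "PRES TEMP PSAL DOXY2",
    "PRES TEMP PSAL BBP700",
    "PRES TEMP PSAL DOXY DOXY2",
]


def search_substring(rows, param):
    """Fast, but 'DOXY' also matches rows that only contain 'DOXY2'."""
    return [i for i, entry in enumerate(rows) if param in entry]


def search_exact(rows, param):
    """Slower: each string is split so `param` must equal a whole name."""
    return [i for i, entry in enumerate(rows) if param in entry.split()]


loose = search_substring(rows, "DOXY")
strict = search_exact(rows, "DOXY")
```

The substring search wrongly returns row 1, which holds DOXY2 but not DOXY, while the exact search returns rows 0 and 3 only.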