euroargodev / argopy

A python library for Argo data beginners and experts
https://argopy.readthedocs.io

502 error downloading data #369

Closed mikestaub closed 2 months ago

mikestaub commented 2 months ago

When I try to download the global data, I get a 502 error. When I make the ArgoDataFetcher region smaller, the code works as expected. Any idea why?

MCVE Code Sample

import pandas as pd
import matplotlib.pyplot as plt
from argopy import DataFetcher as ArgoDataFetcher
import logging
from datetime import datetime, timedelta
import time

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def fetch_argo_data(start_date, end_date, retries=3, delay=5):
    """ Fetch Argo float data for the given date range with retries from USGODAE ERDDAP server. """
    for i in range(retries):
        try:
            logging.info(f"Fetching Argo data from {start_date} to {end_date}")
            ds = ArgoDataFetcher().region([-180, 180, -90, 90, 0, 1000, start_date, end_date]).load().data
            # the following region works:
            # ds = ArgoDataFetcher().region([-75, -45, 20, 30, 0, 10, start_date, end_date]).load().data
            logging.info("Data fetching complete")
            return ds
        except Exception as e:
            logging.error(f"Error fetching data: {e}")
            if i < retries - 1:  # i is zero indexed
                logging.info(f"Retrying in {delay} seconds...")
                time.sleep(delay)
            else:
                raise

def process_data(ds):
    """ Process the data to calculate average ocean temperature. """
    logging.info("Processing data...")
    # Convert to DataFrame for easier manipulation
    df = ds.to_dataframe()

    # Print the columns of the DataFrame for debugging
    logging.info(f"Available columns: {df.columns}")

    # Check if required columns are present
    required_columns = ['TEMP', 'TIME']
    missing_columns = [col for col in required_columns if col not in df.columns]
    if missing_columns:
        logging.error(f"Missing columns: {missing_columns}")
        return None

    # Filter for temperature data
    df_temp = df[['TEMP', 'TIME']]

    # Group by date and calculate average temperature
    avg_temp = df_temp.groupby('TIME').mean()['TEMP']
    logging.info("Data processing complete")
    return avg_temp

def plot_data(avg_temp):
    """ Plot the average ocean temperature over time. """
    logging.info("Generating plot...")
    plt.figure(figsize=(10, 6))
    avg_temp.plot(title='Average Ocean Temperature Over Time')
    plt.xlabel('Date')
    plt.ylabel('Temperature (°C)')
    plt.grid(True)
    plt.show()
    logging.info("Plot generated")

def main():
    end_date = datetime.utcnow().strftime('%Y-%m-%d')
    start_date = (datetime.utcnow() - timedelta(days=2)).strftime('%Y-%m-%d')

    ds = fetch_argo_data(start_date, end_date)
    avg_temp = process_data(ds)
    plot_data(avg_temp)

if __name__ == "__main__":
    main()


Versions

Output of `argopy.show_versions()`: argopy 0.1.15
gmaze commented 2 months ago

Hi @mikestaub, it's probably the Ifremer ERDDAP server that is not able to handle such a large data request.

Solution 1: did you try the parallel=True option in the data fetcher? It will chunk the domain into pieces to reduce the size of each ERDDAP request; this should allow the download to go through.

(you can also add progress=True to monitor what's happening)
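
A minimal sketch of this against your MCVE (the dates are placeholders for your computed start/end dates):

from argopy import DataFetcher as ArgoDataFetcher

# Same global request, but split into smaller ERDDAP requests fetched in
# parallel, with a progress bar to monitor the chunks:
ds = (
    ArgoDataFetcher(parallel=True, progress=True)
    .region([-180, 180, -90, 90, 0, 1000, '2024-06-01', '2024-06-03'])
    .load()
    .data
)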

Solution 2: try the argovis data source; it may be able to handle such a request.
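
That is a one-line change from the sketch above (argovis may still enforce its own payload limits):

from argopy import DataFetcher as ArgoDataFetcher

# src='argovis' routes the request to the Argovis API instead of the Ifremer ERDDAP:
ds = ArgoDataFetcher(src='argovis').region([-180, 180, -90, 90, 0, 1000, '2024-06-01', '2024-06-03']).load().data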

Note: I see in your script that you compute the temperature average of the dataset. I don't know your level of Argo data knowledge, but remember that measurements are not equally distributed over pressure; most of the time, measurements are sparser at depth, meaning that each one represents a larger chunk of water than measurements near the surface. You can still use the argopy interpolator to project all measurements onto standard depth levels or bins, as sketched below.
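
A minimal sketch of that projection, using the argopy xarray accessor (the regular 10 dbar levels are an arbitrary choice for illustration):

import numpy as np

# interp_std_levels() expects a collection of profiles, so convert the
# default point dataset first, then project onto regular pressure levels:
ds_profiles = ds.argo.point2profile()
ds_interp = ds_profiles.argo.interp_std_levels(np.arange(0., 1000., 10.))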

mikestaub commented 2 months ago

Thanks! I am new to the project and your comment is super helpful. I am unblocked now and will continue playing with the data.

slowpokeeh commented 2 months ago

Hey @gmaze,

I have the same behaviour, but I already tried using parallel=True, and I also tried manually setting different chunk sizes (sketched after the snippet below). I got the same behaviour for argovis. Or am I getting something wrong?

Thank you in advance!


import pandas as pd

from argopy import DataFetcher, set_options

regions = [
    [-69, -10, 12, 59, 0, 10, '2021-06', '2022-06']
]

all_data = pd.DataFrame()
with set_options(dataset='phy', src='argovis', mode='standard', api_timeout=300):
    params = 'all'  # Define your parameters here
    fetcher = DataFetcher(params=params, parallel=True, progress=True).region(regions[0])
    print("here3")
    for uri in fetcher.uri:
        print("http: ... ", "&".join(uri.split("&")[1:-2]))  # Display only the relevant part of each URLs of URI:
    for region in regions:
        try:
            f = fetcher.region(region)
            print("fetching region")
            print(region)
            data = f.to_xarray().to_dataframe().reset_index()  # Convert to DataFrame
            all_data = pd.concat([all_data, data], ignore_index=True)
        except FileNotFoundError as e:
            print(f"FileNotFoundError for region {region}: {e}")
all_data.info()
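
For reference, "manually setting different chunk sizes" was along these lines (a sketch; the chunks_maxsize keys and values are illustrative, not tuned recommendations):

fetcher = DataFetcher(
    parallel=True,
    progress=True,
    # Cap the size of each chunk sent to the server
    # (lon/lat in degrees, dpt in db, time in days):
    chunks_maxsize={'lon': 20, 'lat': 20, 'dpt': 500, 'time': 90},
).region(regions[0])
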
slowpokeeh commented 2 months ago

It's me again... I was too fast. It worked with argovis, but chunking with ERDDAP still didn't lead to success. Thanks for your work, guys!

gmaze commented 2 months ago

Alright! I'm glad that using argovis solves the issue for your large-domain requests!