AMDA get_data never returns on long requests

Dolgalad commented 2 years ago

Space Physics WebServices Client version: 0.10.1
Python version: 3.8.10
Operating System: Ubuntu

Description

AMDA creates background jobs to handle requests that take too long to answer (timeout exceeded). Speasy is not notified when this happens, so when retrieving a large dataset the get_data method may never return: it waits indefinitely.

What I Did

import speasy as spz
param_id = "amda/solo_b_rtn_hr"
start = "2020/01/01T00:00:00"
stop = "2021/01/01T00:00:00"
p = spz.get_data(param_id, start, stop)

Solution

I modified the dl_parameter function in the speasy.webservices.amda._impl module:

import numpy as np
from datetime import datetime, timedelta
from typing import Optional
....
    def parameter_concat(self, param1, param2):
        """Concatenate parameters
        """
        if param1 is None and param2 is None:
            return None
        if param1 is None:
            return param2
        if param2 is None:
            return param1
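        # Append along the time axis (axis 0) so that multi-component data
        # of shape (N, k) stays aligned with the time vector
        param1.time = np.concatenate((param1.time, param2.time))
        param1.data = np.concatenate((param1.data, param2.data), axis=0)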
        return param1

    def dl_parameter(self, start_time: datetime, stop_time: datetime, parameter_id: str, **kwargs) -> Optional[
        SpeasyVariable]:
        if isinstance(start_time, datetime):
            start_time = start_time.timestamp()
        if isinstance(stop_time, datetime):
            stop_time = stop_time.timestamp()
        # Split requests longer than one day into one-day blocks, downloaded one at a time
        dt = timedelta(days=1).total_seconds()
        if stop_time - start_time > dt:
            var = None
            curr_t = start_time
            while curr_t < stop_time:
                # The last block is truncated so it never extends past stop_time
                block_stop = min(curr_t + dt, stop_time)
                var = self.parameter_concat(var, self.dl_parameter(curr_t, block_stop, parameter_id, **kwargs))
                curr_t += dt
            return var

        url = rest_client.get_parameter(
            startTime=start_time, stopTime=stop_time, parameterID=parameter_id, timeFormat='UNIXTIME',
            server_url=self.server_url, **kwargs)
        if url is not None:
            var = load_csv(url)
            if len(var):
                log.debug(
                    f'Loaded var: data shape = {var.values.shape}, data start time = {datetime.utcfromtimestamp(var.time[0])}, data stop time = {datetime.utcfromtimestamp(var.time[-1])}')
            else:
                log.debug('Loaded var: Empty var')
            return var
        return None

jeandet commented 2 years ago

Reading this issue makes me think about introducing a max_request_duration parameter, which could default to 1 day. Ideally we would cap the data size of each request, but that is really hard to estimate. Letting the user override this value makes it possible to increase it for really slow datasets, where a request spanning several days would be OK and likely more efficient.
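A minimal sketch of how such a parameter could drive the splitting, assuming a one-day default. The helper name split_time_range is hypothetical and not part of speasy; only the max_request_duration name comes from the comment above.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def split_time_range(start: datetime, stop: datetime,
                     max_request_duration: timedelta = timedelta(days=1)
                     ) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (block_start, block_stop) pairs covering [start, stop).

    Hypothetical helper: every block is at most max_request_duration long,
    and the last one is truncated at stop.
    """
    if max_request_duration <= timedelta(0):
        raise ValueError("max_request_duration must be positive")
    block_start = start
    while block_start < stop:
        block_stop = min(block_start + max_request_duration, stop)
        yield block_start, block_stop
        block_start = block_stop


# Two and a half days with the default one-day limit
blocks = list(split_time_range(datetime(2020, 1, 1), datetime(2020, 1, 3, 12)))
```

Passing a larger max_request_duration (for example timedelta(days=7)) would reduce the number of requests for datasets where a multi-day request is acceptable.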

brenard-irap commented 2 years ago

In AMDA, when a "getParameter" request takes longer than 4 minutes, execution switches to batch mode.

In this case, the result will look like:

{
    "success": true,
    "status": "in progress",
    "id": "process_ucuGXR_1650348560_252656"
}

In that case, the getStatus API can be used to retrieve the status. It can be called repeatedly until the request is complete. When it is done, the result should be:

{
    "success": true,
    "status": "done",
    "dataFileURLs": "http://amda.irap.omp.eu/AMDA//data/WSRESULT/getparameter_mms1_dce_qual_brst_35b6786739efcdc5a74ab1dca29d3b6b_20210101T000000_20210102T000000.txt"
}

It seems that Speasy does not implement this scenario.

For information, our AMDA backend sits behind a proxy with a 5-minute timeout. This is why we need to switch to "batch mode" when a request takes "too long" to execute.
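The polling scenario described above could be sketched roughly as follows. This only relies on the two JSON shapes shown in this comment; how getStatus is actually called (URL, parameter names) is not covered here, so the get_status callable, the function name, and the poll_interval and max_wait values are all assumptions.

```python
import time
from typing import Callable, Dict, Optional


class BatchJobTimeout(Exception):
    """Raised when a batch job is still running after max_wait seconds."""


def wait_for_batch_result(first_response: Dict,
                          get_status: Callable[[str], Dict],
                          poll_interval: float = 5.0,
                          max_wait: float = 3600.0) -> Optional[str]:
    """Poll a batch job until it is done and return its dataFileURLs value.

    get_status is a hypothetical callable taking the job id and returning
    the decoded JSON answer of the getStatus API.
    """
    response = first_response
    job_id = first_response.get("id")
    deadline = time.monotonic() + max_wait
    while response.get("status") == "in progress":
        if time.monotonic() >= deadline:
            raise BatchJobTimeout(f"job {job_id} still running after {max_wait} s")
        time.sleep(poll_interval)
        response = get_status(job_id)
    if response.get("success") and response.get("status") == "done":
        return response.get("dataFileURLs")
    # Any other answer is treated as a failure in this sketch
    return None
```

Bounding the wait with max_wait keeps the client from blocking forever if the job never completes, which is the symptom reported in this issue.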

jeandet commented 2 years ago

@brenard-irap can we use this from the REST API now?

brenard-irap commented 2 years ago

@jeandet Yes

jeandet commented 2 years ago

Ok, I propose to work on that during next week's workshop.

Dolgalad commented 2 years ago

Keep in mind that when the timeout is reached, the "batch mode" task is created on the server. This means that if a user interrupts speasy while it's getting data, the task keeps running on the server, which is why I don't like the timeout solution. Splitting the time range into intervals means that if the user interrupts the process, only a single block of data will have been requested from the server. Splitting the data also provides a natural way of notifying the user of the progress of the request (functionality I find useful when dealing with long time periods).
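To illustrate the progress notification, here is a rough sketch of a block-by-block download with a progress callback. The names download_in_blocks, fetch_block and on_progress are hypothetical, and fetch_block stands in for whatever actually requests one block from the server.

```python
from typing import Callable, List, Optional


def download_in_blocks(start: float, stop: float, block_seconds: float,
                       fetch_block: Callable[[float, float], List],
                       on_progress: Optional[Callable[[float], None]] = None) -> List:
    """Fetch [start, stop) one block at a time and report the completed fraction.

    start and stop are UNIX timestamps. fetch_block is a hypothetical callable
    returning the samples of a single block as a list.
    """
    if block_seconds <= 0:
        raise ValueError("block_seconds must be positive")
    collected: List = []
    block_start = start
    while block_start < stop:
        block_stop = min(block_start + block_seconds, stop)
        # Only this one request is outstanding at any time, so an
        # interruption (e.g. Ctrl-C) leaves at most one block pending
        collected.extend(fetch_block(block_start, block_stop))
        block_start = block_stop
        if on_progress is not None:
            on_progress((block_start - start) / (stop - start))
    return collected
```

A caller could pass something like `lambda f: print(f"{f:.0%}")` as on_progress to display the progress of a long request.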

Another simple way of dealing with this problem is to raise an Exception if a timeout is reached. The value of the timeout needs to be smaller than the 4 minutes used by AMDA.

Dolgalad commented 2 years ago

PR #41
