SciQLop / speasy

Space Physics made EASY! A simple Python package to deal with main Space Physics WebServices (CDA,SSC,AMDA,..)

AMDA get_data never returns on long requests #40

Closed Dolgalad closed 2 years ago

Dolgalad commented 2 years ago

Description

AMDA creates background jobs to handle requests that take too long to answer (timeout exceeded). Speasy is not notified when this happens, so when retrieving a large dataset the get_data method may never return: it waits indefinitely.

What I Did

import speasy as spz
param_id = "amda/solo_b_rtn_hr"
start = "2020/01/01T00:00:00"
stop = "2021/01/01T00:00:00"
p = spz.get_data(param_id, start, stop)

Solution

Modified the dl_parameter function (and added a parameter_concat helper) in the speasy.webservices.amda._impl module:

import numpy as np
from datetime import timedelta
....
    def parameter_concat(self, param1, param2):
        """Concatenate parameters
        """
        if param1 is None and param2 is None:
            return None
        if param1 is None:
            return param2
        if param2 is None:
            return param1
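        # Concatenate along the time axis (axis 0) so that multi-component
        # data of shape (N, k) is extended in time rather than widened.
        param1.time = np.concatenate((param1.time, param2.time), axis=0)
        param1.data = np.concatenate((param1.data, param2.data), axis=0)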
        return param1

    def dl_parameter(self, start_time: datetime, stop_time: datetime, parameter_id: str, **kwargs) -> Optional[
        SpeasyVariable]:
        if isinstance(start_time, datetime):
            start_time = start_time.timestamp()
        if isinstance(stop_time, datetime):
            stop_time = stop_time.timestamp()
        dt = timedelta(days=1).total_seconds()
        if stop_time - start_time > dt:
            # Split long requests into blocks of at most one day and
            # download them one after the other.
            var = None
            curr_t = start_time
            while curr_t < stop_time:
                # The last block is clipped to stop_time.
                block_stop = min(curr_t + dt, stop_time)
                #print(f"Getting block {datetime.utcfromtimestamp(curr_t)} -> {datetime.utcfromtimestamp(block_stop)}")
                var = self.parameter_concat(var, self.dl_parameter(curr_t, block_stop, parameter_id, **kwargs))
                curr_t = block_stop
            return var

        url = rest_client.get_parameter(
            startTime=start_time, stopTime=stop_time, parameterID=parameter_id, timeFormat='UNIXTIME',
            server_url=self.server_url, **kwargs)
        if url is not None:
            var = load_csv(url)
            if len(var):
                log.debug(
                    f'Loaded var: data shape = {var.values.shape}, data start time = {datetime.utcfromtimestamp(var.time[0])}, data stop time = {datetime.utcfromtimestamp(var.time[-1])}')
            else:
                log.debug('Loaded var: Empty var')
            return var
        return None

jeandet commented 2 years ago

Reading this issue makes me think about introducing a max_request_duration parameter that could default to 1 day. Ideally we would cap the data size of each request instead, but that is really hard to estimate. Letting the user override this value makes it possible to increase it for really slow datasets, where a request spanning several days would be OK and likely more efficient.
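
As a rough illustration of the idea, here is a minimal sketch of splitting a request into blocks no longer than a configurable duration. The helper name split_time_range and its signature are hypothetical; only the max_request_duration name and its 1 day default come from the comment above, and nothing here is Speasy's actual API.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def split_time_range(
    start: datetime,
    stop: datetime,
    max_request_duration: timedelta = timedelta(days=1),
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield (block_start, block_stop) pairs covering [start, stop).

    Hypothetical helper: each block lasts at most max_request_duration,
    and the last block is clipped to stop.
    """
    if max_request_duration <= timedelta(0):
        raise ValueError("max_request_duration must be positive")
    current = start
    while current < stop:
        block_stop = min(current + max_request_duration, stop)
        yield current, block_stop
        current = block_stop


if __name__ == "__main__":
    # A 2.5 day request with the default 1 day limit gives three blocks.
    for block in split_time_range(datetime(2020, 1, 1), datetime(2020, 1, 3, 12)):
        print(block)
```

A caller working with a sparse dataset could pass a larger value, for example max_request_duration=timedelta(days=7), to issue fewer and longer requests.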

brenard-irap commented 2 years ago

In AMDA, when a "getParameter" request takes longer than 4 minutes, the execution enters batch mode.

In this case, the result will look like:

{
    "success": true,
    "status": "in progress",
    "id": "process_ucuGXR_1650348560_252656"
}

In this case, the getStatus API can be used to retrieve the status. It can be called repeatedly until the request is complete. When it is done, the result should be:

{
    "success": true,
    "status": "done",
    "dataFileURLs": "http://amda.irap.omp.eu/AMDA//data/WSRESULT/getparameter_mms1_dce_qual_brst_35b6786739efcdc5a74ab1dca29d3b6b_20210101T000000_20210102T000000.txt"
}

It seems that Speasy does not implement this scenario.

For information, our AMDA backend is behind a proxy with a timeout of 5 minutes. This is why we need to enter "batch mode" when the execution of a request is "too long".
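
The polling scenario described above could be sketched as follows. This is only an illustration under stated assumptions: wait_for_result and the get_status callable are hypothetical names, the callable is assumed to wrap a call to the getStatus API and return the parsed JSON as a dict, and the only response fields relied on are the ones shown in the two examples ("success", "status", "id", "dataFileURLs").

```python
import time
from typing import Callable, Optional


def wait_for_result(
    first_response: dict,
    get_status: Callable[[str], dict],
    poll_interval: float = 5.0,
    max_wait: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll a batch job until it finishes and return its dataFileURLs.

    get_status is a hypothetical callable taking the job id and returning
    the parsed getStatus response. Returns None if the request failed.
    """
    response = first_response
    # Remember the id from the first answer; it is an assumption that later
    # status answers may not repeat it.
    job_id = first_response.get("id")
    waited = 0.0
    while response.get("status") == "in progress":
        if waited >= max_wait:
            raise TimeoutError(f"job {job_id} still running after {max_wait} s")
        sleep(poll_interval)
        waited += poll_interval
        response = get_status(job_id)
    if response.get("success"):
        return response.get("dataFileURLs")
    return None
```

Bounding the total wait with max_wait keeps the client from blocking forever if the job never completes, which is the symptom reported in this issue.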

jeandet commented 2 years ago

@brenard-irap can we use this from the REST API now?

brenard-irap commented 2 years ago

@jeandet Yes

jeandet commented 2 years ago

OK, I propose to work on that during next week's workshop.

Dolgalad commented 2 years ago

Keep in mind that when the timeout is reached, the "batch mode" task is created on the server. This means that if a user interrupts speasy while it's getting data, the task will keep running on the server; this is why I don't like the timeout solution. Splitting the time range into intervals means that if the user interrupts the process, only a single block of data will have been requested from the server. Splitting the data also provides a natural way of notifying the user of the progress of the request (functionality I find useful when dealing with long time periods).
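
To make the progress argument concrete, here is a minimal sketch of a block-by-block download that reports progress after each block. All names (download_in_blocks, fetch_block, on_progress) are hypothetical, and fetch_block stands in for whatever function performs a single short request; times are plain UNIX timestamps in seconds.

```python
import math
from typing import Any, Callable, List, Optional


def download_in_blocks(
    fetch_block: Callable[[float, float], Any],
    start: float,
    stop: float,
    block_seconds: float = 86400.0,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> List[Any]:
    """Fetch [start, stop) one block at a time and return the pieces.

    on_progress, if given, is called as on_progress(blocks_done, n_blocks)
    after each block has been fetched.
    """
    n_blocks = math.ceil((stop - start) / block_seconds)
    pieces = []
    for i in range(n_blocks):  # empty when stop <= start
        block_start = start + i * block_seconds
        block_stop = min(block_start + block_seconds, stop)
        # Only this one request is outstanding if the user interrupts here.
        pieces.append(fetch_block(block_start, block_stop))
        if on_progress is not None:
            on_progress(i + 1, n_blocks)
    return pieces
```

Passing something like on_progress=lambda done, total: print(f"{done}/{total} blocks") is enough to give feedback on long time periods.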

Another simple way of dealing with this problem is to raise an Exception if a timeout is reached. The value of the timeout needs to be smaller than the 4 minutes used by AMDA.
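
One possible shape for the exception approach, as a sketch only: run the request in a worker thread and raise if no answer arrives within a client-side limit. RequestTimeout and call_with_timeout are hypothetical names, and the 200 second value in the usage note is just an example below the 4 minutes mentioned above.

```python
import concurrent.futures


class RequestTimeout(Exception):
    """Hypothetical exception raised when a request takes too long."""


def call_with_timeout(func, timeout_seconds, *args, **kwargs):
    """Call func(*args, **kwargs) and raise RequestTimeout if it is too slow.

    The call itself is not cancelled: the worker thread keeps running in the
    background, so this only stops the caller from waiting.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        raise RequestTimeout(f"no answer after {timeout_seconds} s") from None
    finally:
        # Do not block on the worker thread when leaving this function.
        executor.shutdown(wait=False)
```

A caller could then wrap a single download as call_with_timeout(download, 200, ...), where download is whatever function issues the request, and catch RequestTimeout to suggest a shorter time range. This does not stop work already started on the server.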

Dolgalad commented 2 years ago

PR #41