Closed: eliocamp closed this issue 2 years ago.
Basically this already happens. `wf_request()` just polls the server and checks the status to grab data using `wf_transfer()`.

You just have to make `ct$request_id` an exported value. Then you cycle over these (as a list) to see which ones are done and which ones aren't. This way you don't need sessions running (assuming that the download time < the wait time for the batch data to show up)!
I just checked and I thought I exported this (the request ID) previously, but it might have been tossed at some point.
OK, it is provided for here:
https://github.com/khufkens/ecmwfr/blob/60d2b35fc1fd1eaf92505dea5e7b9db36ae79ed3/R/wf_request.R#L187
When no transfer is initiated, you exit and forward the `ct` object (including the request ID). (Take into account that there is a limit of 20 calls for CDS, I think.)
```r
requests <- list(request_1, request_2)

ids <- lapply(requests, function(request) {
  wf_request(request, transfer = FALSE)
})

# then have a loop go over the ids list to check availability of the data
```
Oh, I didn't realise that! I'll try it out and report back :P
I've been doing that and have some questions/suggestions, let me know what you think:
`wf_transfer()` needs the `service` parameter but `wf_request()` doesn't. Why is that? I seem to recall that `wf_request()` uses information in `request` to figure out which service to use, but I'm not sure; `wf_transfer()` could extract it.

Some thoughts: putting more info in `wf_transfer()` is easy, although checks need to make sure the `wf_request()` routine doesn't fail when tinkering with it. Failures might indeed need to be trapped with a `try()` statement.
I don't see the construction with the R6 class; could you specify this a bit more? Does this sit at the transfer or request level? An example?
If the returned object of a request is an R6 class, then it could be easier to work with. If the returned object is `transfer`, then it could contain all the information (status, id, url, etc.) as well as functions to check and download files in a more consistent manner. For example, `transfer$check()` would perform the check and also update the value of `transfer$status` (no need to assign the returned object of `wf_transfer()`). Alternatively, `transfer$download()` could (potentially) check, download, and then update the relevant information (including download status). The function names and all that are just illustrative. Furthermore, I think it would be better to create functional wrappers (e.g. `wf_transfer <- function(x) x$transfer`) because most R users are used to functional syntax.

The advantage is that the information in the object is updated at the same time as it is checked, which keeps everything correctly in sync. I think it's doable without R6 classes, but R6 gives a natural framework.
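To make it concrete, here's a minimal sketch of what I have in mind (everything is hypothetical: class, field, and method names are illustrative, and the actual API call is stubbed out):

```r
library(R6)

Transfer <- R6Class("Transfer",
  public = list(
    status = "unknown",
    request_id = NULL,
    initialize = function(request_id) {
      self$request_id <- request_id
    },
    check = function() {
      # Poll the service and update the status in place, so there is
      # no need to re-assign the returned object.
      self$status <- private$poll_service()
      invisible(self)
    },
    download = function(path = ".") {
      # Check first, then grab the file if the data is ready.
      self$check()
      if (self$status == "completed") {
        # ...download logic would go here...
        self$status <- "downloaded"
      }
      invisible(self)
    }
  ),
  private = list(
    poll_service = function() {
      "queued"  # stub standing in for the real status request
    }
  )
)

# A functional wrapper keeps the familiar syntax for most R users:
wf_check <- function(x) x$check()
```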
> Does this sit at the transfer or request level? An example?
I think it's natural at the transfer level rather than at the request level. A request is just a list of parameters, but once it's sent to the server, each transfer is its own unique object.
One caveat, I think, because of how CDS and/or the webapi are coded: `transfer$check()` will result in `transfer$download()`. If I'm not mistaken, I found no way to just get the status of things (without actually downloading the data) once the data is available! Might be best to figure that one out first!
I added the ability to check for status without downloading the full dataset in this PR
Thanks, I didn't remember. If you can make it work with R6, go for it, I would say.
Great! I'll get into it :smile:
Have fun, sorry for not being much help. Been a bit strung out of late.
Thanks, I will. And don't worry about it :)
I've implemented something.
https://github.com/eliocamp/ecmwfr/tree/R6transfer
It passes all tests and seems to be working fine. It's still a WIP, though.
I'm using this branch to come up with the parallel request logic and finding it not at all trivial.
Where do you see the stats? It seems that your branch isn't picked up by Travis.
Sorry, I'm still working in my fork.
I think I now have something that I'm confident in. I was a bit worried about such a big change to the internal workings of the package, but I've tested it, and I'm now using it to parallelise hundreds of requests; it's working OK.

The one thing I'm not entirely confident about is the "status" flag. It's kind of tricky because each service has its own lingo and codes. Do you have the documentation for the webapi and CDS handy? I can't seem to find it.
Do you have an example of this 'status' flag in use? A link to your changes, branch?

My feeling is that the API isn't too well documented because most people work via the provided Python interface (supplied by the institute). This is a common issue among a lot of APIs: poor docs.
Here, for example: https://github.com/eliocamp/ecmwfr/blob/b5fb467f079a12a7b73518b8897d6be42f616140/R/R6_transfer.R#L162
It uses the "status" parameter to test if the request is completed, failed or not completed. That's for CDS. In the webapi case it uses the "code" parameter. I'd like to abstract away the specifics of each API and unify all so that neither the user nor the rest of the code needs to know about it.
Here's one possible example of that: an `is_running()` function that tests whether the request is still being processed (or in the queue) and that should work regardless of the service.
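Roughly along these lines (a sketch only; the field names and state values below are assumptions about what each API returns, not the real thing):

```r
# Hide the per-service lingo behind a single predicate. Assumed: CDS
# reports its state in a "status" field and the webapi in a "code"
# field; the specific values are placeholders.
is_running <- function(ct, service = c("cds", "webapi")) {
  service <- match.arg(service)
  if (service == "cds") {
    ct$status %in% c("queued", "running")
  } else {
    ct$code == 202  # e.g. an HTTP-202-style "accepted, still processing"
  }
}
```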
The error codes of the WebAPI are here: https://confluence.ecmwf.int/display/WEBAPI/Web+API+Troubleshooting
But before expending too much energy on the webapi, I would ask when this service will be deprecated. I have the feeling that CDS has gained steam and acceptance, and the webapi will shut down soon anyway.
It might be good to take this into consideration, and focus mostly on CDS first.
OK, it seems the webapi will not be merged with CDS. That would seem the sensible thing to do given the redundancy, but politics, probably. Anyway.
The `is_running()` method makes sense given that we'll have to juggle services, and indeed the lingo would be consistent across services.

Should we transfer the code from your fork to a branch to do some unit testing? This won't break anything and will make tracking things easier.
The issue now with `is_running()` is that it tests whether the request has successfully completed, but it returns `TRUE` in the case of errors.

I guess I'm a bit confused about the various states a request can be in for each of the two services and how they are communicated by the respective APIs, so I'm having a hard time trying to unify them.
Doesn't the API return 400-type status codes on calls, or similar? The webapi doesn't, I think; it returns error messages which might be "grep"-able.
I'm not certain. Maybe I should test it out with malformed and/or impossible requests.
Ok, I did some sleuthing and this is what I've got so far, some of which was already in the code.
CDS:
WEBAPI:
BadRequest("mars failed, code 1:\nnice mars /tmp/51/a2/tmp-_mars3O2dt1.req\nmars - WARN - \nmars - WARN - From 29 January 2019 10AM (UTC) MARS uses the interpolation\nmars - WARN - provided by the MIR library. For more details, see\nmars - WARN - https://confluence.ecmwf.int/display/UDOC/MARS+interpolation+with+MIR\nmars - WARN - \nPPDIR is /var/tmp/ppdir/x86_64\nmars - INFO - 20200130.184954 - Welcome to MARS\nmars - INFO - 20200130.184954 - MARS Client build stamp: 20191127160836\nmars - INFO - 20200130.184954 - MARS Client version: 6.27.2\nmars - INFO - 20200130.184954 - MIR version: 1.3.6\nmars - INFO - 20200130.184954 - Using ecCodes version 2.15.0\nmars - INFO - 20200130.184954 - Using odb_api version: 0.15.11 (file format version: 0.5)\nmars - INFO - 20200130.184954 - Using FDB5 version: 5.5.5\nmars - INFO - 20200130.184954 - Maximum retrieval size is 30.00 G\nretrieve,stream=opera,levelist=1/2/3/4/5/6/7/8/9/10/11/12/13/14/
Thanks for the sleuthing, this should be workable I think.
I'm not sure how prone parallel requests will be to failure if one request goes down in an unclean way. Basically, we need to avoid a situation where a single failing request makes the whole batch break. The failure itself isn't technically an issue, as you can still go to your dashboard online and grab the data there (if it's not removed from the queue explicitly).
I think the hard limit on running requests for CDS is 20; I've no idea what it is for the WEBAPI. So at least one error trap should be in place for the case where people submit > 20 requests: either give a warning, or cycle through them in batches of 20 (or xyz for the WEBAPI).

I did the latter, but I thought the limit was 5 running requests (if I submitted more, the rest were kept in the queue). So I made a function that made sure there were no more (and no fewer) than 5 running requests at once.
The question of failure states is also important. At present I added a "fail_is_error" flag, but I don't think that's the way to go for a robust implementation. My guess is that we should try to distinguish between permanent errors and transient ones: the latter being connectivity issues or service hiccups, and the former being malformed requests or any other problem that means we are certain the request will not succeed.

For connectivity issues, we could have a retry limit and maybe use `curl::has_internet()`. For permanent errors, we could trust the `permanent` flag in CDS (although I don't know what it really means) or just take any error returned by the APIs (like "failed", "rejected" and "aborted") as permanent.
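Something like this, as a rough sketch (the state names and the `transfer$check()` method follow the R6 design sketched above and are assumptions, not the real API):

```r
# Retry transient failures up to a limit, abort on permanent ones.
retry_check <- function(transfer, max_retries = 5, wait = 10) {
  permanent <- c("failed", "rejected", "aborted")  # assumed error states
  for (i in seq_len(max_retries)) {
    if (!curl::has_internet()) {
      # Transient: no connection right now, so wait and try again.
      Sys.sleep(wait)
      next
    }
    transfer$check()
    if (transfer$status %in% permanent) {
      stop("request failed permanently: ", transfer$status)
    }
    if (transfer$status == "completed") {
      return(invisible(transfer))
    }
    Sys.sleep(wait)  # still queued or running
  }
  warning("retry limit reached; request still pending")
  invisible(transfer)
}
```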
Some Travis builds are failing due to `API rate limit exceeded`. Any idea what's going on?
There might not only be a limit on concurrent calls but also on how many you can make in total. I hit this at times when I push many patches (generating tons of traffic as CI runs through the tests).
Ok. Normal pains of testing APIs, then. I'll restart the jobs and give them another chance. Tests are passing on my machine with my user.
Hi again! It's been *checks clock* more than a year! 😱 since I tinkered with this, and there have been a lot of commits to master in the meantime. But I sat down and basically re-implemented the R6 stuff on top of the latest commit. This now lives here: https://github.com/bluegreen-labs/ecmwfr/tree/R6-classes
This implementation is a bit better because it captures the logic of each service in its own class, which inherits the basic methods from a parent "service" class. This should make it simpler to implement and maintain each service, I think.
I'm not familiar with the ADS service, but from what I've seen of the code, the only difference is here:
Therefore, the ads service actually inherits from the cds service 😎. I don't have a user to run the tests, though.
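Roughly like this (a sketch with the method bodies stubbed out; the service URLs are what I'd expect them to be, so treat them as assumptions):

```r
library(R6)

# Shared logic lives in the parent "service" class; cds specialises it,
# and ads inherits from cds, overriding only what differs (assumed
# here: just the API URL).
service <- R6Class("service",
  public = list(
    url = NULL,
    submit = function(request) stop("not implemented"),
    transfer = function(path = ".") stop("not implemented")
  )
)

cds <- R6Class("cds", inherit = service,
  public = list(
    url = "https://cds.climate.copernicus.eu/api/v2",  # assumed
    submit = function(request) {
      # CDS-specific request logic would live here.
      invisible(self)
    }
  )
)

ads <- R6Class("ads", inherit = cds,
  public = list(
    url = "https://ads.atmosphere.copernicus.eu/api/v2"  # assumed
  )
)
```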
The only pending issue is implementing `wf_delete()` and fixing some issues with backwards compatibility that surfaced when testing.

With this change, the `wf_request()` function, when `transfer = FALSE`, returns the request object without first submitting it to the service.
With this, it was very easy to create a relatively smart function that takes a list of requests (e.g. the output of `lapply(request_list, function(x) wf_request(x, transfer = FALSE))`) and then submits them in parallel:
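The logic, in a simplified sketch (the `submit()`, `is_running()` and `download()` methods follow the R6 sketches above and are placeholders rather than the final API):

```r
# Keep up to `workers` requests running at once, download each one as
# it finishes, and top up the pool from the pending list.
wf_submit_batch <- function(requests, workers = 2, path = ".", wait = 10) {
  pending <- requests
  running <- list()
  done    <- list()

  while (length(pending) > 0 || length(running) > 0) {
    # Top up the pool of running requests.
    while (length(running) < workers && length(pending) > 0) {
      req     <- pending[[1]]
      pending <- pending[-1]
      req$submit()
      running <- c(running, list(req))
    }
    # Download whatever has finished and drop it from the pool.
    still_running <- vapply(running, function(r) r$is_running(), logical(1))
    for (r in running[!still_running]) {
      r$download(path)
      done <- c(done, list(r))
    }
    running <- running[still_running]
    Sys.sleep(wait)
  }
  invisible(done)
}
```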
Let me know what you think :)
Hi @eliocamp
Thanks for the additions.
The ADS service has unit tests in my master branch, which should run on a pull request (I think, my keys might be expired). As long as unit tests run cleanly everything should be fine.
The `transfer = FALSE` option is basically a provision to submit a query to the server, which is then queued online without the state being tracked in an R session. This behaviour allows large or long queries to be submitted and only downloaded once they finish (after several hours), providing asynchronous downloads.
Basically, you could line up a download from a laptop and download it later to a production server (HPC). Or, for larger jobs, I often start one overnight. I'm not sure this behaviour is maintained in your implementation; I think it should be. This stateless query removes the need to juggle tmux/screen or nohup sessions, keeping a session alive (or computers running) just for a download spinner!
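For example, sketching the idea (argument names here are indicative; check the function docs for the exact signature):

```r
# On the laptop: submit the query but don't wait for it.
ct <- wf_request(request, transfer = FALSE)
saveRDS(ct$request_id, "request_id.rds")

# Later, on the HPC: pick up the finished job using the stored id.
id <- readRDS("request_id.rds")
wf_transfer(url = id, service = "cds", path = "data")
```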
I also think the hard limit on the API is 20 parallel requests, if I'm not mistaken. So `workers` can be set to a preset value of 20 as the default upper limit. There is no reason to limit it to a rate which is much lower, I think. I should check the API docs, but it's in there somewhere.
I would also propose `wf_batch_request()` as the function name. Typos will cause mess-ups, I fear; better to make it really obvious which function is called. `wf_request()` <-> `wf_requests()` is just too similar; I had to look twice myself.
[FYI: I'm in the middle of planning a move to Switzerland so I'm not on my A-game for a little while.]
> The `transfer = FALSE` option is basically a provision to submit a query to the server, which is then queued online without the state being tracked in an R session. This behaviour allows large or long queries to be submitted and only downloaded once they finish (after several hours), providing asynchronous downloads.
>
> Basically, you could line up a download from a laptop and download it later to a production server (HPC). Or, for larger jobs, I often start one overnight. I'm not sure this behaviour is maintained in your implementation; I think it should be. This stateless query removes the need to juggle tmux/screen or nohup sessions, keeping a session alive (or computers running) just for a download spinner!
In my implementation you would need to do

```r
request <- wf_request(query, transfer = FALSE)
request$submit()
```

So, it's not backwards compatible. I'll change that. I'll need to rethink the workflow of batch requests. Perhaps `wf_batch_requests()` could take a list of queries instead of a list of unsubmitted requests.
> I also think the hard limit on the API is 20 parallel requests, if I'm not mistaken. So `workers` can be set to a preset value of 20 as the default upper limit. There is no reason to limit it to a rate which is much lower, I think. I should check the API docs, but it's in there somewhere.
Yesterday I set up 4 requests to the CDS service and it only processed 2 at a time (the others remained queued until the first two finished), so that's why I went with two. I don't know exactly how that works, though.
> I would also propose `wf_batch_request()` as the function name. Typos will cause mess-ups, I fear; better to make it really obvious which function is called. `wf_request()` <-> `wf_requests()` is just too similar; I had to look twice myself.
Good catch.
> [FYI: I'm in the middle of planning a move to Switzerland so I'm not on my A-game for a little while.]
Good luck with that!
If it is possible to keep the current transfer behaviour, that would be great. A lot of the issues filed here relate to breaking changes between versions in how requests are dealt with (renaming some of the variables, etc.). This issue creeps up on people, as R will update packages from CRAN but won't warn them of these changes.
I used to be able to queue more than two downloads; they might have changed the rate limit. Anyway, if this is what your experience has been recently, we'd better peg it at two.
Today I got news that I was able to arrange an apartment remotely, which is an absolute win at any time, and certainly in COVID times. I'm still looking at a quarantine, but at least not in a faceless hotel with nothing but take-out for ten days.
I'm in the situation that I need to download a lot of data, which I have to divide into multiple requests. Each request can take on the order of hours (depending on CDS's mood that day), but I figured out that you can just send a bunch of requests in parallel and let the service handle the load as it sees fit.

There are many ways of doing this, but I went for the easy one of literally parallelising the calls, something like this:
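(A simplified sketch of the kind of thing I mean, not the literal snippet, using the `parallel` package; `mclapply()` relies on forking, so Linux/macOS only.)

```r
library(parallel)

requests <- list(request_1, request_2, request_3)

# Each worker blocks on its own wf_request() call until its file is
# downloaded; mclapply() runs them concurrently.
files <- mclapply(requests, function(request) {
  wf_request(request, transfer = TRUE, path = "data")
}, mc.cores = length(requests))
```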
Do you think it would be worthwhile to handle some kind of multiple requests natively? Essentially, allowing the `request` argument to be a list of requests?