bluegreen-labs / ecmwfr

Interface to the public ECMWF API Web Services
https://bluegreen-labs.github.io/ecmwfr/

`wf_request_batch()` with `transfer=FALSE` #103

Closed: Rafnuss closed this 2 years ago

Rafnuss commented 2 years ago

Is it possible to have a batch request without transfer?

The documentation for wf_request() and wf_request_batch() reads:

Stage a data request, and optionally download the data to disk. Alternatively you can only stage requests, logging the request URLs to submit download queries later on using wf_transfer.

But as I understand it, wf_request_batch() doesn't have an option to just stage the request. Is this correct? Or am I missing something?

eliocamp commented 2 years ago

Yes, you're correct. Right now wf_request_batch() submits workers requests at a time and downloads each one when it finishes. How would you envision only staging requests working?

Rafnuss commented 2 years ago

My ideal code would look something like:

requests <- wf_request_batch(request_list, transfer = FALSE)
# some time later...
wf_transfer(requests)

instead of

requests <- vector("list", length(request_list))
for (i_req in seq_along(request_list)) {
  # stage each request without downloading
  requests[[i_req]] <- wf_request(request_list[[i_req]], transfer = FALSE)
}
# some time later...
for (i_req in seq_along(requests)) {
  # download the staged data
  wf_transfer(requests[[i_req]])
}

But maybe this is quite a specific need that nobody else shares...

eliocamp commented 2 years ago

So you're submitting all the requests at once and then downloading them when they are done? The added value of wf_request_batch() is the built-in queue, which ensures that you only send a maximum number of requests at a time so they don't pile up on the server end.

For your use case I'd use req <- lapply(request_list, wf_request, transfer = FALSE) and then lapply(req, wf_transfer).
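To spell that out, a minimal sketch of the pattern (assuming request_list is a list of valid request lists and your API credentials are already set with wf_set_key()):

library(ecmwfr)

# stage all requests on the server without downloading anything yet
staged <- lapply(request_list, wf_request, transfer = FALSE)

# some time later: download each request once it has been processed
files <- lapply(staged, wf_transfer)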

@khufkens what do you think?

khufkens commented 2 years ago

Correct, assuming that you don't exceed the maximum number of allowed parallel requests.

The recent work of @eliocamp explicitly addresses the latter, monitoring the queue to download finished requests and submit new ones as slots free up. So as long as you colour within the lines, the proposed fix (above) should work.
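To make that concrete, a minimal sketch of the queue-aware batch call (the workers value of 2 is just an illustration; set it to whatever your service allows in parallel):

library(ecmwfr)

# download everything, keeping at most `workers` requests
# active on the server at any one time
files <- wf_request_batch(
  request_list,
  workers = 2,
  path = tempdir()
)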

Rafnuss commented 2 years ago

Ok, thanks for your answers. lapply() is a much more concise version of my suggestion indeed.

In my case I could have up to 100+ requests to make (of very small files). This is only done once in the overall process of my code, so my thinking would be to make all the requests at once, wait a couple of hours, and then download them all. I've been using wf_request_batch() for cases with a few requests (<30), but I thought it would be nice to keep the R console free to do other things while waiting in the cases with more requests. What do you think?

khufkens commented 2 years ago

Just submit it as a job! Either in a separate terminal (if you are not using an IDE) or via the jobs interface in RStudio. I mostly let jobs like this run in the background in RStudio, or, when using an HPC, as a proper job in the HPC queue.

But yes, it's best to download everything in one pass if you don't need dynamic access.

khufkens commented 2 years ago

For reference:

https://solutions.rstudio.com/r/jobs/
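As an illustration, a sketch of launching such a download script as an RStudio background job (the file name download_requests.R is hypothetical):

# contents of download_requests.R (hypothetical):
#   library(ecmwfr)
#   staged <- lapply(request_list, wf_request, transfer = FALSE)
#   lapply(staged, wf_transfer)

# launch it as a background job from the RStudio console
rstudioapi::jobRunScript(
  path       = "download_requests.R",
  workingDir = ".",
  importEnv  = TRUE  # make request_list visible to the job
)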

Rafnuss commented 2 years ago

Ok, yes, sounds like a good plan. I'm not super familiar with jobs. But for my case, would you use the job_name argument, as in wf_request(., transfer = TRUE, job_name = "test"), or write a script with wf_request_batch() and start it with rstudioapi::jobRunScript()?

khufkens commented 2 years ago

That's effectively the same thing.

I often call things from within RStudio itself, as I usually lump in some pre/post-processing.

eliocamp commented 2 years ago

Bear in mind that lapply(list, wf_request, job_name = "test") won't really work, as it will create 100+ jobs with the same name. I think a better alternative for your use case might be to write a small script with lapply() and wf_request() and then run that script as a job, or even run it in a different R session from the console.

Rafnuss commented 2 years ago

Sounds good. Maybe I'll keep wf_request_batch() in my function (the standard case should be 10-30 requests), and then call that function as a job with https://github.com/lindeloev/job/ when there are more requests. Thanks for your help!
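For reference, a minimal sketch of that plan using the job package (by default job::job() copies the calling environment, so request_list is available inside the job):

# run the batch download as an RStudio background job,
# leaving the console free while it works
job::job({
  library(ecmwfr)
  files <- wf_request_batch(request_list)
})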

khufkens commented 2 years ago

Ok, I'll close this now.

BTW @Rafnuss, nice work on the pressure-based geolocation.