Open: sigmafelix opened this issue 6 months ago
Thanks for bringing this up @sigmafelix. Creating a file size check function, following the first suggested approach, would be relatively simple with the httr::GET and file.size functions.
> head(u)
[1] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0030z.nc4"
[2] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0130z.nc4"
[3] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0230z.nc4"
[4] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0330z.nc4"
[5] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0430z.nc4"
[6] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0530z.nc4"
> download.file(
+ u[1],
+ "/Users/manwareme/Desktop/geos_example.nc4"
+ )
trying URL 'https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0030z.nc4'
Content type 'application/octet-stream' length 7418695 bytes (7.1 MB)
==================================================
downloaded 7.1 MB
> file.size("/Users/manwareme/Desktop/geos_example.nc4")
[1] 7418695
> httr::GET(u[1])$headers$`content-length`
[1] "7418695"
> (file.size("/Users/manwareme/Desktop/geos_example.nc4")
+ == as.numeric(httr::GET(u[1])$headers$`content-length`))
[1] TRUE
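For reuse, those steps could be wrapped into a small helper. A minimal sketch, assuming the server always reports Content-Length (the function and argument names are illustrative):

```r
# Hypothetical helper: compare the size of a downloaded file against the
# Content-Length reported by the server for its source URL.
check_file_size <- function(url, destfile) {
  remote_size <- as.numeric(httr::GET(url)$headers$`content-length`)
  local_size <- file.size(destfile)
  if (length(remote_size) == 0 || is.na(remote_size) || is.na(local_size)) {
    return(NA)
  }
  local_size == remote_size
}

# Should return TRUE for the example file downloaded above.
check_file_size(u[1], "/Users/manwareme/Desktop/geos_example.nc4")
```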
My immediate concern with this approach is its performance at scale. Retrieving the size with httr::GET is quick for a single URL, but performance slows substantially even with a relatively small number of URLs (n = 24, equivalent to one day of GEOS-CF data).
> microbenchmark(
+ httr::GET(u[1]),
+ lapply(u, httr::GET),
+ times = 5
+ )
Unit: milliseconds
expr min lq mean median uq max neval
httr::GET(u[1]) 119.9206 124.586 364.5255 209.3135 461.9255 906.8822 5
lapply(u, httr::GET) 3672.5812 3858.853 10505.7787 4478.3872 5308.7727 35210.2997 5
> 3672.5812/119.9206 # min relative performance
[1] 30.62511
> 35210.2997/906.8822 # max relative performance
[1] 38.82566
> 10505.7787/364.5255 # mean relative performance
[1] 28.82042
There may be potential performance benefits from the httr2 functions, but I have still only tested with 24 files. I will do some more comparisons between the httr and httr2 functions.
> httr2_requester <- function(url) {
+ httr2::request(url) |> httr2::req_perform()
+ }
> microbenchmark(
+ lapply(u, httr2_requester),
+ lapply(u, httr::GET),
+ times = 5
+ )
Unit: seconds
expr min lq mean median uq max neval
lapply(u, httr2_requester) 4.667334 4.713385 5.121353 4.919260 5.50979 5.796994 5
lapply(u, httr::GET) 4.002419 5.042487 7.500990 6.052272 10.32270 12.085070 5
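The httr2 variant could likewise request only the headers, and httr2 also provides req_perform_parallel() for issuing requests concurrently. A rough sketch, assuming req_perform_parallel() is available in the installed httr2 version and that the portal answers HEAD requests:

```r
# Hypothetical: build a HEAD request per URL, perform them in parallel,
# and extract Content-Length from each response.
reqs <- lapply(u, function(x) httr2::req_method(httr2::request(x), "HEAD"))
resps <- httr2::req_perform_parallel(reqs)
remote_sizes <- vapply(
  resps,
  function(resp) {
    size <- httr2::resp_header(resp, "content-length")
    if (is.null(size)) NA_real_ else as.numeric(size)
  },
  numeric(1)
)
```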
@mitchellmanware Thank you for sharing the possible solutions. Checking the return value of wget at the shell-script level could be another option (cf. https://stackoverflow.com/questions/2717303/check-wgets-return-value).
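From R, that shell-level check could look roughly like the sketch below, where system2() returns wget's exit status and 0 indicates success (the helper name is illustrative):

```r
# Hypothetical: run wget via system2() and treat a non-zero exit status
# as a failed or incomplete download.
download_with_wget <- function(url, destfile) {
  status <- system2("wget", args = c("-O", shQuote(destfile), shQuote(url)))
  status == 0L
}
```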
@sigmafelix Was this addressed in the most recent PR? If not, I will include it in the next round of manuscript-related changes.
@mitchellmanware It is not addressed yet. I think we could proceed with the manuscript without this functionality and add it in the next version of the package.
@mitchellmanware When running the beethoven pipeline in 2022, I found that one (or more) of the GEOS-CF chemical files was downloaded incompletely (i.e., the file causing the error was 2 MB, only one-fortieth the size of a typical GEOS-CF chemical file). Post-download checking or detection of incomplete files would be helpful for users who want to download a large set of files from the internet. For the problematic file, I will replace it with a newly downloaded copy. Could you change the write permission of the input/geos directory in the team project folder, @kyle-messier?
Considerations
- Checksums (e.g., SHA256, MD5SUM) are provided by the data source in some cases. If such a piece of information is retrievable from a JSON listing or an HTTP response header, we could quickly verify the downloaded files with it (see the checksum sketch below).
- The fs package includes many handy functions to summarize files in tibbles. In this case, we could compare each file size with the typical size or a statistic of all downloaded files to indicate which files were probably corrupted or incomplete (see the size-screening sketch below).
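Both ideas could be sketched roughly as below, assuming MD5 checksums were published by the source and using fs::dir_info() for the size summary; the median-based cutoff is only illustrative:

```r
# Hypothetical checksum verification, assuming the expected MD5 digest can be
# obtained from a JSON listing or response header published by the source.
verify_md5 <- function(path, expected_md5) {
  unname(tools::md5sum(path)) == tolower(expected_md5)
}

# Hypothetical size screening with fs: summarize downloaded files in a tibble
# and flag any file far smaller than the typical (median) size.
info <- fs::dir_info("input/geos", glob = "*.nc4")
suspect <- info[info$size < stats::median(info$size) / 3, c("path", "size")]
suspect
```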