NIEHS / amadeus

https://niehs.github.io/amadeus/
MIT License

Detecting corrupted or incomplete downloads #81

Open sigmafelix opened 1 month ago

sigmafelix commented 1 month ago

@mitchellmanware

When running the beethoven pipeline in 2022, I found that one (or more) of the GEOS-CF chemical files was downloaded incompletely (i.e., the file causing the error was 2 MB, only about one-fortieth the size of a typical GEOS-CF chemical file). A post-download check that detects incomplete files would be helpful for users who download large sets of files from the internet.
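
A minimal sketch of such a post-download check, assuming that files of the same product in one directory are broadly similar in size: flag any file much smaller than the median. The function name `flag_small_files` and the `min_ratio` threshold are hypothetical and not part of amadeus.

```r
# Hypothetical helper: flag files whose size falls below a fraction of the
# median file size in a directory (a heuristic, not a proof of corruption).
flag_small_files <- function(directory, pattern = "\\.nc4$", min_ratio = 0.5) {
  paths <- list.files(directory, pattern = pattern, full.names = TRUE)
  if (length(paths) == 0) {
    return(
      data.frame(path = character(0), size = numeric(0), suspect = logical(0))
    )
  }
  sizes <- file.size(paths)
  data.frame(
    path = paths,
    size = sizes,
    # TRUE when the file is smaller than min_ratio times the median size
    suspect = sizes < min_ratio * stats::median(sizes)
  )
}
```

With `min_ratio = 0.5`, a file one-fortieth of the typical size would be flagged, but the heuristic can misfire when legitimate file sizes vary widely.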

I will replace the problematic file with a newly downloaded copy. Could you change the write permission of the input/geos directory in the team project folder, @kyle-messier?

Considerations

mitchellmanware commented 1 month ago

Thanks for bringing this up @sigmafelix. Creating a file size check function, following the first suggested approach, would be relatively simple with the httr::GET and file.size functions.

> head(u)
[1] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0030z.nc4"
[2] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0130z.nc4"
[3] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0230z.nc4"
[4] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0330z.nc4"
[5] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0430z.nc4"
[6] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0530z.nc4"
> download.file(
+   u[1],
+   "/Users/manwareme/Desktop/geos_example.nc4"
+ )
trying URL 'https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0030z.nc4'
Content type 'application/octet-stream' length 7418695 bytes (7.1 MB)
==================================================
downloaded 7.1 MB

> file.size("/Users/manwareme/Desktop/geos_example.nc4")
[1] 7418695
> httr::GET(u[1])$headers$`content-length`
[1] "7418695"
> (file.size("/Users/manwareme/Desktop/geos_example.nc4")
+   == as.numeric(httr::GET(u[1])$headers$`content-length`))
[1] TRUE
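
The comparison above could be wrapped into small helpers along these lines. This sketch swaps `httr::GET` for `httr::HEAD`, which requests only the headers, whereas `GET` also transfers the response body. It assumes the server answers HEAD requests with a `Content-Length` header, which has not been verified here for the NCCS portal. The names `remote_size` and `is_complete` are hypothetical.

```r
# Hypothetical helpers; not part of amadeus.

# Request headers only and return the advertised size in bytes.
# Returns NA when the server sends no Content-Length header.
remote_size <- function(url) {
  response <- httr::HEAD(url)
  len <- response$headers$`content-length`
  if (is.null(len)) NA_real_ else as.numeric(len)
}

# TRUE when the local file exists and matches the expected size,
# FALSE when it is missing or differs, NA when the expected size is unknown.
is_complete <- function(path, expected_size) {
  local <- file.size(path) # NA if the file does not exist
  ifelse(is.na(expected_size), NA, !is.na(local) & local == expected_size)
}
```

A call such as `is_complete("/Users/manwareme/Desktop/geos_example.nc4", remote_size(u[1]))` would then reproduce the check shown in the transcript.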

My immediate concern with this approach is its performance at scale. Retrieving the size with httr::GET is quick for a single URL, but it slows substantially even with a relatively small number of URLs (n = 24, equivalent to one day's worth of GEOS-CF data).

> microbenchmark(
+   httr::GET(u[1]),
+   lapply(u, httr::GET),
+   times = 5
+ )
Unit: milliseconds
                 expr       min       lq       mean    median        uq        max neval
      httr::GET(u[1])  119.9206  124.586   364.5255  209.3135  461.9255   906.8822     5
 lapply(u, httr::GET) 3672.5812 3858.853 10505.7787 4478.3872 5308.7727 35210.2997     5
> 3672.5812/119.9206 # min relative performance
[1] 30.62511
> 35210.2997/906.8822 # max relative performance
[1] 38.82566
> 10505.7787/364.5255 # mean relative performance
[1] 28.82042

mitchellmanware commented 1 month ago

There are potential performance benefits from using httr2 functions, but so far I have only tested with 24 files. I will do some more comparisons between httr and httr2 functions.

> httr2_requester <- function(url) {
+   httr2::request(url) |> httr2::req_perform()
+ }
> microbenchmark(
+   lapply(u, httr2_requester),
+   lapply(u, httr::GET),
+   times = 5
+ )
Unit: seconds
                       expr      min       lq     mean   median       uq       max neval
 lapply(u, httr2_requester) 4.667334 4.713385 5.121353 4.919260  5.50979  5.796994     5
       lapply(u, httr::GET) 4.002419 5.042487 7.500990 6.052272 10.32270 12.085070     5
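
If httr2 is pursued, one option is to send header-only requests and perform them concurrently. The sketch below assumes httr2 1.0.0 or later (for `httr2::req_perform_parallel`) and a server that honours HEAD requests and reports `Content-Length`; neither assumption has been benchmarked in this thread. The names `header_to_bytes` and `httr2_remote_sizes` are hypothetical.

```r
# Hypothetical helpers; not part of amadeus.

# Convert a Content-Length header value to bytes; NA when it is absent.
header_to_bytes <- function(value) {
  if (is.null(value)) NA_real_ else as.numeric(value)
}

# Build one HEAD request per URL, perform them in parallel, and return the
# advertised sizes. With default settings a failed request raises an error.
httr2_remote_sizes <- function(urls) {
  requests <- lapply(urls, function(url) {
    httr2::request(url) |> httr2::req_method("HEAD")
  })
  responses <- httr2::req_perform_parallel(requests)
  vapply(
    responses,
    function(response) {
      header_to_bytes(httr2::resp_header(response, "content-length"))
    },
    numeric(1)
  )
}
```

The result of `httr2_remote_sizes(u)` could be compared element-wise with `file.size()` on the corresponding local paths.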

sigmafelix commented 1 month ago

@mitchellmanware Thank you for sharing the possible solutions. Checking the exit status of wget at the shell-script level could be another option (cf. https://stackoverflow.com/questions/2717303/check-wgets-return-value).
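
The same idea can be sketched from within R, since `system2()` returns the exit status of the command it runs when output is not captured. The wrapper name `run_download` is hypothetical, and the commented wget call assumes wget is available on the PATH.

```r
# Hypothetical wrapper: run an external download command and report
# whether it exited with status 0.
run_download <- function(command, args) {
  status <- system2(command, args = args)
  list(status = status, ok = identical(as.integer(status), 0L))
}

# Assumed usage with wget; `url` and `dest` are placeholders:
# result <- run_download("wget", c("-q", "-O", shQuote(dest), shQuote(url)))
# if (!result$ok) warning("wget exited with status ", result$status)
```

A zero exit status indicates that wget reported no error; it may still be worth pairing this with a file size check.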

mitchellmanware commented 1 week ago

@sigmafelix

Was this addressed in the most recent PR? If not, I will include it in the next round of manuscript-related changes.

sigmafelix commented 1 week ago

@mitchellmanware It has not been addressed yet. I think we could proceed with the manuscript without this functionality and add it in the next version of the package.