Detecting corrupted or incomplete downloads

sigmafelix commented 1 month ago

@mitchellmanware

When running the beethoven pipeline in 2022, I found that one (or more) of the GEOS-CF chemical files was downloaded incompletely (i.e., the file causing the error was 2 MB, which is only about one-fortieth the size of a typical GEOS-CF chemical file). Post-download checking or detection of incomplete files would be helpful for users who want to download a large set of files from the internet.

I will replace the problematic file with a newly downloaded copy. Could you change the write permission of the input/geos directory in the team project folder, @kyle-messier?

Considerations

I suggest two approaches.
- One is to use file hashes (e.g., MD5 checksums), which some data sources provide. If that information is retrievable from JSON metadata or HTTP response headers, we could quickly verify the downloaded files against it.
- The other is to leverage summary statistics of the downloaded files, which assumes the network is reliable enough that most files were downloaded properly. The fs package includes many handy functions for summarizing files in tibbles. In this case, we could compare each file's size with the typical size, or with a statistic computed over all downloaded files, to flag files that are probably corrupted or incomplete.
  - A challenge remains with this approach when file sizes are so heterogeneous that file-size statistics are of no use (e.g., MODIS tiles differ drastically in size depending on the number of effective data cells or NA/NaN values, unlike full space-time grids in modeling products such as GEOS-CF and NARR).
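
The two approaches above could be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the function names `verify_md5` and `flag_small_files` and the `min_ratio = 0.5` threshold are assumptions made for the example, and base R's `file.size()` is used in place of fs to keep the sketch dependency-free.

```r
# Hypothetical helpers; names and the 0.5 threshold are illustrative assumptions.

# Approach 1: compare local MD5 checksums with checksums published by the source.
verify_md5 <- function(paths, expected_md5) {
  stopifnot(length(paths) == length(expected_md5))
  actual <- unname(tools::md5sum(paths))
  # md5sum() returns NA for missing or unreadable files; count that as a failure.
  ok <- !is.na(actual) & tolower(actual) == tolower(expected_md5)
  data.frame(path = paths, md5_ok = ok)
}

# Approach 2: flag files that are much smaller than the median size of the batch.
flag_small_files <- function(paths, min_ratio = 0.5) {
  sizes <- file.size(paths)
  ratio <- sizes / stats::median(sizes, na.rm = TRUE)
  data.frame(
    path = paths,
    size = sizes,
    # Missing files (NA size) are flagged as well.
    suspect = is.na(ratio) | ratio < min_ratio
  )
}
```

The median-based check only makes sense for products with near-constant file sizes; for heterogeneous products such as MODIS tiles it would likely raise false alarms.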

mitchellmanware commented 1 month ago

Thanks for bringing this up @sigmafelix. Creating a file size check function, following the first suggested approach, would be relatively simple with the httr::GET and file.size functions.

> head(u)
[1] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0030z.nc4"
[2] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0130z.nc4"
[3] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0230z.nc4"
[4] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0330z.nc4"
[5] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0430z.nc4"
[6] "https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0530z.nc4"
> download.file(
+   u[1],
+   "/Users/manwareme/Desktop/geos_example.nc4"
+ )
trying URL 'https://portal.nccs.nasa.gov/datashare/gmao/geos-cf/v1/ana/Y2023/M09/D01/GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20230901_0030z.nc4'
Content type 'application/octet-stream' length 7418695 bytes (7.1 MB)
==================================================
downloaded 7.1 MB

> file.size("/Users/manwareme/Desktop/geos_example.nc4")
[1] 7418695
> httr::GET(u[1])$headers$`content-length`
[1] "7418695"
> (file.size("/Users/manwareme/Desktop/geos_example.nc4")
+   == as.numeric(httr::GET(u[1])$headers$`content-length`))
[1] TRUE
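
The comparison in the transcript above could be wrapped into a reusable check along these lines. This is only a sketch: `remote_size` and `check_download_sizes` are hypothetical names, and it assumes the server answers `HEAD` requests with an accurate `Content-Length` header for the uncompressed file. `httr::HEAD` is used instead of `httr::GET` so that only headers are requested rather than the full file body.

```r
# Hypothetical helper names; assumes the server answers HEAD requests
# with an accurate Content-Length header.

# Ask the server for the expected size without downloading the body.
remote_size <- function(url) {
  resp <- httr::HEAD(url)
  len <- httr::headers(resp)[["content-length"]]
  if (is.null(len)) NA_real_ else as.numeric(len)
}

# Compare local file sizes with the sizes reported by the server.
# `size_fun` can be swapped out, e.g. for a cached lookup of known sizes.
check_download_sizes <- function(paths, urls, size_fun = remote_size) {
  stopifnot(length(paths) == length(urls))
  expected <- vapply(urls, size_fun, numeric(1), USE.NAMES = FALSE)
  actual <- file.size(paths)
  data.frame(
    path = paths,
    expected = expected,
    actual = actual,
    # A file counts as complete only when both sizes are known and equal.
    complete = !is.na(expected) & !is.na(actual) & expected == actual
  )
}
```

Rows with `complete == FALSE` would be candidates for re-downloading.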

My immediate concern with this approach is its performance at scale. Retrieving the size with httr::GET is quick for a single URL, but it slows substantially even with a relatively small number of URLs (n = 24, equivalent to one day's worth of GEOS-CF data).

> microbenchmark(
+   httr::GET(u[1]),
+   lapply(u, httr::GET),
+   times = 5
+ )
Unit: milliseconds
                 expr       min       lq       mean    median        uq        max neval
      httr::GET(u[1])  119.9206  124.586   364.5255  209.3135  461.9255   906.8822     5
 lapply(u, httr::GET) 3672.5812 3858.853 10505.7787 4478.3872 5308.7727 35210.2997     5
> 3672.5812/119.9206 # min relative performance
[1] 30.62511
> 35210.2997/906.8822 # max relative performance
[1] 38.82566
> 10505.7787/364.5255 # mean relative performance
[1] 28.82042

mitchellmanware commented 1 month ago

There are potential performance benefits from using httr2 functions, but I have still only tested with 24 files. I will do some more comparisons between httr and httr2 functions.

> httr2_requester <- function(url) {
+   httr2::request(url) |> httr2::req_perform()
+ }
> microbenchmark(
+   lapply(u, httr2_requester),
+   lapply(u, httr::GET),
+   times = 5
+ )
Unit: seconds
                       expr      min       lq     mean   median       uq       max neval
 lapply(u, httr2_requester) 4.667334 4.713385 5.121353 4.919260  5.50979  5.796994     5
       lapply(u, httr::GET) 4.002419 5.042487 7.500990 6.052272 10.32270 12.085070     5
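
Both benchmarks above issue full requests one after another, so each call also transfers the file body. One option that might reduce the cost is to send `HEAD` requests concurrently. The sketch below assumes httr2 >= 1.0.0 (for `req_perform_parallel`) and a server that supports `HEAD`; the function names are hypothetical and the approach has not been benchmarked against the results above.

```r
# A sketch, assuming httr2 >= 1.0.0 and a server that supports HEAD.
# Function names are hypothetical.

# Build one HEAD request per URL (headers only, no file body).
build_head_requests <- function(urls) {
  lapply(urls, function(u) httr2::req_method(httr2::request(u), "HEAD"))
}

# Convert a Content-Length header value (string or NULL) to a number.
parse_content_length <- function(x) {
  if (is.null(x) || is.na(x) || !nzchar(x)) NA_real_ else as.numeric(x)
}

# Perform the requests concurrently and return expected sizes in bytes.
remote_sizes_parallel <- function(urls) {
  resps <- httr2::req_perform_parallel(
    build_head_requests(urls),
    on_error = "continue"  # keep going; failed requests are not responses
  )
  vapply(resps, function(r) {
    # Anything that is not a response (e.g. an error object) yields NA.
    if (!inherits(r, "httr2_response")) return(NA_real_)
    parse_content_length(httr2::resp_header(r, "content-length"))
  }, numeric(1))
}
```

The returned vector can then be compared with `file.size()` of the local files, with `NA` entries treated as "size unknown" rather than as failures.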

sigmafelix commented 1 month ago

@mitchellmanware Thank you for sharing the possible solutions. Checking the exit status of wget at the shell-script level could be another option (cf. https://stackoverflow.com/questions/2717303/check-wgets-return-value)
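
If the download is launched from R rather than a standalone shell script, the same idea could look roughly like this. It is a sketch under assumptions: GNU wget must be on the PATH, `wget_with_status` is a hypothetical name, and the `run` argument exists only so the wrapper can be exercised without network access. wget exits with 0 on success and a non-zero code otherwise (for example, 4 for a network failure and 8 when the server returns an error response).

```r
# Hypothetical wrapper; assumes GNU wget is available on the PATH.
# `run` defaults to system2(), which returns the command's exit status.
wget_with_status <- function(url, destfile, run = system2) {
  # -q: quiet output; -O: write the download to `destfile`.
  status <- run("wget", c("-q", "-O", shQuote(destfile), shQuote(url)))
  list(
    url = url,
    destfile = destfile,
    status = status,
    # Exit status 0 means wget reported no problems.
    ok = identical(as.integer(status), 0L)
  )
}
```

A non-zero status marks the file for re-download. A zero status only means wget reported no error, so it may still be worth pairing this with a size or checksum check.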

mitchellmanware commented 1 week ago

@sigmafelix

Was this addressed in the most recent PR? If not, I will include it in the next round of manuscript-related changes.

sigmafelix commented 1 week ago

@mitchellmanware It has not been addressed yet. I think we could proceed with the manuscript without this functionality and add it in the next version of the package.

NIEHS / amadeus

Detecting corrupted or incomplete downloads #81
