AustralianAntarcticDivision / raadtools

Tools for Synoptic Environmental Spatial Data
http://australianantarcticdivision.github.io/raadtools/

Keep track of files used by read* functions #55

Closed raymondben closed 2 years ago

raymondben commented 7 years ago

In light of the bowerbird discussion around reproducibility, it would be worth thinking about whether raadtools (or raadfiles) can keep track of the files that are read by a user through calls to read* functions. If we can do that, then we can build a minimal list of files needed to reproduce a given analysis. If raadtools can't do this, we can still build a list of files needed, it just won't be a minimal list (it'll be all files in the directory associated with a given data source, not just the files from that data source that were actually used; in some cases the former might be a much larger set than the latter).

mdsumner commented 7 years ago

This is readily available in one sense, because each read function builds its own internal set of "files" that is filtered by normalizing the input date/s against the target set. The most useful way to record that is very unclear to me, but a package database for institutional installs could work: record the function call and the file set, and maybe a summary of the output object ...

There's a metadata() facility in raster, but some functions will return data frames and other types. It's also a different kind of behaviour to get the total file set, then filter on that, then pass those dates to the read function. Keeping a brick with multiple metadata slots in sync with its layers wouldn't be trivial, so it really matters how far this responsibility needs to go.

I have considered having an object for each "data set": it represents the current file set, and the user filters on that. Then the target files would be recorded more naturally, and the final user decisions just get "baked in" as a raster, a data frame, or whatever. This is more or less what the R6 blue-sky experiment is attempting, and that was surprisingly straightforward to do. It also matters what record the user really needs: the 20 times they did a test read, or the final read where they figured out what was going to work.

It would be nice if readsst(latest = TRUE) did get recorded somehow: what was the latest date, and what file did it hit. It's a requirement for "getting the latest" across multiple sources, because among ice, sst, currents and so on, one source will be tardier than the rest, and finding that date requires a bit of interaction.

I feel like this really needs a "phone-home" facility that's independent of bowerbird or raadtools: a way for functions to "register" their important artefacts that are otherwise transient. I think there are R packages that let functions register information to a global pool (?), but at that level it's a shared need with other projects.
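For what it's worth, a session-level version of that "register" idea can be sketched with nothing more than an environment (all names here are hypothetical, not an existing package or the raadtools API):

```r
## Hypothetical sketch: a session-level registry that read functions
## could append provenance records to.
.raad_registry <- new.env(parent = emptyenv())

register_read <- function(fun, files) {
  recs <- get0("records", envir = .raad_registry, ifnotfound = list())
  recs[[length(recs) + 1L]] <- list(fun = fun, files = files, time = Sys.time())
  assign("records", recs, envir = .raad_registry)
  invisible(NULL)
}

read_registry <- function() {
  get0("records", envir = .raad_registry, ifnotfound = list())
}
```

A read function would call `register_read("readsst", files$fullname[idx])` just before returning, and the user (or a report tool) would harvest `read_registry()` at the end of the session.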

raymondben commented 7 years ago

I think I'm viewing this from a slightly different perspective: as a user, I'd ideally like to do something analogous to:

ic <- readice(..., tell_me_which_files=TRUE)

and have it tell me which files are needed to fulfil that particular data query. Then I have the option of e.g. copying those data files off into a docker container or DOI'd data set so that my analysis can be reproduced by others.
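One lightweight way to get that without changing the return type of the read function (a hypothetical sketch, not the raadtools API; it assumes a raadfiles-style file table with `date` and `fullname` columns) is to resolve the candidate files first and attach them to the result as an attribute:

```r
## Hypothetical wrapper: resolve which files a query would hit,
## then attach them to the result for later provenance use.
readice_tracked <- function(date, ...) {
  files <- raadtools::icefiles()    # assumed: table with `date`, `fullname`
  idx   <- findInterval(as.POSIXct(date, tz = "UTC"), files$date)
  x     <- raadtools::readice(date, ...)
  attr(x, "provenance_files") <- unique(files$fullname[pmax(idx, 1L)])
  x
}

## ic <- readice_tracked("2017-07-01")
## attr(ic, "provenance_files")   # the files to copy into a docker/DOI bundle
```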

mdsumner commented 7 years ago

Will do!

mdsumner commented 7 years ago

A related thought, I'm revisiting the NCEP2 6-hourly winds (uwnd and vwnd), these bring up interesting challenges :)

Each year may have a different number of time slices, presumably because of leap years, late-start years and/or years yet to finish. (This is a general need that we have to handle anyway; it could be anything.)

```r
## 39 * 2 (uwnd, vwnd) NCEP2 files as at 2017-07-20
unique(unlist(lapply(files$fullname, function(x) tidync(x)$dimension$length[4])))
#> [1] 1460 1464  724   ## 365 * 4, 366 * 4 (leap), and a partial year
```

Previously, raadtools::windfiles did all this processing upfront. That was slow, because there were tens of thousands of slices to get date-times for (though it could have been optimized by offset/scale handling or caching).

Now, I think this makes sense: first, "raadfiles" provides just the file names, and then raadtools implements 'readwind' on top of that.
I think that makes sense, it means raadfiles is always fast, it just pulls the source file names out at a very atomic level from the in-memory cache. It allows implementations to build up a file set of interest from atomic functions rather than wading through a larger mess.
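A rough sketch of what that split could look like (all function names here are hypothetical, and the NCEP2 time units are assumed to be hours since 1800-01-01, which should be checked against the actual files):

```r
library(ncdf4)

## raadfiles level: fast, just file names pulled from the cached listing
## (get_file_cache is an assumed helper, not a real raadfiles function;
## it would return a table with a `fullname` column)
ncep2_uwnd_files <- function() {
  get_file_cache("ncep2/uwnd")
}

## raadtools level: expand per-slice date-times lazily, only for the
## files a query might actually need
slice_times <- function(fullname) {
  nc <- nc_open(fullname)
  on.exit(nc_close(nc))
  tm <- ncvar_get(nc, "time")
  ## assumed NCEP2 convention: hours since 1800-01-01 00:00:00 UTC
  as.POSIXct("1800-01-01", tz = "UTC") + tm * 3600
}
```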

(raadtools might cache the date-of-every-slice information independently)

Provenance-wise it puts the record back at 'raadfiles', but raises the issue of whether "file" is the right level if we care about this second-level provenance. :) But this is already better overall I think.

mdsumner commented 7 years ago

I'm mostly thinking out loud here. This docker-repro idea is really powerful! I'll keep it in the mix as a requirement. I think it doesn't matter where the file-filter action happens, as long as it is predictably separable from the read- function (I really don't want that "returnfiles" argument behaviour, so we stay type-safe). It would be nice if "readsst(dates, ...)" used a lower-level tool set that returned the relevant files, but it means we have to keep the file-time-interval and the slice-in-file-interval independent.

It means having a proper toolkit for "resolve(query-times, baseline-times)", which .processFiles was the early hack at, so that either that file set gets de-duplicated or the file-filter and slice-filter are completely independent. I think it means we need to get serious about the slice intervals, which might not be straightforward, but it's actually just the same old path vs. vertex-pair-segment thing that's been haunting me.
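The core of that resolve step can be sketched in a few lines with base R's findInterval (a minimal illustration only; real file-sets would also need handling of slice durations and out-of-range queries):

```r
## Minimal sketch of resolve(query_times, baseline_times): map each
## query time to the index of the baseline interval (file start or
## slice start) that contains it; queries before the first baseline
## time get NA.
resolve <- function(query_times, baseline_times) {
  stopifnot(!is.unsorted(baseline_times))
  i <- findInterval(query_times, baseline_times)
  i[i < 1L] <- NA_integer_
  i
}

resolve(c(2.5, 7), c(1, 3, 5, 9))
#> [1] 1 3
```

Running the same resolve once against file start-times and once against within-file slice-times is what keeps the file-filter and slice-filter independent.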

I'm worried about maybe another rogue-file-set scenario that I've forgotten about.

mdsumner commented 2 years ago

closing through lack of impetus, but linked to related issues