afsc-assessments / afscdata

An R package for data extraction of AFSC survey and fishery data
https://afsc-assessments.github.io/afscdata/

timestamp datapulls #32

Closed mkapur-noaa closed 10 months ago

mkapur-noaa commented 1 year ago

I think it'd be useful to add a timestamp to the csv filenames when they're written in queries.R, since folks might re-run a query function and not realize they're overwriting an older version of the data. The timestamp could be an argument passed to every instance where vroom::vroom_write() is called. Something like:

timestmp <- gsub(":", "_", Sys.time())
vroom::vroom_write(bts_specimen_data,  # placeholder for the data frame being written
                   here::here(year, "data", "raw",
                              paste0(timestmp, "-bts_specimen_data.csv")),
                   delim = ",")
BenWilliams-NOAA commented 1 year ago

@mkapur-noaa Hmmm - the issue I see with that is that you then have non-static filenames, so all downstream functions need to uniquely identify their inputs. Could add an 'add_date' switch to the data functions, and downstream functions could take an 'alt' file input. I'd prefer not to have variably named files be the default. That seems a workable way to achieve both - thoughts?
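For illustration, the 'add_date' switch described above could look something like the sketch below. The wrapper name write_raw and the argument names are hypothetical, not part of afscdata; the idea is just that filenames stay static unless the caller opts in.

```r
# Hypothetical wrapper around vroom::vroom_write() with an optional
# add_date switch; the default keeps filenames static so downstream
# functions can rely on a fixed name.
write_raw <- function(x, year, name, add_date = FALSE) {
  stem <- if (add_date) {
    # prefix the file with an ISO date so copies sort chronologically
    paste0(format(Sys.Date(), "%Y-%m-%d"), "-", name)
  } else {
    name
  }
  vroom::vroom_write(x, here::here(year, "data", "raw", stem), delim = ",")
}
```

A downstream function could then accept an 'alt' argument that defaults to the static name but can be pointed at a dated copy when needed.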

Though I've also done this (and maybe it works better?) by setting everything up in a different folder (i.e., each model is self-contained in a single folder) - which gets to the question: is it better to have all the data stored in one data folder, or in individual model folders?

mkapur commented 1 year ago

Yeah, I hear what you're saying. Using something like grep() in downstream functions would work, but it gets kinda clunky if it returns more than one match.

WRT individual data folders, I think it's preferable to have the raw data stored once in one place (i.e. with timestamp, if relevant). In theory the .dat file in whatever model folder will act as a record of what data was actually used for that run, and there would be some sort of annotated dataprep script indicating what was done to the raw data to get there. All the more helpful if that script is able to reference a certain timestamped file.

IIRC the NWFSC version gets around this by simply writing a timestamped folder (which is dynamically named) each time the pull function is called. Then the downstream functions are only looking in there. This would avoid needing to use a grep and parse 1 or more returns for each data name and would prevent accidental overwrites. What do you think?
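The timestamped-folder pattern described above could be sketched roughly as follows. The function names pull_dir and latest_pull are illustrative, not anything from the NWFSC or AFSC packages; the point is that each pull gets its own directory and downstream code just reads from the most recent one.

```r
# Create a uniquely named directory for this data pull; ISO-style
# timestamps mean the directory names sort chronologically.
pull_dir <- function(year) {
  stamp <- format(Sys.time(), "%Y-%m-%d_%H%M%S")
  dir <- here::here(year, "data", "raw", stamp)
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  dir
}

# Downstream functions only ever look in the most recent pull directory,
# so filenames inside it can stay static and nothing gets overwritten.
latest_pull <- function(year) {
  dirs <- list.dirs(here::here(year, "data", "raw"), recursive = FALSE)
  sort(dirs, decreasing = TRUE)[1]
}
```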

BenWilliams-NOAA commented 1 year ago

Oh, also: there is a timestamp currently implemented - it is the last call of the data query, q_date(year).

This function in utils.R places a .txt file in the data/raw folder with the query date. Does this address what you are looking for?
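Based on the description above, a minimal version of such a query-date logger might look like this. This is a guess at the shape of the function (the real implementation is in utils.R), and the output filename here is made up for illustration.

```r
# Sketch of a query-date logger: record when the data were pulled
# in a small text file alongside the raw data.
q_date <- function(year) {
  writeLines(paste("Data were queried on:", Sys.Date()),
             here::here(year, "data", "raw", "query_date.txt"))
}
```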

mkapur commented 1 year ago

Oh that's cool - I don't currently have access to my fed laptop (long story), so I can't confirm whether that .txt was generated by default last time I ran goa_pop. I'd say so long as some record is auto-generated each time the data are pulled, that should be sufficient.

BenWilliams-NOAA commented 10 months ago

I think this is addressed; will reopen if not.