ThomasLecocq / SeismoRMS

A simple Jupyter Notebook example for getting the RMS of a seismic signal (from PSDs)
European Union Public License 1.1

Qn about caching #21

Open lmoresi opened 4 years ago

lmoresi commented 4 years ago

Nice work Thomas (and others) ...

Here is something I noticed about how you cache the downloaded files.

In Step 3 of SeismoSD.ipynb you currently don't use cached data from "today", but I think this runs the risk that tomorrow you would consider this file to be just fine even if it holds the incomplete data from running the notebook in the middle of the day. One simple check might be to look at the creation date of the cached file, though that does not really check that the file is not corrupted.
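A minimal sketch of such a date check, assuming the file's modification time (`os.path.getmtime`) is an acceptable stand-in for its creation date and that each cached file covers one UTC day. The function name `cache_is_complete` and the one-hour `margin` are hypothetical and not part of the notebook:

```python
import os
from datetime import datetime, time, timedelta, timezone


def cache_is_complete(path, day, margin=timedelta(hours=1)):
    """Return True if `path` exists and was last written after `day` ended (UTC).

    `day` is a datetime.date. `margin` is an assumed allowance for data
    that arrives at the data centre slightly late.
    """
    if not os.path.isfile(path):
        return False
    # Midnight (UTC) at the end of the day the file is supposed to cover.
    day_end = datetime.combine(day + timedelta(days=1), time.min, tzinfo=timezone.utc)
    written = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    return written >= day_end + margin
```

A file written in the middle of the day it covers fails this test and would be fetched again on the next run. As noted above, this says nothing about whether the file is corrupted.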

The same argument applies to the downloaded data from Step 3 and presumably to the npz files in Step 4 too. There is no check to see whether the npz file is out of date compared to the mseed file, which would help, I think.
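One way to express the npz-versus-mseed check is a make-style comparison of modification times. This is only a sketch: `needs_reprocessing` is a hypothetical helper, and keeping the npz when the mseed file no longer exists is an assumption about the desired behaviour.

```python
import os


def needs_reprocessing(source_path, derived_path, force_reprocess=False):
    """Return True if the derived file (npz) should be rebuilt from the source (mseed)."""
    if force_reprocess or not os.path.isfile(derived_path):
        return True
    if not os.path.isfile(source_path):
        # Assumption: with no source file to compare against, keep the existing result.
        return False
    # Rebuild when the source was modified after the derived file was written.
    return os.path.getmtime(derived_path) < os.path.getmtime(source_path)
```

With this test, re-downloading an mseed file automatically invalidates the npz computed from the older copy.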

ThomasLecocq commented 4 years ago

Yep, just had the same issue... and manually deleted the files to be sure they would be reprocessed... The whole process was originally meant to be run once.

I'm ooooooopen for a solution (os.path.getmtime or else is ok for me)

lmoresi commented 4 years ago

OK - the reason I was looking at this was that I was trying to use github actions to automate running this each day and checking the plots back in - I wondered the best way to cache the data for that. Unclear … I’ll send a PR if I think of anything useful.

Prof Louis Moresi

ThomasLecocq commented 4 years ago

The download/backfill logic is interesting, but for a systematic, cron-style workflow I'd use MSNoise. In preparation for MSNoise 2.0 I already merged the PSD calculations. As soon as you "scan" an archive (however you fill that archive), MSNoise detects the new jobs to do and only processes those.

FMassin commented 4 years ago

I think using the notebook for this kind of thing is a complicated strategy. I would rather advise wrapping the fdsn.client interface into the module as a new alternative mode to --pqlx. We can also add an SDS interface, for the lucky ones who have direct access to data archive storage...

ThomasLecocq commented 4 years ago

sure thing... the idea was to provide a simple plotter for people.

ThomasLecocq commented 4 years ago

I mean, the elaborate way of handling massive datasets, without duplication from SDS archives etc., is already implemented in MSNoise. So the notebook's complexity shouldn't grow much further; that's not its goal.

lmoresi commented 4 years ago

Yes to all of the discussion - this is a dirty old hack !

I brought this up because of a small in-class project that uses GitHub Actions to build these plots for a single site automatically, on a daily schedule, and push them back to the repository so that they appear in the README (example: https://github.com/ANU-RSES-Education/SeismicNoise_AuSIS_UHS). This is for the Australian Seismometers in Schools, to give the students a chance to see what you guys are up to without needing to run the codes.

I don't have a bulletproof way to do this but it is similar to that requested in issue #23:

1) Make a change in step 3

# Days newer than this window may still be incomplete, so their cached files are not trusted.
safety_window = pd.Timedelta('2 days')
today = pd.to_datetime(UTCDateTime.now().date)

# ... existing code 

for day in pbar:
    datestr = day.strftime("%Y-%m-%d")
    fn  = "{}_{}_{}.mseed".format(dataset, datestr, nslc)
    fnz = "{}_{}_{}.npz".format(dataset, datestr, nslc)

    if (today-day > safety_window) and (os.path.isfile(fn) or (os.path.isfile(fnz) and not force_reprocess)):
        pbar.set_description("Using cache - %s" % fn)
        continue
    else:
        pbar.set_description("Fetching    - %s" % fn)
        try: 
           # etc 

2) A corresponding change in step 4

    for mseedid in list(set([tr.id for tr in stall])):
        fn_out = os.path.join("..","data","{}_{}_{}.npz".format(dataset, datestr, mseedid))
        if (today-day > safety_window) and (os.path.isfile(fn_out) and not force_reprocess):
            continue
        st = read(fn_in, sourcename=mseedid)

I can submit a PR if you would like me to.
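The cache decision used in both steps can be isolated as a small predicate. The sketch below mirrors the Step 4 condition but uses the standard library's `datetime` instead of pandas so that it runs on its own; `can_use_cache` and its parameters are illustrative names, not notebook code.

```python
from datetime import date, timedelta


def can_use_cache(day, today, file_exists, force_reprocess=False,
                  safety_window=timedelta(days=2)):
    """Return True if the cached file for `day` may be reused.

    Files for days inside the safety window are always refreshed,
    because they may have been written before the day was complete.
    """
    old_enough = (today - day) > safety_window  # strict: exactly 2 days old is still refreshed
    return old_enough and file_exists and not force_reprocess


# Example: a run on 2020-04-10 would refetch 8 to 10 April and reuse older files.
today = date(2020, 4, 10)
reusable = [d for d in (today - timedelta(days=n) for n in range(5))
            if can_use_cache(d, today, file_exists=True)]
```

Because the comparison is strict, a two-day window means the three most recent calendar days (including today) are fetched again on every run.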