ROBelgium / MSNoise

A Python Package for Monitoring Seismic Velocity Changes using Ambient Seismic Noise | http://www.msnoise.org
European Union Public License 1.1

Stack overwrites previous data? (plus other issues) #371

Open asyates opened 1 month ago

asyates commented 1 month ago

Taking a look at the stacking code, it seems like previously stacked data would always be overwritten by new days (e.g. with flag='T'), since overwrite is set to True by default.

xr_save_ccf(sta1, sta2, components, filterid, mov_stack, taxis, xx, overwrite=True)

I also ran a quick test to confirm this is happening. Is this intended behaviour? I.e., if you were running in real-time, the stacks directory would always contain stacked CCFs for the most recent data only.

Just thinking that perhaps the idea is to continue through and process dv/v, updating the output files (assuming these are not overwritten but just have the new results inserted) and not worry about keeping old stacked data (since it's all contained within the cross-correlation directory). But then the plotting functions should also use the cross-correlation directory, not the stacked data.
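
A possible alternative to overwriting, sketched with plain xarray (shapes and names are illustrative, not the actual xr_save_ccf internals): merge the newly stacked dates into the previously saved ones, so old stacks survive.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical sketch (not the xr_save_ccf internals): merge newly stacked
# dates into the previously saved ones instead of overwriting the whole
# file. `old` holds two earlier days, `new` holds two freshly stacked days
# that partially overlap.
taxis = np.linspace(-10, 10, 5)
old = xr.DataArray(np.ones((2, 5)),
                   coords={"times": pd.date_range("2024-01-01", periods=2),
                           "taxis": taxis},
                   dims=("times", "taxis"))
new = xr.DataArray(np.full((2, 5), 2.0),
                   coords={"times": pd.date_range("2024-01-02", periods=2),
                           "taxis": taxis},
                   dims=("times", "taxis"))

# combine_first keeps old dates, appends new ones, and lets the freshly
# computed values win on overlapping dates.
merged = new.combine_first(old)
print(merged.sizes["times"])  # 3 dates survive instead of 2
```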

asyates commented 1 month ago

On a similar note re. subdaily stacking and real-time processing: if I imagine pulling in one day of data at a time and stacking subdaily, e.g. 12 h stacks with a 1 h sampling rate, then based on the current implementation, should I expect the first 11 stacks to be constructed from less data? E.g. the first stack from only one hour, the second from only two hours, etc.

I see some old comment/code from MSNoise 1.6 (below) suggesting that we should be pulling the updated days minus mov_stack, but I don't see this in the current get_results_all function. It appears to me, just from the code, that it reads only the .h5 files for individual days where flag='T', not any previous days (so it would not stack properly). Am I wrong?

                    # TODO: load only the updated dates +- max(mov_stack)+1
                    # Note: this would no longer be needed if the stack is h5
                    nstack, stack_total = get_results(
                        db, sta1, sta2, filterid, components, datelist, format=format, params=params)
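
The partial-window behaviour described above is the pandas rolling default; a small illustration (toy data, not MSNoise code) shows how min_periods controls whether the first stacks are built from fewer samples.

```python
import numpy as np
import pandas as pd

# Illustration (not MSNoise code): a 12-sample rolling mean over hourly
# data. With a time-based window the first values are built from 1, 2, ...
# samples; setting min_periods equal to the window length forces NaN until
# a full 12 hours of data is available.
s = pd.Series(np.arange(24.0),
              index=pd.date_range("2024-01-01", periods=24, freq="h"))

partial = s.rolling("12h").mean()                 # first stacks use fewer hours
full_only = s.rolling(12, min_periods=12).mean()  # NaN until 12 samples exist

print(partial.iloc[0])         # 0.0 -> built from a single hour
print(full_only.isna().sum())  # 11 leading NaNs
```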
asyates commented 1 month ago

Also, the current implementation presumably assumes that new CCF jobs are adjacent in time, but that won't always be the case (it often is, but if someone got access to more data for different time periods... these would all go into the same pandas dataset prior to the rolling average).

I guess simply resampling to corr_duration and filling with NaN prior to applying the rolling mean (stacking) would do the trick.
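
That resample-then-roll idea can be sketched with pandas (a hypothetical corr_duration of 1800 s and toy values; not the MSNoise implementation):

```python
import pandas as pd

# Hypothetical sketch of the gap handling described above: windows that are
# not adjacent in time are put on a regular corr_duration grid (here 1800 s,
# i.e. "30min") so gaps become NaN before the rolling mean, instead of
# being stacked as if they were contiguous.
idx = pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:30",
                      "2024-01-01 02:00"])  # a 1.5 h gap
ccf = pd.Series([1.0, 2.0, 3.0], index=idx)

regular = ccf.resample("30min").asfreq()  # gap times become NaN
stacked = regular.rolling(2, min_periods=2).mean().dropna()
print(stacked)  # only the 00:30 stack survives; nothing is averaged across the gap
```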

ThomasLecocq commented 1 month ago

Re NaN + mov_stack: the question is also whether we re-mask after rolling? I would say yes, we don't want to create data.
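
The re-masking after rolling could look like this sketch (toy data, not MSNoise code): apply the rolling mean, then put NaN back wherever there was no actual CCF.

```python
import numpy as np
import pandas as pd

# Sketch of the re-masking idea: after the rolling mean, drop any time that
# had no actual CCF, so the stack does not invent data at gap times.
idx = pd.date_range("2024-01-01", periods=6, freq="h")
ccf = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, 6.0], index=idx)

rolled = ccf.rolling(3, min_periods=1, center=True).mean()
remasked = rolled.where(ccf.notna())  # re-apply the original gap mask

print(remasked.isna().sum())  # gaps stay gaps after stacking
```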

asyates commented 1 month ago

Yeah, I think that's a good idea.

I was similarly thinking, to fix the issue of not pulling in prior data for stacking, that we could pull in additional CCFs via get_results_all (by modifying the datetime.datetime list input), and similarly remove the additional 'past' days post-roll.
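
That padding-and-trimming idea, as a rough pandas sketch (pad=2 stands in for max(mov_stack); not the MSNoise implementation):

```python
import pandas as pd

# Rough sketch of the padding idea: load `pad` extra days of history before
# the requested window, apply the rolling stack, then drop the padded
# 'past' days again so only the requested dates remain.
pad = 2
days = pd.date_range("2024-02-03", "2024-02-06")
loaded = pd.Series(1.0, index=pd.date_range(days[0] - pd.Timedelta(days=pad),
                                            days[-1]))

rolled = loaded.rolling(3, min_periods=3).mean()
result = rolled.loc[days[0]:]  # trim the padded days post-roll

print(len(result), int(result.isna().sum()))  # 4 kept stacks, all full windows
```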

asyates commented 1 month ago

In fact, it's okay re. rolling with gaps: I hadn't noticed a resample line already exists, so it's already filling with NaN and then dropping NaN afterwards. So it's just the issue of using past data that remains to fix.

ThomasLecocq commented 1 month ago

Hehe ok! And reading "enough" data to allow the rolling stats!

asyates commented 1 month ago

This seems very problematic also:

                c = get_results_all(db, sta1, sta2, filterid, components, days, format="xarray")
                # print(c)
                # dr = xr_save_ccf(sta1, sta2, components, filterid, 1, taxis, c)
                dr = c
                if stype == "ref":
                    start, end, datelist = build_ref_datelist(db)
                    start = np.array(start, dtype=np.datetime64)
                    end = np.array(end, dtype=np.datetime64)
                    _ = dr.where(dr.times >= start, drop=True)
                    _ = _.where(_.times <= end, drop=True)
                    # TODO add other stack methods here! using apply?
                    _ = _.mean(dim="times")
                    xr_save_ref(sta1, sta2, components, filterid, taxis, _)
                    continue

Looks to me like the reference will be built only from recently processed CCFs. A quick check using a print statement seems to confirm this (printing '_' to see if it contains only newly processed data).
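
The fix implied here, building the REF over the full reference date range rather than only the freshly processed days, might look like this sketch (the load step is faked with an in-memory DataArray; in MSNoise the data would come from the stored per-day CCFs, and start/end from build_ref_datelist):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Sketch: select the full reference period from ALL stored CCFs, then
# average over time, instead of averaging only the recently processed days.
taxis = np.linspace(-10, 10, 5)
all_times = pd.date_range("2024-01-01", periods=10)
dr = xr.DataArray(np.ones((10, 5)),
                  coords={"times": all_times, "taxis": taxis},
                  dims=("times", "taxis"))

start = np.datetime64("2024-01-01")  # assumed build_ref_datelist output
end = np.datetime64("2024-01-10")
ref = dr.where((dr.times >= start) & (dr.times <= end),
               drop=True).mean(dim="times")
print(dict(ref.sizes))  # one REF trace over the lag-time axis
```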

ThomasLecocq commented 1 month ago

Very bad idea indeed :-) As said, the stack2 stuff was written with "process the whole archive" in mind, which is a bad idea. Actually, I thought of moving the ref stack out of the "jobs" world: if you run stack -r, you re-compute the REF, that's it, no need to check for jobs or anything else. That way, the STACK jobs will only be used for the -m or -s jobs. (To check: does -S have any meaning now that you can do the ('30d', '30d') mov_stack?)

asyates commented 1 month ago

Think that's a good idea; running the stack reset between the commands always felt a bit untidy. I can make the change and update the pull request.

Actually, I'd forgotten about the existence of the -s job ^^. I guess because it's easy to extract anyway from the rolling output; but I would also assume it's now redundant if the tuple mov_stack is working as intended.

ThomasLecocq commented 1 month ago

It'd be amazing if you could do all this indeed in that PR! Please make sure to adapt the documentation :) (multiple places, including the how-tos :) )