Ocean-Data-Lab / ooipy

Python library and demo code for processing and visualization of data from the Ocean Observatories Initiative (OOI)
https://ooipy.readthedocs.io/en/latest/
MIT License

multithreading possibly causing problems with gapless merge #181

Open jdduprey opened 3 weeks ago

jdduprey commented 3 weeks ago

I believe this line returns 5-minute chunks in a more-or-less random order each time it's run:

st_list = __map_concurrency(__read_mseed, valid_data_url_list, verbose=verbose)

If I'm understanding this loop correctly, it relies on st_list and valid_data_url_list being in the same order, since the metadata of each st is set from the valid_data_url_list[k] filename.
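For illustration, here is a minimal sketch of how a thread-pool map can scramble result order, assuming __map_concurrency collects results with concurrent.futures.as_completed (the actual ooipy implementation may differ). An executor.map-based variant preserves input order:

from concurrent.futures import ThreadPoolExecutor, as_completed

def map_unordered(func, args, n_threads=16):
    # as_completed yields futures in completion order, which varies from
    # run to run, so results no longer line up with args
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        futures = [executor.submit(func, a) for a in args]
        return [f.result() for f in as_completed(futures)]

def map_ordered(func, args, n_threads=16):
    # executor.map preserves input order no matter which worker finishes
    # first, so results[k] always corresponds to args[k]
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        return list(executor.map(func, args))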

for k, st in enumerate(st_list):
    # check if there are multiple traces in the stream
    print(len(st))
    if len(st) == 1:
        continue

    # count total number of points in the stream
    npts_total = 0
    for tr in st:
        npts_total += tr.stats.npts

    # if npts is valid, merge traces without consideration of gaps
    if npts_total / sampling_rate in [
        300,
        299.999,
        300.001,
    ]:  # must be 5 minutes of samples
        # NOTE it appears that npts_total is nondeterministically off by ±64
        #   samples. I have no idea why, but am catching this here. Unknown
        #   what downstream effects this could have

        if verbose:
            print(st[0].stats)
            print(f"gapless merge for {valid_data_url_list[k]}")
        data = []
        for tr in st:
            data.append(tr.data)
        data_cat = np.concatenate(data)

        stats = dict(st[0].stats)
        stats["starttime"] = UTCDateTime(valid_data_url_list[k][-33:-6])
        stats["endtime"] = UTCDateTime(stats["starttime"] + timedelta(minutes=5))
        stats["npts"] = len(data_cat)

        st_list[k] = Stream(traces=Trace(data_cat, header=stats))

I'm not sure how obspy sorts traces under the hood, but could st_list being highly out of order also be increasing Stream.merge() times?
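If the ordering assumption is the culprit, one order-independent approach (a sketch only; read_mseed_with_url is a hypothetical wrapper, and I'm assuming __map_concurrency accepts any callable) is to carry the URL through the worker so each stream stays paired with its source file, and to sort by start time before merging:

def read_mseed_with_url(url):
    # hypothetical wrapper: return the URL alongside the stream so the
    # pairing survives any reordering by the thread pool
    return url, __read_mseed(url)

pairs = __map_concurrency(read_mseed_with_url, valid_data_url_list, verbose=verbose)
st_list = []
for url, st in pairs:
    # metadata for st can now be set from url regardless of completion order
    st_list.append(st)

# cheap insurance in case out-of-order streams slow down the final merge
st_list.sort(key=lambda st: st[0].stats.starttime)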

jdduprey commented 3 weeks ago

A quick fix could be to set the start time and end time of the concatenated trace to the starttime of st[0] and the endtime of st[-1] prior to the merge.
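As a rough sketch of that fix (assuming the traces in st are time-sorted; note that obspy derives a Trace's endtime from starttime, npts, and sampling_rate, so only starttime needs to be set explicitly):

stats = dict(st[0].stats)
stats["starttime"] = st[0].stats.starttime  # anchor to the data, not the filename
stats["npts"] = len(data_cat)
# endtime then falls out as starttime + (npts - 1) / sampling_rate, which
# matches st[-1].stats.endtime when the traces are contiguous
st_list[k] = Stream(traces=Trace(data_cat, header=stats))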