SeismicData / pyasdf

Python Interface to ASDF based on ObsPy
http://seismicdata.github.io/pyasdf/
BSD 3-Clause "New" or "Revised" License

Increasingly slow to add events to ASDF dataset #72

Open stepholinger opened 3 years ago

stepholinger commented 3 years ago

I am using pyasdf to store catalogs of detected events, which can include thousands of ObsPy events and waveforms.

When adding these objects to an ASDF dataset with ds.add_quakeml(), I found that adding an event becomes increasingly slow as the number of events already in the dataset grows. I suspect this may be due to checking for redundancy between the new event and all the events already in the dataset.

I had initially designed a workflow that looped through a list of detections and, for each detection, added an ObsPy event and the waveforms associated with it via the event_id. This makes some intuitive sense, but it becomes very slow when adding thousands of waveforms and events. The problem can be circumvented by first loading the events into an ObsPy Catalog and adding the entire catalog to the ASDF dataset in a single call. This is a fine solution, but it should probably be mentioned in the documentation as the preferred method.

The attached snippet of code should simply demonstrate the issue.

import obspy
from obspy.core.event import Event
from obspy.core.event import Catalog
from obspy.core.event import Origin
import time
import pyasdf
import matplotlib.pyplot as plt

# open ASDF dataset
ds = pyasdf.ASDFDataSet("test_dataset.h5")

# set up run
num_it = 100
times = []

# add each event individually
for i in range(num_it):

    # start timer
    timer = time.time()

    # make obspy event
    event = Event()
    event.event_type = "ice quake"
    origin = Origin()
    origin.time = obspy.UTCDateTime(2000+i,1,1)
    event.origins = [origin]

    # add event to ASDF dataset
    ds.add_quakeml(event)

    # stop timer
    runtime = time.time() - timer
    times.append(runtime)

# plot results of adding individual events
plt.plot(range(num_it), times)
ax = plt.gca()
ax.set_ylabel("Time to add event (seconds)")
ax.set_xlabel("Number of events in dataset")
plt.show()
print("Total time to add " + str(num_it) + " events individually: " + str(sum(times)) + " seconds")

# put in an obspy catalog first
event_list = []

# start timer
timer = time.time()

for i in range(num_it):

    # make obspy event
    event = Event()
    event.event_type = "ice quake"
    origin = Origin()
    origin.time = obspy.UTCDateTime(2000+i,1,1)
    event.origins = [origin]

    # add event to list
    event_list.append(event)

catalog = Catalog(event_list)
ds.add_quakeml(catalog)

# stop timer
runtime = time.time() - timer

# print result of adding catalog
print("Total time to add " + str(num_it) + " events in catalog: " + str(runtime) + " seconds")

[Figure: time to add each event vs. number of events already in the dataset; the per-event time increases steadily as the dataset grows]

savardge commented 3 years ago

Just wanted to say I observed the same issue.

I have a list of a few thousand events, each with a QuakeML file and a MiniSEED waveform file, and I wanted to add all events and waveforms to a single H5 file in a loop. For further processing, I wanted each waveform to carry an event_id tag so I could easily extract the corresponding event for each waveform, so my code was something like:

with pyasdf.ASDFDataSet(h5file, compression=None, mode='a') as ds:        
    # Loop over event IDs
    for evid in evid_list:

        # get QuakeML file and mseed file corresponding to evid
        ev_file = get_event_file(evid)
        wf_file = get_waveform_file(evid)

        # Read in event with Obspy
        event = obspy.read_events(ev_file)[0]

        # Add Obspy event to pyasdf
        ds.add_quakeml(event)

        # Add waveforms, with event_id specified
        ds.add_waveforms(wf_file, tag="raw_recording", event_id=event)

And each iteration was increasingly slow. It was intractable to run this for thousands of events.
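
One way to restructure this so that add_quakeml is only called once is sketched below (using the same h5file, evid_list, and the hypothetical get_event_file / get_waveform_file helpers from the snippet above); the waveforms can still be tagged afterwards, since add_waveforms accepts an event_id:

import obspy
import pyasdf
from obspy.core.event import Catalog

with pyasdf.ASDFDataSet(h5file, compression=None, mode='a') as ds:
    # Read every QuakeML file into one ObsPy Catalog first
    catalog = Catalog()
    events = {}
    for evid in evid_list:
        event = obspy.read_events(get_event_file(evid))[0]
        catalog.append(event)
        events[evid] = event

    # One call: the duplicate check and the QuakeML (de)serialization
    # happen only once instead of once per event
    ds.add_quakeml(catalog)

    # Waveforms are still associated with their events one by one
    for evid in evid_list:
        ds.add_waveforms(get_waveform_file(evid), tag="raw_recording",
                         event_id=events[evid])

This moves the expensive catalog handling into a single add_quakeml call; whether the remaining add_waveforms loop stays fast is easy to check with the same timing approach as above.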

niyiyu commented 3 years ago

Hi Seth and @savardge,

For this problem, I think you will find the reason if you look at what add_quakeml is doing. The key part is here:

        old_cat = self.events
        existing_resource_ids = set([_i.resource_id.id for _i in old_cat])
        new_resource_ids = set([_i.resource_id.id for _i in cat])
        intersection = existing_resource_ids.intersection(new_resource_ids)
        if intersection:
            msg = (
                "Event id(s) %s already present in ASDF file. Adding "
                "events cancelled"
            )
            raise ValueError(msg % ", ".join(intersection))
        old_cat.extend(cat)

        self.events = old_cat

As you can see, pyasdf computes the intersection between the existing catalog and the new one to make sure there is no overlap between the two catalogs. The set operations themselves are not a problem, since Python sets are well optimized. If we look deeper at which step in add_quakeml costs the most time, here is a test that adds 500 events one by one:

# Add quakeml one at a time, for 500 events.
import time

import numpy as np
import obspy
import pyasdf
from obspy.core.event import Catalog, Event, Origin

ds = pyasdf.ASDFDataSet("test_dataset.h5")
num_it = 500
t = np.zeros((6, num_it))  # per-step timings for the six stages below

for it in range(num_it):

    timer = time.time()
    event = Event()
    event.event_type = "ice quake"
    origin = Origin()
    origin.time = obspy.UTCDateTime(100 + 1, 1, 1)
    event.origins = [origin]
    t[0, it] = time.time() - timer

    # Below is basically what's happening when you use add_quakeml.
    timer = time.time()
    old_cat = ds.events
    t[1, it] = time.time() - timer

    timer = time.time()
    existing_resource_ids = set([_i.resource_id.id for _i in old_cat])
    new_resource_ids = set([_i.resource_id.id for _i in Catalog([event])])
    t[2, it] = time.time() - timer

    timer = time.time()
    intersection = existing_resource_ids.intersection(new_resource_ids)
    t[3, it] = time.time() - timer

    timer = time.time()
    if intersection:
        msg = (
            "Event id(s) %s already present in ASDF file. Adding "
            "events cancelled"
        )
        raise ValueError(msg % ", ".join(intersection))
    old_cat.extend(Catalog([event]))
    t[4, it] = time.time() - timer

    timer = time.time()
    ds.events = old_cat
    t[5, it] = time.time() - timer

[Figure: per-step timings over the 500 iterations on a log-scale y-axis; reading the existing catalog (ds.events) and writing it back dominate and grow with the number of events]

Here you can see that reading/writing the catalog from/to pyasdf is where most of the time is spent. Note that the y-axis is log-scaled, so the reading time will soon explode... Looking further at how pyasdf reads the catalog from ASDF here, the obspy.read_events function seems to be the ultimate reason for the increasing time: ObsPy parses the entire stored QuakeML document on every access, so each added event makes every subsequent read slower, and adding events one at a time ends up scaling roughly quadratically.

I hope there will eventually be a better way to append new events, e.g. storing each event in its own dataset. Until then, you may want to save the entire catalog in a single call. Hope this helps!
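
For the case where the dataset already contains events and you need to append a larger batch later, here is a minimal sketch of the "save the whole catalog once" idea, spelled out to show that the expensive read and write each happen only once per batch (new_events is a stand-in for whatever list of ObsPy Event objects you have assembled):

import pyasdf
from obspy.core.event import Catalog

# new_events: a list of ObsPy Event objects assembled beforehand
with pyasdf.ASDFDataSet("test_dataset.h5") as ds:
    # Parse the existing catalog out of the file only once
    catalog = ds.events

    # Same duplicate check as add_quakeml, done once for the whole batch
    existing_ids = {e.resource_id.id for e in catalog}
    if any(e.resource_id.id in existing_ids for e in new_events):
        raise ValueError("Some events are already present in the ASDF file.")

    # Extend in memory, then serialize back to the file only once
    catalog.extend(new_events)
    ds.events = catalog

In practice this is equivalent to a single ds.add_quakeml(Catalog(new_events)) call; the point is simply that the parse/serialize cost is paid once per batch instead of once per event.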

Best, Yiyu

niyiyu commented 3 years ago

Also see this issue at https://github.com/obspy/obspy/issues/1910