stepholinger opened this issue 3 years ago
Just wanted to say I observed the same issue.
I have a list of a few thousand events, each with a QuakeML file and a MiniSEED waveform file, and I wanted to add all of the events and waveforms to a single H5 file in a loop. For further processing, I wanted each waveform tagged with an event_id, to easily extract the corresponding event for each waveform, so my code was something like:
```python
import obspy
import pyasdf

with pyasdf.ASDFDataSet(h5file, compression=None, mode='a') as ds:
    # Loop over event IDs
    for evid in evid_list:
        # Get the QuakeML file and MiniSEED file corresponding to evid
        ev_file = get_event_file(evid)
        wf_file = get_waveform_file(evid)
        # Read in the event with ObsPy
        event = obspy.read_events(ev_file)[0]
        # Add the ObsPy event to pyasdf
        ds.add_quakeml(event)
        # Add the waveforms, with event_id specified
        ds.add_waveforms(wf_file, tag="raw_recording", event_id=event)
```
Each iteration got increasingly slow; running this for thousands of events was intractable.
Hi Seth and @savardge,
For this problem, I think you will find the reason if you look at what add_quakeml is doing. The key part is here:
```python
old_cat = self.events
existing_resource_ids = set([_i.resource_id.id for _i in old_cat])
new_resource_ids = set([_i.resource_id.id for _i in cat])
intersection = existing_resource_ids.intersection(new_resource_ids)
if intersection:
    msg = (
        "Event id(s) %s already present in ASDF file. Adding "
        "events cancelled"
    )
    raise ValueError(msg % ", ".join(intersection))
old_cat.extend(cat)
self.events = old_cat
```
As you can see, pyasdf computes the intersection between the previously existing catalog and the new one, making sure there is no overlap between the two catalogs. The set operations themselves are not the problem, since Python sets are well optimized. To see which step in add_quakeml actually costs the most time, here is a test that adds 500 events one at a time:
```python
import time

import numpy as np
import obspy
from obspy.core.event import Catalog, Event, Origin

import pyasdf

num_it = 500
# Timing rows: 0 = event creation, 1 = catalog read, 2 = set building,
# 3 = intersection, 4 = catalog extend, 5 = catalog write.
t = np.zeros((6, num_it))
ds = pyasdf.ASDFDataSet("timing_test.h5")

# Add QuakeML one event at a time, for 500 events.
for it in range(num_it):
    timer = time.time()
    event = Event()
    event.event_type = "ice quake"
    origin = Origin()
    origin.time = obspy.UTCDateTime(100 + it, 1, 1)
    event.origins = [origin]
    t[0, it] = time.time() - timer
    # Below is basically what happens when you call add_quakeml.
    timer = time.time()
    old_cat = ds.events
    t[1, it] = time.time() - timer
    timer = time.time()
    existing_resource_ids = set([_i.resource_id.id for _i in old_cat])
    new_resource_ids = set([_i.resource_id.id for _i in Catalog([event])])
    t[2, it] = time.time() - timer
    timer = time.time()
    intersection = existing_resource_ids.intersection(new_resource_ids)
    t[3, it] = time.time() - timer
    timer = time.time()
    if intersection:
        msg = (
            "Event id(s) %s already present in ASDF file. Adding "
            "events cancelled"
        )
        raise ValueError(msg % ", ".join(intersection))
    old_cat.extend(Catalog([event]))
    t[4, it] = time.time() - timer
    timer = time.time()
    ds.events = old_cat
    t[5, it] = time.time() - timer
```
Here you can see that reading the catalog from, and writing it back to, the ASDF file is where most of the time is spent. Note that the y-axis of my timing plot is on a log scale, so the reading time soon explodes... By looking further at how pyasdf reads the catalog from the ASDF file, it seems that the obspy.read_events function is the ultimate reason for the growing cost: ObsPy parses the entire QuakeML document on every access.
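To illustrate that last point, here is a minimal sketch (not part of pyasdf; it just writes increasingly large QuakeML documents to an in-memory buffer and times the parse):

```python
# Sketch: QuakeML parse time grows with the number of events in the
# document, so re-reading the full catalog on every access gets slower
# as the catalog grows.
import io
import time

import obspy
from obspy.core.event import Catalog, Event

for n in (100, 500, 1000):
    buf = io.BytesIO()
    Catalog([Event() for _ in range(n)]).write(buf, format="QUAKEML")
    buf.seek(0)
    timer = time.time()
    obspy.read_events(buf)
    print(f"{n} events parsed in {time.time() - timer:.3f} s")
```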
I hope a better way to append new events is added at some point, such as storing each event in its own dataset. Until then, you may want to build the whole catalog first and save it to the file only once. Hope this helps!
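For example, a minimal sketch of that approach (the output file name and the list of QuakeML paths are assumptions):

```python
import obspy
from obspy.core.event import Catalog

import pyasdf

# Collect all events into a single Catalog first...
cat = Catalog()
for ev_file in quakeml_files:  # hypothetical list of QuakeML file paths
    cat.extend(obspy.read_events(ev_file))

# ...then write the whole catalog to the ASDF file exactly once.
with pyasdf.ASDFDataSet("detections.h5", mode="a") as ds:
    ds.add_quakeml(cat)
```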
Best, Yiyu
Also see this issue at https://github.com/obspy/obspy/issues/1910
I am using pyasdf to store catalogs of detected events, which can include thousands of ObsPy events and their waveforms.
When adding these objects to an ASDF dataset with ds.add_quakeml(), I found that adding events becomes increasingly slow as the number of events already in the dataset grows. I suspect this may be due to the check for redundancy between the new event and all the events already in the dataset.
I had initially designed a workflow that looped through a list of detections and, for each detection, added an ObsPy event and the waveforms associated with it via event_id. This makes intuitive sense, but it becomes very slow when adding thousands of waveforms and events. The slowdown appears to be circumvented by first loading the events into an ObsPy Catalog and adding the entire catalog to the ASDF dataset at once. This is a fine solution, but it should probably be mentioned in the documentation as the preferred method.
The attached snippet of code should simply demonstrate the issue.
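In its absence, here is a minimal sketch of the kind of snippet that demonstrates it (not the original attachment; the file name and event count are assumptions):

```python
# Sketch: each ds.add_quakeml() call gets slower as the catalog grows,
# because the full catalog is re-read and re-written every time.
import time

from obspy.core.event import Catalog, Event

import pyasdf

with pyasdf.ASDFDataSet("slowdown_demo.h5") as ds:
    for i in range(500):
        timer = time.time()
        ds.add_quakeml(Catalog([Event()]))
        if i % 100 == 0:
            print(f"event {i}: add_quakeml took {time.time() - timer:.3f} s")
```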