SeismicData / pyasdf

Python Interface to ASDF based on ObsPy
http://seismicdata.github.io/pyasdf/
BSD 3-Clause "New" or "Revised" License

Two problems with the event catalog in pyasdf #42

Closed NoisyLeon closed 6 years ago

NoisyLeon commented 6 years ago

Hi, I'm using pyasdf as a database for my data processing and have encountered two problems.

1. Manipulating a catalog with a large number of events is extremely slow in pyasdf. For example, here is a QuakeML file with more than 10,000 events: https://github.com/NoisyLeon/DataRequest/blob/master/alaska_2017_aug.ml

We can add the catalog to an ASDF file:

import pyasdf
dset = pyasdf.ASDFDataSet('test001.h5')
dset.add_quakeml('/path_to_alaska_2017_aug.ml')

The catalog can then be accessed through dset.events. It is extremely slow to manipulate this catalog; even a simple 'print dset.events[0]' can take a long time.

However, this is not the case in ObsPy:

import obspy
cat = obspy.read_events('/path_to_alaska_2017_aug.ml')

Reading the file is still slow, but once cat is in memory, manipulating it is much faster.

You can compare the speed of 'print cat[0]' and 'print dset.events[0]'.
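For reference, here is a rough timing sketch of that comparison (the paths are placeholders, and the absolute numbers will vary by machine):

import time

import obspy
import pyasdf

dset = pyasdf.ASDFDataSet('test001.h5')
cat = obspy.read_events('/path_to_alaska_2017_aug.ml')

t0 = time.time()
dset.events[0]  # re-extracts and re-parses the whole QuakeML
print('pyasdf access: %.2f s' % (time.time() - t0))

t0 = time.time()
cat[0]          # plain indexing on the in-memory catalog
print('obspy access: %.6f s' % (time.time() - t0))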

2. Some information is lost if the catalog is stored in ASDF. For example, preferred_origin() and preferred_magnitude() are lost once the catalog is stored in ASDF.

Because of these two problems, I always have to store my event catalog in a separate QuakeML file instead of using the catalog from the ASDF file, which is very inconvenient.

Is there any way to improve these?

Thanks, Lili

krischer commented 6 years ago

Hi Lili,

  1. Manipulating a catalog with a large number of events is extremely slow in pyasdf.

Accessing dset.events causes pyasdf to open the QuakeML file embedded in the ASDF file and parse it into an obspy.Catalog object. This is not cached, as that would require some painful bookkeeping to make sure it is still up to date, so the extraction and parsing are repeated each time you access dset.events. Just doing

cat = dset.events

and subsequently working with cat should make this a lot faster.
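For example, a minimal sketch of that pattern (using the file name from your post):

import pyasdf

dset = pyasdf.ASDFDataSet('test001.h5')

# Slow: each attribute access re-extracts and re-parses the QuakeML.
# print(dset.events[0])

# Fast: parse once, then keep working with the in-memory obspy.Catalog.
cat = dset.events
print(cat[0])
print(cat[0].preferred_origin())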

  2. Some information is lost if the catalog is stored in ASDF.

I cannot reproduce this - can you paste a short example that demonstrates this?

In [1]: import obspy

In [2]: import pyasdf

In [3]: cat = obspy.read_events()

In [4]: cat[0].preferred_origin()
Out[4]:
Origin
            resource_id: ResourceIdentifier(id="quakeml:eu.emsc/origin/rts/261020/782484")
                   time: UTCDateTime(2012, 4, 4, 14, 21, 42, 300000) [uncertainty=0.0]
              longitude: 79.689
               latitude: 41.818
                  depth: 1000.0 [uncertainty=0.0]
             depth_type: 'from location'
              method_id: ResourceIdentifier(id="smi:eu.emsc-csem/origin_method/NA")
                quality: OriginQuality(used_station_count=16, standard_error=0.0, azimuthal_gap=231.0, minimum_distance=2.45, maximum_distance=53.03)
     origin_uncertainty: OriginUncertainty(min_horizontal_uncertainty=0.0, max_horizontal_uncertainty=0.0, azimuth_max_horizontal_uncertainty=0.0)
        evaluation_mode: 'manual'
          creation_info: CreationInfo(agency_uri=ResourceIdentifier(id="smi:smi-registry/organization/EMSC"), author_uri=ResourceIdentifier(id="smi:smi-registry/organization/NNC"))

In [5]: cat[0].preferred_magnitude()
Out[5]:
Magnitude
        resource_id: ResourceIdentifier(id="quakeml:eu.emsc/NetworkMagnitude/rts/261020/782484/796646")
                mag: 4.4 [uncertainty=0.0]
     magnitude_type: 'mb'
          origin_id: ResourceIdentifier(id="quakeml:eu.emsc/origin/rts/261020/782484")
      station_count: 0
      creation_info: CreationInfo(agency_uri=ResourceIdentifier(id="smi:smi-registry/organization/EMSC"), author_uri=ResourceIdentifier(id="smi:smi-registry/organization/NNC"))

In [6]: ds = pyasdf.ASDFDataSet("test.h5")

In [7]: ds.add_quakeml(cat)

In [8]: ds.events[0].preferred_origin()
Out[8]:
Origin
            resource_id: ResourceIdentifier(id="quakeml:eu.emsc/origin/rts/261020/782484")
                   time: UTCDateTime(2012, 4, 4, 14, 21, 42, 300000) [uncertainty=0.0]
              longitude: 79.689
               latitude: 41.818
                  depth: 1000.0 [uncertainty=0.0]
             depth_type: 'from location'
              method_id: ResourceIdentifier(id="smi:eu.emsc-csem/origin_method/NA")
                quality: OriginQuality(used_station_count=16, standard_error=0.0, azimuthal_gap=231.0, minimum_distance=2.45, maximum_distance=53.03)
     origin_uncertainty: OriginUncertainty(min_horizontal_uncertainty=0.0, max_horizontal_uncertainty=0.0, azimuth_max_horizontal_uncertainty=0.0)
        evaluation_mode: 'manual'
          creation_info: CreationInfo(agency_uri=ResourceIdentifier(id="smi:smi-registry/organization/EMSC"), author_uri=ResourceIdentifier(id="smi:smi-registry/organization/NNC"))

In [9]: ds.events[0].preferred_magnitude()
Out[9]:
Magnitude
        resource_id: ResourceIdentifier(id="quakeml:eu.emsc/NetworkMagnitude/rts/261020/782484/796646")
                mag: 4.4 [uncertainty=0.0]
     magnitude_type: 'mb'
          origin_id: ResourceIdentifier(id="quakeml:eu.emsc/origin/rts/261020/782484")
      station_count: 0
      creation_info: CreationInfo(agency_uri=ResourceIdentifier(id="smi:smi-registry/organization/EMSC"), author_uri=ResourceIdentifier(id="smi:smi-registry/organization/NNC"))

NoisyLeon commented 6 years ago

Hi Lion,

Thanks for the quick reply. I checked my code again, and this is what I actually did:

cat = dset.events.copy()
print cat[0].preferred_origin()

I noticed that the preferred_origin is lost only when the catalog is huge. If you use the ASDF file from this link: https://github.com/NoisyLeon/NoisePy/blob/master/cat_inv_Alaska_2017_aug.h5

And try the following:

import pyasdf
dset = pyasdf.ASDFDataSet('/pathto_cat_inv_Alaska_2017_aug.h5')
cat = dset.events.copy()
print cat[0].preferred_origin()

The output is 'None'. However, as you suggested, if I use "cat = dset.events" instead of "cat = dset.events.copy()", the preferred_origin information can be retrieved.

I think this easily solves my problem.

Thanks, Lili

krischer commented 6 years ago

Hi Lili,

the .copy() is actually not necessary, because all events will already have been read from the file when you call .events and then only reside in memory (plus, removing the .copy() call should make it about twice as fast).

Large QuakeML files are slow to handle, but this is an ObsPy issue (https://github.com/obspy/obspy/issues/1910) and honestly very hard to fix, so I'm not sure we can do a lot about it. Also, 10,000 events is really a LOT for QuakeML. If you only need some specific information from the QuakeML files and it's time-critical for your application, you could use lxml directly to extract those pieces. Let me know if you want to discuss this.
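For illustration, here is a minimal lxml sketch along those lines. The local file path and the QuakeML 1.2 BED namespace are assumptions, and note that an event can carry more than one origin element:

from lxml import etree

NS = '{http://quakeml.org/xmlns/bed/1.2}'

origins = []
# iterparse streams the file instead of building the whole tree at once.
for _, elem in etree.iterparse('alaska_2017_aug.ml', tag=NS + 'origin'):
    origins.append((
        elem.findtext(NS + 'time/' + NS + 'value'),
        elem.findtext(NS + 'latitude/' + NS + 'value'),
        elem.findtext(NS + 'longitude/' + NS + 'value'),
    ))
    elem.clear()  # free each element once we are done with it

print('%d origins parsed' % len(origins))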

I cannot reproduce the missing preferred origin. Are you still running ObsPy 1.0.3? If so, try updating to 1.1.0, which should fix this particular issue (https://github.com/obspy/obspy/pull/1644). But in any case, you don't need the .copy(), and then it should work.
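A quick sanity check, as a minimal sketch using ObsPy's bundled example catalog:

import obspy

print(obspy.__version__)  # should report 1.1.0 or later

cat = obspy.read_events()  # ObsPy's bundled example catalog
event = cat.copy()[0]

# preferred_origin() resolves the stored preferred_origin_id resource
# identifier; in ObsPy 1.0.3 this lookup could return None after a copy().
print(event.preferred_origin())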

Cheers!

Lion

NoisyLeon commented 6 years ago

Hi Lion,

I updated ObsPy to 1.1.0 and the problem disappeared.

Thanks! Lili

krischer commented 6 years ago

Great :)