Hi Lili,

- Manipulating a catalog with a large number of events is extremely slow in pyasdf.

Accessing `dset.events` causes pyasdf to open the QuakeML file embedded in the ASDF file and parse it into an `obspy.Catalog` object. This is not cached, as that would require some painful bookkeeping to make sure it is still up to date, so the extraction and parsing are repeated each time you access `dset.events`. Just doing `cat = dset.events` and subsequently working with `cat` should make this a lot faster.
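For illustration, a minimal sketch of the difference (`example.h5` is a hypothetical ASDF file that already contains events):

```python
import pyasdf

ds = pyasdf.ASDFDataSet("example.h5")  # hypothetical file with events already added

# Slow: every access to ds.events re-reads and re-parses the embedded QuakeML.
ev = ds.events[0]
ev = ds.events[0]  # parses the whole catalog a second time

# Fast: parse once, then work with the in-memory obspy.Catalog.
cat = ds.events
ev = cat[0]
ev = cat[0]  # plain list indexing, no file access
```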
- Some information is lost if a catalog is stored in ASDF.

I cannot reproduce this - can you paste a short example that demonstrates it?
```
In [1]: import obspy

In [2]: import pyasdf

In [3]: cat = obspy.read_events()

In [4]: cat[0].preferred_origin()
Out[4]:
Origin
    resource_id: ResourceIdentifier(id="quakeml:eu.emsc/origin/rts/261020/782484")
    time: UTCDateTime(2012, 4, 4, 14, 21, 42, 300000) [uncertainty=0.0]
    longitude: 79.689
    latitude: 41.818
    depth: 1000.0 [uncertainty=0.0]
    depth_type: 'from location'
    method_id: ResourceIdentifier(id="smi:eu.emsc-csem/origin_method/NA")
    quality: OriginQuality(used_station_count=16, standard_error=0.0, azimuthal_gap=231.0, minimum_distance=2.45, maximum_distance=53.03)
    origin_uncertainty: OriginUncertainty(min_horizontal_uncertainty=0.0, max_horizontal_uncertainty=0.0, azimuth_max_horizontal_uncertainty=0.0)
    evaluation_mode: 'manual'
    creation_info: CreationInfo(agency_uri=ResourceIdentifier(id="smi:smi-registry/organization/EMSC"), author_uri=ResourceIdentifier(id="smi:smi-registry/organization/NNC"))

In [5]: cat[0].preferred_magnitude()
Out[5]:
Magnitude
    resource_id: ResourceIdentifier(id="quakeml:eu.emsc/NetworkMagnitude/rts/261020/782484/796646")
    mag: 4.4 [uncertainty=0.0]
    magnitude_type: 'mb'
    origin_id: ResourceIdentifier(id="quakeml:eu.emsc/origin/rts/261020/782484")
    station_count: 0
    creation_info: CreationInfo(agency_uri=ResourceIdentifier(id="smi:smi-registry/organization/EMSC"), author_uri=ResourceIdentifier(id="smi:smi-registry/organization/NNC"))

In [6]: ds = pyasdf.ASDFDataSet("test.e")

In [7]: ds.add_quakeml(cat)

In [8]: ds.events[0].preferred_origin()
Out[8]:
Origin
    resource_id: ResourceIdentifier(id="quakeml:eu.emsc/origin/rts/261020/782484")
    time: UTCDateTime(2012, 4, 4, 14, 21, 42, 300000) [uncertainty=0.0]
    longitude: 79.689
    latitude: 41.818
    depth: 1000.0 [uncertainty=0.0]
    depth_type: 'from location'
    method_id: ResourceIdentifier(id="smi:eu.emsc-csem/origin_method/NA")
    quality: OriginQuality(used_station_count=16, standard_error=0.0, azimuthal_gap=231.0, minimum_distance=2.45, maximum_distance=53.03)
    origin_uncertainty: OriginUncertainty(min_horizontal_uncertainty=0.0, max_horizontal_uncertainty=0.0, azimuth_max_horizontal_uncertainty=0.0)
    evaluation_mode: 'manual'
    creation_info: CreationInfo(agency_uri=ResourceIdentifier(id="smi:smi-registry/organization/EMSC"), author_uri=ResourceIdentifier(id="smi:smi-registry/organization/NNC"))

In [9]: ds.events[0].preferred_magnitude()
Out[9]:
Magnitude
    resource_id: ResourceIdentifier(id="quakeml:eu.emsc/NetworkMagnitude/rts/261020/782484/796646")
    mag: 4.4 [uncertainty=0.0]
    magnitude_type: 'mb'
    origin_id: ResourceIdentifier(id="quakeml:eu.emsc/origin/rts/261020/782484")
    station_count: 0
    creation_info: CreationInfo(agency_uri=ResourceIdentifier(id="smi:smi-registry/organization/EMSC"), author_uri=ResourceIdentifier(id="smi:smi-registry/organization/NNC"))
```
Hi Lion,

Thanks for the quick reply. I checked my code again, and this is what I actually did:
```python
cat = dset.events.copy()
print cat[0].preferred_origin()
```
I noticed that the preferred_origin is lost only when the catalog is huge. If you use the ASDF file from this link: https://github.com/NoisyLeon/NoisePy/blob/master/cat_inv_Alaska_2017_aug.h5 and try the following:
```python
import pyasdf

dset = pyasdf.ASDFDataSet('/pathto_cat_inv_Alaska_2017_aug.h5')
cat = dset.events.copy()
print cat[0].preferred_origin()
```
The output is `None`.

However, as you suggested, if I use `cat = dset.events` instead of `cat = dset.events.copy()`, the preferred_origin information can be retrieved. I think this solves my problem.
Thanks, Lili
Hi Lili,
The `.copy()` is actually not necessary, because all events will already have been read from the file when you call `.events`, and they now reside only in memory (plus, removing the `.copy()` call should make it twice as fast).
Large QuakeML files are slow to handle, but this is an ObsPy issue (https://github.com/obspy/obspy/issues/1910) and honestly very hard to fix, so I'm not sure we can do a lot about it. Also, 10000 events is really a LOT for QuakeML. If you only need some specific information from the QuakeML files and it is time-critical for your application, you could use lxml directly to extract it. Let me know if you want to discuss this.
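For instance, a minimal sketch of that route (not from this thread; the filename and the field extracted are assumptions), pulling origin times straight out of the QuakeML with lxml instead of building the full ObsPy object tree:

```python
from lxml import etree

# QuakeML 1.2 "basic event description" namespace.
NSMAP = {"q": "http://quakeml.org/xmlns/bed/1.2"}

tree = etree.parse("alaska_2017_aug.ml")  # filename assumed from the linked file

# Extract every origin time as plain text, without constructing Catalog objects.
times = tree.xpath("//q:origin/q:time/q:value/text()", namespaces=NSMAP)
print(len(times))
```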
I cannot reproduce the missing preferred origin. Are you still running ObsPy 1.0.3? If so, try updating to 1.1.0 and this particular issue should be fixed (https://github.com/obspy/obspy/pull/1644). But in any case, you don't need the `.copy()`, and then it should work.
Cheers!
Lion
Hi Lion, I updated ObsPy to 1.1.0 and the problem has disappeared.
Thanks! Lili
Great :)
Hi, I'm using pyasdf as the database for my data processing and have encountered two problems.
1. Manipulating a catalog with a large number of events is extremely slow in pyasdf. For example, the QuakeML file below contains more than 10000 events: https://github.com/NoisyLeon/DataRequest/blob/master/alaska_2017_aug.ml
We can add the catalog to ASDF as sketched below.
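A minimal sketch of that step (filenames assumed from the link above):

```python
import obspy
import pyasdf

# Read the QuakeML catalog (slow for >10000 events) and embed it in an ASDF file.
cat = obspy.read_events("alaska_2017_aug.ml")
dset = pyasdf.ASDFDataSet("alaska_2017_aug.h5")
dset.add_quakeml(cat)
```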
Then the catalog can be accessed through `dset.events`. It is extremely slow to manipulate this catalog; even a simple `print dset.events[0]` can take a long time.
However, in ObsPy this is not the case: reading the file (the `obspy.read_events` call above) is still slow, but once `cat` is in memory, manipulating it is much faster. You can compare the speed of `print cat[0]` and `print dset.events[0]`, for example as in the sketch below.
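A rough timing comparison, continuing the sketch above (absolute numbers will vary with the machine and file):

```python
import time

t0 = time.time()
print(dset.events[0])  # re-reads and re-parses the embedded QuakeML on each access
print("via dset.events: %.1f s" % (time.time() - t0))

t0 = time.time()
print(cat[0])          # in-memory Catalog access, effectively instant
print("via cat: %.4f s" % (time.time() - t0))
```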
2. Some information is lost if the catalog is stored in ASDF. For example, `preferred_origin()` and `preferred_magnitude()` are lost once the catalog is stored in ASDF.
Because of these two problems, I always have to store my event catalog in a separate QuakeML file instead of using the catalog from the ASDF file, which is very inconvenient.
Is there any way to improve these?
Thanks, Lili