Speed-up object IO - Githubissues

eqcorrscan / EQcorrscan

Earthquake detection and analysis in Python.

https://eqcorrscan.readthedocs.io/en/latest/


Speed-up object IO #170

Open calum-chamberlain opened 6 years ago

calum-chamberlain commented 6 years ago

A major slow-down for the new object-oriented API is IO, mostly down to reading and writing large QuakeML files. Not only is this slow, but the way ObsPy implements it, the catalog is essentially re-created in memory before being dumped, which can take up a lot of memory. So: slow and memory-inefficient.

This issue is to remind us that we should test other catalog IO options for both memory and time efficiency. We should also think about what information we really need in events: could we use a more minimal event object internally, with just picks, an origin and a magnitude?
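As a rough sketch of what such a minimal event object might look like, the following uses only the standard library. The class names `MinimalPick` and `MinimalEvent` and all field names are hypothetical, chosen for illustration; they are not part of EQcorrscan or ObsPy.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class MinimalPick:
    """Hypothetical stripped-down pick: just enough to re-associate a phase."""
    station: str
    channel: str
    phase: str
    time: str  # ISO 8601 text, kept as a string so it serialises trivially


@dataclass
class MinimalEvent:
    """Hypothetical minimal event: picks, one origin and one magnitude."""
    event_id: str
    origin_time: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    depth_m: Optional[float] = None
    magnitude: Optional[float] = None
    picks: List[MinimalPick] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested pick dataclasses.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "MinimalEvent":
        raw = json.loads(text)
        picks = [MinimalPick(**p) for p in raw.pop("picks")]
        return cls(picks=picks, **raw)
```

An object like this drops everything QuakeML carries beyond picks, origin and magnitude (uncertainties, arrivals, comments), so it would only suit internal use where that detail is not needed.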

calum-chamberlain commented 6 years ago

I wrote a little test here - doesn't look great at the moment! Only QUAKEML supports full catalog IO, we need ObsPy 1.1.0 to try the Nordic IO, and we would have to loop over events for NLLOC_OBS. JSON doesn't support reading. QUAKEML is about 10x slower to read than it is to write. 👎

calum-chamberlain commented 6 years ago

It may be an option to write to multiple files at the same time... Because we are storing in a tar archive, the number of files doesn't matter too much...?
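A minimal sketch of the multiple-files-in-one-archive idea, using only the standard `tarfile` module. The function names, the member naming scheme and the one-item-per-line payload are assumptions made for illustration; real catalog chunks would be serialised in whatever catalog format is chosen.

```python
import io
import tarfile
from typing import List


def write_chunks(path: str, items: List[str], chunk_size: int) -> List[str]:
    """Write items into several members of one tar archive.

    Each member holds at most chunk_size items, one per line, so items
    must not contain newlines. Returns the member names in order.
    """
    names = []
    with tarfile.open(path, "w") as tar:
        for start in range(0, len(items), chunk_size):
            payload = "\n".join(items[start:start + chunk_size]).encode("utf-8")
            info = tarfile.TarInfo(name=f"catalog_{start // chunk_size:04d}.txt")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
            names.append(info.name)
    return names


def read_chunk(path: str, name: str) -> List[str]:
    """Read a single member back without loading the rest of the archive."""
    with tarfile.open(path, "r") as tar:
        member = tar.extractfile(name)
        return member.read().decode("utf-8").split("\n")
```

Because each member can be read on its own, only one chunk needs to be held in memory at a time.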

calum-chamberlain commented 6 years ago

So I think that the obspy.io.quakeml Unpickler could be heaps faster, which would be really nice... Currently it loops over each event in the etree for the QuakeML file, but I think lxml releases the GIL, so it should be possible to do this in parallel... Something to think about contributing to ObsPy?
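The shape of that per-event parallel loop might look something like the sketch below. It is not ObsPy's Unpickler: it uses the standard-library `xml.etree.ElementTree` rather than lxml so that it runs without extra dependencies, and the sample document is a simplified, un-namespaced stand-in for QuakeML. Whether threads give any real speed-up depends on the parser releasing the GIL, which this sketch does not measure.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Simplified stand-in for a QuakeML file; real QuakeML is namespaced
# and carries far more detail per event.
SAMPLE = """<catalog>
  <event id="ev1"><magnitude>2.1</magnitude></event>
  <event id="ev2"><magnitude>3.4</magnitude></event>
  <event id="ev3"><magnitude>1.7</magnitude></event>
</catalog>"""


def split_events(xml_text: str) -> List[bytes]:
    """Serialise each event element separately so workers share no tree."""
    root = ET.fromstring(xml_text)
    return [ET.tostring(element) for element in root.findall("event")]


def parse_event(blob: bytes) -> Dict[str, object]:
    """Parse one event; this stands in for the per-event unpickling work."""
    element = ET.fromstring(blob)
    return {
        "id": element.get("id"),
        "magnitude": float(element.findtext("magnitude")),
    }


def parse_parallel(xml_text: str, workers: int = 4) -> List[Dict[str, object]]:
    """Map parse_event over all events; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_event, split_events(xml_text)))
```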

calum-chamberlain commented 6 years ago

Opened a question issue here for this.

calum-chamberlain commented 6 years ago

Current status - QuakeML IO is slow for large catalogs - I have tried to play with it, but haven't worked out how to make it faster yet...

I think this issue should work on allowing multiple files to be written, and maybe allow the user to define the catalog format (Nordic would be faster, but wouldn't contain any of the detail that QuakeML does).

So:

[ ] Allow multiple files to be written and read (to conserve RAM, and possibly allow for parallel IO);
[ ] Allow different catalog formats, but default to QuakeML?
Work on making QuakeML IO faster in ObsPy.

d-chambers commented 6 years ago

hey @calum-chamberlain,

Food for thought:

I have a few projects at work where I consistently convert catalog objects to pandas dataframes. It makes it much easier to work with, and the support for various forms of serialization in pandas is phenomenal. I have certainly not regretted using them as the internal representation of event information, although I do believe QuakeML should always be a supported input/output (but it doesn't have to be the only one).
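To illustrate the general idea (this is not the code referred to in this thread), events can be flattened into one row per pick. The sketch below uses plain dicts and the standard `csv` module so it has no dependencies; the input layout and column names are assumptions. A list of dicts like the one returned by `events_to_rows` can be passed directly to `pandas.DataFrame`.

```python
import csv
import io
from typing import Dict, List

# Hypothetical column set: one row per pick, with event-level fields repeated.
COLUMNS = ["event_id", "magnitude", "station", "phase", "time"]


def events_to_rows(events: List[dict]) -> List[Dict[str, object]]:
    """Flatten nested event dicts into one flat row per pick."""
    rows = []
    for event in events:
        for pick in event.get("picks", []):
            rows.append({
                "event_id": event["event_id"],
                "magnitude": event.get("magnitude"),
                "station": pick["station"],
                "phase": pick["phase"],
                "time": pick["time"],
            })
    return rows


def rows_to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialise the flat rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Events without picks produce no rows in this layout, so event-level information for them would need a separate table.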

calum-chamberlain commented 6 years ago

You make an interesting point - I haven't played around with pandas much - do you have to put much work in to convert between catalog objects and dataframes?

d-chambers commented 6 years ago

A bit, but I am happy to share the code with you (I am really hoping to get past all the red tape to put the package that does this on github soon). I will email you a zip of it later today.