EarthScope / rover

ROVER: robust data access tool for FDSN data centers
https://earthscope.github.io/rover/

options for download organizations into files #115

Open mdenolle opened 5 years ago

mdenolle commented 5 years ago

Hi,

We are trying ROVER in the hope of collecting data on the order of 10s of TBs. Here are a few things we (users) would like to see:

Thanks!

nick-falco commented 5 years ago

Hi @mdenolle, what error are you receiving when trying to read the ASDF file produced by Rover using pyasdf? I just ran a very small data retrieval, and was able to read the asdf.h5 file that was created by Rover.

i.e.

| => rover retrieve IU_ANMO_00_BH1 2017-01-01 2017-01-04 --output-format=asdf --asdf-filename=asdf.h5
| => python
>>> from pyasdf import ASDFDataSet
>>> ds = ASDFDataSet("asdf.h5")
>>> print(ds)
ASDF file [format version: 1.0.2]: 'asdf.h5' (6.4 MB)
    Contains 0 event(s)
    Contains waveform data from 1 station(s).
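Once the file opens, the contents can be inspected further through the ds.waveforms accessor (continuing the session above; the tag name is an assumption here and may differ from what ROVER writes):

>>> ds.waveforms.list()                        # stations stored in the file
>>> ds.waveforms.IU_ANMO.get_waveform_tags()   # tags available for this station
>>> st = ds.waveforms.IU_ANMO.raw_recording    # ObsPy Stream, assuming a 'raw_recording' tag
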
timronan commented 5 years ago

Hello,

ROVER is designed to download data as station-day files. This resolution was chosen because it seems to optimize download and indexing speed while limiting file clutter. If we were to chunk the data into large miniSEED files (e.g. all requested time for one station in a single file), the returned file could become massive and effectively unusable. Furthermore, ROVER's indexing component might not work, because we would have to read a file of highly variable size into memory. This does not seem like a robust design.

We do try to provide tools to help with managing and exploring the repository. The command rover list-summary prints the retrieved data from the earliest to the latest timespans, and rover list-retrieve compares the local index with the requested data that is available remotely, then displays the difference.
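For example (a hypothetical invocation, using the same request syntax as retrieve; see the ROVER documentation for the exact output):

| => rover list-summary
| => rover list-retrieve IU_ANMO_10_HHZ 2012-01-01 2012-02-01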

mdenolle commented 5 years ago

Hello and thank you both for your responses.

To @timronan: we are a group that works with years of data and 500+ stations. For these large-N, large-T studies, having many small files is really impractical for I/O performance, which is why we had turned to ASDF. I think that leaving it up to the user to define how big a file can be would be a great addition to ROVER. Note that most laptops can read 1 GB of data into memory, and that a 100 Hz day in mseed is about 2.2 MB in the examples below, so our code will just reconcatenate the files anyway as postprocessing.
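For illustration, a minimal postprocessing sketch with ObsPy, assuming the station-day miniSEED files from ROVER sit somewhere under data/ (the glob pattern and output name are placeholders, not ROVER's actual layout):

import glob

from obspy import Stream, read

# Gather all station-day miniSEED files for one channel into a single Stream.
st = Stream()
for path in sorted(glob.glob("data/**/*HHZ*", recursive=True)):
    st += read(path)

# Stitch the day files back together, then split at real gaps so the
# result can still be written as miniSEED.
st.merge(method=1)
st = st.split()
st.write("IU.ANMO.10.HHZ.2012-01.mseed", format="MSEED")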

To @nick-falco: you are correct that the script works for me. I am trying to modify rover.config, but having the command-line options will enable better scripting to retrieve more files. However, I have seen two inconsistencies in the ASDF downloads. Sometimes it works:

(obspy) user@ubuntu:~/TEST_ROVER/data$ time rover retrieve IU_ANMO_10_HHZ 2012-01-01 2012-02-01 --output-format=asdf --asdf-filename=crap.h5
retrieve DEFAULT: ROVER version 1.0.4 - starting retrieve
retrieve DEFAULT: Status available at http://127.0.0.1:8000
retrieve DEFAULT: Trying new retrieval attempt 1 of 3.
retrieve DEFAULT: Downloading IU_ANMO 2012-001 (N_S 1/1; day 1/31)
retrieve DEFAULT: Downloading IU_ANMO 2012-002 (N_S 1/1; day 2/31)
retrieve DEFAULT: Downloading IU_ANMO 2012-003 (N_S 1/1; day 3/31)
retrieve DEFAULT: Downloading IU_ANMO 2012-004 (N_S 1/1; day 4/31)
retrieve DEFAULT: Downloading IU_ANMO 2012-005 (N_S 1/1; day 5/31)
download.36262 DEFAULT: Add 'IU.ANMO.10.HHZ | 2012-01-02T00:55:48.388391Z - 2012-01-02T00:58:50.158391Z | 100.0 Hz, 18178 samples' to ASDF.
download.36262 DEFAULT: Add 'IU.ANMO.10.HHZ | 2012-01-02T04:04:26.148393Z - 2012-01-02T04:08:09.958393Z | 100.0 Hz, 22382 samples' to ASDF.
download.36262 DEFAULT: Add 'IU.ANMO.10.HHZ | 2012-01-02T04:09:05.038393Z - 2012-01-02T04:12:07.398393Z | 100.0 Hz, 18237 samples' to ASDF.
download.36262 DEFAULT: Add 'IU.ANMO.10.HHZ | 2012-01-02T11:47:41.178393Z - 2012-01-02T11:51:26.138393Z | 100.0 Hz, 22497 samples' to ASDF.
download.36262 DEFAULT: Add 'IU.ANMO.10.HHZ | 2012-01-02T14:42:23.058393Z - 2012-01-02T14:45:24.758393Z | 100.0 Hz, 18171 samples' to ASDF.
download.36262 DEFAULT: Add 'IU.ANMO.10.HHZ | 2012-01-02T21:09:06.848393Z - 2012-01-02T21:12:05.888393Z | 100.0 Hz, 17905 samples' to ASDF.
download.36262 DEFAULT: Add 'IU.ANMO.10.HHZ | 2012-01-02T21:18:10.788393Z - 2012-01-02T21:21:12.9m .....

Sometimes afterwards (and after cleaning out the h5 data file), in the same terminal, it does not:

(obspy) user@ubuntu:~/TEST_ROVER/data$ rm -rf data/*h5
(obspy) user@ubuntu:~/TEST_ROVER/data$ time rover retrieve IU_ANMO_10_HHZ 2012-01-01 2012-02-01 --output-format=asdf --asdf-filename=crap.h5
retrieve DEFAULT: ROVER version 1.0.4 - starting retrieve
retrieve DEFAULT: Status available at http://127.0.0.1:8000
retrieve DEFAULT: Trying new retrieval attempt 1 of 3.
retrieve DEFAULT: Retrieval attempt 1 of 3 is complete.
retrieve DEFAULT: The initial retrieval attempt resulted in no errors or data downloaded, will verify.
retrieve DEFAULT: Trying new retrieval attempt 2 of 3.
retrieve DEFAULT: Retrieval attempt 2 of 3 is complete.
retrieve DEFAULT: The final retrieval, attempt 2 of 3, made no downloads and had no errors, we are complete.
retrieve DEFAULT:
retrieve DEFAULT: ----- Retrieval Finished -----
retrieve DEFAULT:
retrieve DEFAULT:
retrieve DEFAULT: A ROVER retrieve task on ubuntu
retrieve DEFAULT: started 2019-07-25T11:22:39 (2019-07-25T15:22:39 UTC)
retrieve DEFAULT: has completed in 0.85 seconds
retrieve DEFAULT:
retrieve DEFAULT: The download for 0 stations totaled 0 bytes,
retrieve DEFAULT: with data covering 0 seconds.
retrieve DEFAULT:
retrieve DEFAULT: A total of 0 downloads were made, with 0 errors (0 on final pass of 2).
retrieve DEFAULT:
retrieve DEFAULT: Trying new metadata retrieval.
retrieve DEFAULT: Fetched 1 summary rows from tsindex_summary table.
retrieve ERROR: Skipping station 'IU.ANMO'. Found station 'IU.ANMO' in the tsindex_summary table but not in the ASDF container.
retrieve DEFAULT:
retrieve DEFAULT: ----- Metadata Retrieval Finished -----
retrieve DEFAULT:

real    0m3.360s
user    0m3.767s
sys     0m4.513s
(obspy) user@ubuntu:~/TEST_ROVER/data$

The second problem I see in ROVER is that downloading the same data is 10-12 times faster with mseed output than with ASDF. I suspect that the IRIS server packages the mseed better, since all of the traces (when the data is gappy) are downloaded in one single mseed file, whereas with ASDF I suspect each trace (between each gap) is downloaded separately. It would be more practical for users to have the ASDF file created on the IRIS end, followed by just one download. I am also not sure whether the ASDF file is closed and reopened for each trace download.

Voila, happy to discuss where my bugs are. I am a fan of ROVER; I am just going to be a very heavy user of it, and it seems okay to provide feedback. Cheers, Marine

chad-earthscope commented 5 years ago

Hi @mdenolle,

I think that leaving it up to the user to define how big a file can be would be a great addition to ROVER. Note that most laptops can read 1 GB of data into memory, and that a 100 Hz day in mseed is about 2.2 MB in the examples below, so our code will just reconcatenate the files anyway as postprocessing.

This is one of the main motivations for adding the ASDF output option. We do not anticipate adding any capability for alternate miniSEED file organizations to ROVER at this time; there is simply no generally "correct" answer for such organization, and arbitrary organization makes ROVER more complex (it is already quite complex!). Instead, we will be offering ASDF (and perhaps other HDF5 formats) and providing abstraction interfaces such as portable-fdsnws-dataselect and a direct read module in ObsPy (not yet released, but prepared here: https://github.com/obspy/obspy/pull/2206). For many users this means they do not need to consider the individual miniSEED files themselves.
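As a rough sketch of what that abstraction looks like from the user side, assuming (purely for illustration) that a portable-fdsnws-dataselect instance is serving the local repository on port 8080:

from obspy import UTCDateTime
from obspy.clients.fdsn import Client

# An ordinary FDSN client pointed at the local dataselect service instead of
# a remote data center; the miniSEED files behind it are never touched directly.
client = Client(base_url="http://localhost:8080")
st = client.get_waveforms("IU", "ANMO", "10", "HHZ",
                          UTCDateTime("2012-01-01"), UTCDateTime("2012-01-02"))
print(st)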

Sometimes afterwards (and after cleaning out the h5 data file), in the same terminal, it does not:

In this case ROVER worked as expected: you had already downloaded that data, so it did not need to download it again.

(obspy) user@ubuntu:~/TEST_ROVER/data$ rm -rf data/*h5
(obspy) user@ubuntu:~/TEST_ROVER/data$ time rover retrieve IU_ANMO_10_HHZ 2012-01-01 2012-02-01 --output-format=asdf --asdf-filename=crap.h5

The issue here is that you did not remove the data index, just the data store. In the data directory you will see a timeseries.sqlite file. This database contains an index of the downloaded data, whether in miniSEED or ASDF. This index is how ROVER knows what is in its repository, as scanning all the data files each time that information is needed would be wholly impractical. To properly remove data from a ROVER repository, one would need to remove both the index entries and the data files. There is no official way to remove data from a ROVER repository beyond starting over, as this does not seem to be a common use case; but it will come up eventually and we are considering a feature enhancement (#21).
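For reference, the index can be inspected directly with SQLite; a minimal read-only sketch, assuming the default data/timeseries.sqlite location and not assuming any particular column layout:

import sqlite3

# Open the ROVER index read-only and list its tables/views with row counts.
conn = sqlite3.connect("file:data/timeseries.sqlite?mode=ro", uri=True)
for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type IN ('table', 'view')"):
    rows = conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]
    print(f"{name}: {rows} rows")
conn.close()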

The second problem I see in ROVER is that downloading the same data is 10-12 times faster with mseed output than with ASDF. I suspect that the IRIS server packages the mseed better, since all of the traces (when the data is gappy) are downloaded in one single mseed file, whereas with ASDF I suspect each trace (between each gap) is downloaded separately. It would be more practical for users to have the ASDF file created on the IRIS end, followed by just one download. I am also not sure whether the ASDF file is closed and reopened for each trace download.

The ASDF output from ROVER adds an extra processing step compared to the normal workflow of collecting miniSEED from the DMC. There is no difference in transmission (download) or extraction from the data center between these modes. Also, for a number of reasons, creating the ASDF at the DMC is not a good option, mostly because it does not scale well. What your results highlight is a performance issue in converting the downloaded miniSEED to ASDF. This is an excellent target for us to investigate, and while we do not control the HDF5 and ASDF libraries, hopefully we can improve this part of ROVER.
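For reference, the pattern on the pyasdf side looks roughly like this; a sketch only, not ROVER's actual conversion code, and the file names and tag are assumptions:

from obspy import read
from pyasdf import ASDFDataSet

# A gappy day of data: one Stream containing many short traces.
st = read("IU.ANMO.10.HHZ.2012.002.mseed")

# Open the HDF5 container once and add the whole Stream in a single call,
# rather than reopening the file and adding trace by trace.
ds = ASDFDataSet("asdf.h5")
ds.add_waveforms(st, tag="raw_recording")
del ds  # flush and close the HDF5 file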

We greatly appreciate this feedback and encourage you to continue to file tickets for issues with ROVER.

chad-earthscope commented 5 years ago

Regarding the performance difference when building ASDF from gappy data, I've posted the issue to the pyasdf project: https://github.com/SeismicData/pyasdf/issues/57