d-chambers opened this issue 1 year ago
Thanks @d-chambers, excellent feedback.
I will get back to you in more detail in the next few days (possibly a few weeks) as I explore potential solutions to the issues you highlighted with DAS and AI.
Regarding DAS, I have a few ideas for how the issues with 10k+ channels could be solved.
I am working on an AI project, so I will be able to test the format first-hand and refine the solution. In that case, I think a web service providing access to the backend data through a database would be more appropriate (I am working on that).
I will take a look at the DAS RCN and ObsPlus repos in more detail.
I have considered Zarr and TileDB. Both formats are great and well suited to distributed and parallel processing frameworks, but they are not as portable as HDF5/ASDF. Zarr and TileDB are especially suited to server backends that access and serve the data atomically. ASDF/HDF5 is great for neatly storing the data from a single event, encapsulating the catalog, inventory, and waveform data in one convenient standalone file.
Since the building blocks for producing Zarr and ASDF are the same (QuakeML, StationXML, and Streams), it would be simple to write plugins to read and write the data in multiple lossless formats. I have already experimented with reading and writing in a Zarr file structure and could easily integrate it into the uQuake library; a rough sketch of the idea is below.
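For illustration only, here is a minimal sketch of writing the waveform building block (an ObsPy `Stream`) into a Zarr group, assuming the zarr v2-style API. `stream_to_zarr` is a hypothetical helper name, and the attributes kept are just an assumed minimum for round-tripping, not a proposed schema:

```python
import numpy as np
import obspy
import zarr

def stream_to_zarr(stream: obspy.Stream, store_path: str) -> None:
    """Write each trace of an ObsPy Stream into its own Zarr array."""
    root = zarr.open_group(store_path, mode="w")
    for tr in stream:
        # One array per trace, keyed by the NET.STA.LOC.CHA id.
        arr = root.create_dataset(tr.id, data=tr.data.astype(np.float32))
        # Minimal metadata needed to reconstruct the Trace headers later.
        arr.attrs.update({
            "starttime": str(tr.stats.starttime),
            "sampling_rate": tr.stats.sampling_rate,
            "network": tr.stats.network,
            "station": tr.stats.station,
            "location": tr.stats.location,
            "channel": tr.stats.channel,
        })

stream_to_zarr(obspy.read(), "example.zarr")  # ObsPy's bundled example stream
```

The catalog and inventory blocks could be serialized alongside the arrays in the same store, which is what makes the plugin approach cheap.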
Hey @jeanphilippemercier,
Overall I think it’s a good proposal, and years of academic use of ASDF have vetted most aspects of the format already. At NIOSH we primarily use StationXML, QuakeML, and miniSEED archives, but adopting something like this would be a trivial adjustment for us. If microseismic systems supported this as an output, it would be very nice for humble researchers like me.
I haven’t actually seen much of the SeisProv parts used in the wild, and it looks like that standard was last updated 10 years ago, so I am not sure if it is still a thing.
DAS usage
I have been thinking a lot about DAS metadata lately, and I think only part of this format will be a good fit. HDF5 is well suited to storing DAS data; after all, it is just a grid that is long in the time axis. However, StationXML is not well suited to capturing DAS metadata (see, for example, the DAS RCN metadata repo). Part of the reason is that DAS-specific metadata such as cable type, coupling, etc. have to be tacked on. Another reason is that having an entry for each DAS channel can be inefficient, especially when you have 10k+ channels and all that changes between them is the coordinate; the sketch below makes this concrete.
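To illustrate the array-oriented alternative: a single coordinate dataset stored alongside the data block replaces 10k+ per-channel StationXML entries. Every file name, field name, and value below is an assumption for illustration, not part of any standard:

```python
import h5py
import numpy as np

# Hypothetical dimensions and geometry for illustration.
n_channels, n_samples = 10_000, 30_000
data = np.zeros((n_channels, n_samples), dtype=np.float32)
distances = np.arange(n_channels) * 1.02  # assumed channel spacing in meters

with h5py.File("das_example.h5", "w") as f:
    dset = f.create_dataset("strain_rate", data=data, chunks=(128, n_samples))
    # One coordinate array instead of one metadata entry per channel.
    f.create_dataset("channel_distance_m", data=distances)
    # DAS-specific fields that StationXML has no natural slot for.
    dset.attrs["sampling_rate_hz"] = 1000.0
    dset.attrs["gauge_length_m"] = 10.0
    dset.attrs["cable_type"] = "single-mode fiber"
```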
Also, at least in academic circles, I am seeing a move away from HDF5 towards more cloud-friendly storage for large DAS datasets based on Zarr, TileDB, etc. (e.g., this article). Whether that is driven by trendiness or real necessity remains to be seen.
Optimization for ML workloads
A strength and a weakness of QuakeML is that it is a tree structure. This provides flexibility, but it means the format is not optimized for array-like processing. For example, extracting a dataframe of phase picks can be a bit of a challenge. We spent quite a bit of time on this in obsplus' dataframe extractors, and it resulted in recursive tree-traversal code that is quite complex. I would highly recommend adding convenience functions like these so folks can easily (and hopefully efficiently enough) extract tables for ML workflows and put tables back into the ASDF file, which would presumably be decomposed back into branches. A sketch of what such a function might look like follows.
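As a minimal, non-recursive sketch of such a convenience function: `picks_to_dataframe` is a hypothetical name, and only a handful of assumed columns are shown (the general case of arbitrary nested attributes is what forces the recursive extractors in obsplus):

```python
import pandas as pd
from obspy import read_events

def picks_to_dataframe(catalog) -> pd.DataFrame:
    """Flatten the QuakeML event tree into one row per pick."""
    rows = []
    for event in catalog:
        for pick in event.picks:
            wid = pick.waveform_id
            rows.append({
                "event_id": str(event.resource_id),
                "pick_id": str(pick.resource_id),
                "time": pick.time.datetime if pick.time else None,
                "phase_hint": pick.phase_hint,
                "network": wid.network_code if wid else None,
                "station": wid.station_code if wid else None,
            })
    return pd.DataFrame(rows)

df = picks_to_dataframe(read_events())  # ObsPy's bundled example catalog
```

Going the other direction (table back into the event tree) is the harder half, since the rows have to be reattached to the right branches.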
Food for thought.