earthcubeprojects-chords / chords

EarthCube CHORDS application code
GNU General Public License v2.0

Integrating image data with CHORDS #420

Open ryangooch opened 6 years ago

ryangooch commented 6 years ago

Defining support for image data is one major component of CHORDS version 1.0, with the expectation that support will be added in version 1.1. This issue will serve to open up and document the discussion around how we can define and implement this functionality going forward.

Some topics of discussion:

**Weather Radar Data** In many cases, weather radar data is stored in NetCDF data stores, which are hierarchical datasets containing variables and metadata. The data themselves are stored as radar "moments" (such as Reflectivity, Specific Differential Phase, etc.), each an array of numerical data points. In a plan position indicator (PPI) scan, the elevation angle is fixed while the radar sweeps in azimuth, collecting many 1-D range radials. This produces a 2-D "image" for each radar moment, where each "pixel" corresponds to a range (distance from the radar) and an azimuth (angle). Each pixel is also representable in latitude and longitude; however, a latitude/longitude grid would have to be generated, since the CF spec (see below) does not require one in the NetCDF file. Additionally, a volume scan can contain further sweeps at other elevation angles, and the volume scan is usually one of the more fundamental "chunks" in which radar datasets are stored and communicated.

So in summary, there may be several 2-D scans for each radar moment, and there could be tens of radar moments of interest for each radar file. I think a good starting point, though, would be to pass individual moments at one scan elevation (or at one grid height), which essentially becomes one image. Something that may be useful for CHORDS, and for applications built on top of CHORDS, would be to map the polar radar data onto a Cartesian grid that CHORDS is aware of prior to data upload, so that it expects values at a previously set number of latitude/longitude grid points.
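To make that concrete, here is a minimal sketch of reading a CF-Radial volume and resampling one moment onto a fixed Cartesian grid. It assumes Py-ART is available; the filename, grid dimensions, and the "reflectivity" field name are placeholders, and the real grid parameters would be whatever CHORDS and the instrument agree on.

```python
# Minimal sketch: polar CF-Radial volume -> one 2-D Cartesian "image".
# Assumes Py-ART; filename, grid size/limits, and field name are placeholders.
import pyart

radar = pyart.io.read_cfradial("example_volume.nc")   # hypothetical CF-Radial file
sweep0 = radar.extract_sweeps([0])                     # lowest elevation angle only

# Resample the polar data onto a fixed Cartesian grid so CHORDS could expect
# values at a previously agreed set of grid points.
grid = pyart.map.grid_from_radars(
    (sweep0,),
    grid_shape=(1, 241, 241),                          # (z, y, x) grid points
    grid_limits=((500.0, 500.0),                       # single 500 m level
                 (-60000.0, 60000.0),                  # +/- 60 km in y (meters)
                 (-60000.0, 60000.0)),                 # +/- 60 km in x (meters)
    fields=["reflectivity"],
)

image = grid.fields["reflectivity"]["data"][0]         # the 2-D moment "image"
lons, lats = grid.get_point_longitude_latitude()       # matching lat/lon grid
```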

**Specifications** From the perspective of weather radar data, I think it would be good to use the CF-Radial specification as our standard, and to be able to upload scans in either NetCDF or JSON format.
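For the JSON side, purely as a strawman (not an agreed CHORDS payload; every key name here is hypothetical), a single gridded moment might be serialized along these lines:

```python
# Strawman only -- not an agreed CHORDS format; all key names are hypothetical.
import json

payload = {
    "instrument_id": 1,                       # CHORDS instrument the grid belongs to
    "time": "2017-05-08T17:14:14Z",           # scan time, ISO 8601
    "moment": "reflectivity",
    "units": "dBZ",
    "lat": [33.00, 33.01, 33.02],             # grid point latitudes (tiny grid for illustration)
    "lon": [-97.00, -96.99, -96.98],          # grid point longitudes
    "values": [                               # 2-D moment values, one row per latitude
        [12.5, 14.0, 13.2],
        [11.8, 15.6, 16.1],
        [10.2, 12.9, 14.7],
    ],
}

body = json.dumps(payload)                    # what would actually be uploaded
```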

**Summary** As stated above, this is really just to get the discussion going, so feel free to jump in and discuss any of the above points. Also, I'm happy to share data, scripts, etc. if any are desired.

ryangooch commented 6 years ago

I wanted to add a little information with some more concrete examples as well. For those who are unfamiliar with the data type, you can take a look at this Gist I created:

CHORDS_Weather_Radar_Intro.ipynb

This goes through some basic file formats and plots one potential image for the type of data we're interested in passing through CHORDS. I also wanted to say that this is a living document and I'll be adding to it over time, but this is a basic look to get you started.

I also wanted to give more concrete examples of the data stored in JSON format (CHL_20170508_171414_RHI_02.json.gz), as well as a look at a mock JSON-LD schema I've been working on for Linked Data integration (chill_radar_schema_definition). These last two files are specific to the CSU-CHILL radar, and both will differ for the DFW radars (I am working on getting the schemas defined for those and will share them ASAP).

As I progress on these I'll add more, but in the meantime, if there are any questions or thoughts on how to add this information, post them below and I'll be happy to discuss.

zastruga commented 6 years ago

I think when we talked about this the last time, we agreed that this was probably the end goal, but initially we would just be taking in actual images (JPEG, PNG, etc) with geospatial metadata. Was I mistaken on that?

ryangooch commented 6 years ago

That was Dr. Chandra's suggestion but I wasn't sure if it was the defined goal. If that is the case though, I can work on putting the data in that format.

I will say that I don't love the idea personally, for a couple of reasons. First, it encodes the native data into grayscale or RGB values, which are themselves limited by the colormap resolution. For example, if you look at an image of weather radar data like this,

radarexample (attached image: color-mapped weather radar display)

what you are seeing is radar data encoded with a color mapping for display. This image could be passed to CHORDS, and perhaps that is sufficient as an intermediate step toward the end goal, but I would argue that this data would not be useful for any analytical purpose: the RGB values correspond to a non-empirical binning of the data, lumping several dB together because it "looks good" rather than because the bins indicate anything scientifically meaningful.
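To make the information-loss point concrete, here is a tiny self-contained illustration; the 5 dB-per-color scale is just an assumption about a typical display colormap, not anything measured:

```python
# Illustration of how a display colormap discretizes continuous dBZ values.
# The 5 dB bin width is an assumption about a typical reflectivity color scale.
import numpy as np

dbz = np.array([17.3, 18.9, 21.0, 24.6])   # "native" reflectivity values (dBZ)

bins = np.arange(-10, 70, 5)               # hypothetical 5 dB-per-color display scale
color_index = np.digitize(dbz, bins)       # which color each pixel is assigned

# Decoding the image can only recover the bin centers, not the original values:
recovered = bins[color_index - 1] + 2.5
print(recovered)                            # [17.5 17.5 22.5 22.5]
# 17.3 and 18.9 dBZ have become indistinguishable after the round trip.
```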

But again, if the goal is simply to see an image of the data through CHORDS, this would be sufficient for now. And that would have some benefit for us, but it wouldn't be useful for analysis.

ryangooch commented 6 years ago

I came across another paper in my research for this project, titled "Conception and Implementation of an OGC-Compliant Sensor Observation Service for a Standardized Access to Raster Data", where they discuss how they transmit, visualize, and store weather radar data. We wouldn't need to incorporate all of their approach, as we are assuming the data is stored long-term elsewhere, but the methods this group uses for their real-time workflow seem to be quite good.

I also wanted to note that these standards and methods overlap with those in the Leadbetter et al. paper, which is another example of integrating sensor workflows with modern standards.

Summarizing:

The paper overall is worth a skim at least. There are snippets of standardized responses and references for various radar scans, as well as a graphic showing off the final product. They mention that the CF-convention is adhered to, which is something we are operating with as well.

heavy_precip (attached figure): heavy precipitation detection in their GUI, enabled by the data model, workflow, and API

I focused on architecture here but much of the paper focuses on how things are encoded to facilitate queries, handle responses, and other implementation details of interest.

MisterMartin commented 6 years ago

Here is an article that discusses and tests several database approaches for gridded data. It is oriented towards agricultural data, but the concepts are the same.

Erik Johnson suggested THREDDS, which could be a viable option. A Docker container for it already exists.

ryangooch commented 6 years ago

@MisterMartin, thanks for passing along this article. It offers some solid benchmarks for managing large amounts of data in NetCDF files, and it confirms some of the research I had done last week. One paragraph from the results section stuck out to me:

> NetCDF is the second best performing storage solution when using a traditional spindle drive when running the potato late blight model and the best of all cases running the model when using multiple SSD drives. NetCDF avoids the management overhead necessary for both the MongoDB and PostgreSQL servers. Unlike the MongoDB system the netCDF system does not require a process of converting text to in-memory Python arrays.

Keeping the files in NetCDF format makes sense in that we don't have to reinvent the wheel, and as I have discovered, other groups are working on using these files as the endpoint storage solution in scalable architectures. One such example is PySpark4Climate.

I have been playing around with a potential workflow involving xarray and Dask. xarray was built to offer analytics on N-dimensional arrays like those in NetCDF files, and Dask is used to facilitate parallel computing. I have been using these to see if files stored as NetCDF can be compared, indexed, and analyzed without bringing in the overhead of a Postgres or a MongoDB, and so far the results are intriguing. For example, I've been able to reproduce some QC analysis we do to adjust for biases in radar moments like Zdr, efficiently and easily, even when it involves hundreds of radar files and a few GB of data (this constraint being based on my local machine's capabilities). The graph attached here is the product of that, if you are interested.

san_jose_1hr_zdr_zh_cal (attached figure)
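A rough sketch of the kind of xarray + Dask pattern I mean is below; the file glob and the "ZDR" variable name are placeholders for whatever the gridded NetCDF files actually contain, and the real QC logic is more involved than a simple resampled mean.

```python
# Rough sketch of the xarray + Dask pattern described above.
# The file pattern and "ZDR" variable name are placeholders; real QC is more involved.
import xarray as xr

# Lazily open hundreds of NetCDF files as a single dataset; Dask chunking keeps
# memory use bounded and lets the computation run in parallel.
ds = xr.open_mfdataset("radar_grids/*.nc", combine="by_coords", chunks={"time": 10})

# Example "analysis": hourly-mean Zdr across the whole period, built lazily
zdr_hourly = ds["ZDR"].resample(time="1H").mean()

result = zdr_hourly.compute()   # triggers the parallel computation
```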

This is all to say that perhaps, with some "small" (I hope) programming effort, we can store these files in their native format and still get some very useful insights and straightforward programming benefits out of them.

As for THREDDS, I don't have much experience with it, but the above workflow would replace some of its functionality (I think), though it is interesting that THREDDS has a Docker container.

ryangooch commented 6 years ago

Here's what I have so far on specifications based on our conversations here and in conference calls. If there are discrepancies or mistakes, let me know.

- Image Data - "Two-dimensional gridded data"
- Data Storage format - NetCDF
- Data Upload format - NetCDF
- Standards - CF, CF/Radial, Linked Data, OGC SOS & SWE
- Visualization - Plots on Basemap, multiple sensors in a geospatial area

We also discussed BLOB objects, as above, as well as implementing methods for spatial and temporal indexing across multiple sensors.

ryangooch commented 5 years ago

Working towards version 1.0 and our requirement to have the Image Data Specifications in place at the point of release, I have added the above specifications to a Google Doc as a starting point for this discussion. I listed a few bullets covering the primary concerns for this issue, drawn from our past technical calls and the other discussions we have had off- and online.

So far I have focused on "Specification" rather than "Implementation", though the latter will of course inform the former. Take a look at it if you can: add comments, add specs, and make sure your application and its needs are represented.

zastruga commented 5 years ago

This page mentions two gems that may be of use for storing image data to disk with references to it in the database: Paperclip and CarrierWave.

https://www.pluralsight.com/guides/handling-file-upload-using-ruby-on-rails-5-api https://itnext.io/uploading-files-to-your-rails-api-6b293a4a5c90

daniels303 commented 5 years ago

According to this, Paperclip is being deprecated in favor of ActiveStorage.

zastruga commented 5 years ago

Looks like ActiveStorage requires Rails 5.2. I can't remember the reason right now, but I was unable to go all the way to 5.2 when I upgraded to Rails 5 and landed at 5.1. ActiveStorage does appear to allow using Amazon S3 for file storage, but that would also mean slightly more cost involved in storing and accessing image-type data.

ryangooch commented 5 years ago

I just discovered NOAA has quite a few real-time data products available. I was looking for documentation and papers, but it's probably worth a look: ESRL Scanning Radar Tool

That is one good example of what could be done, but they also have several "report-style" images for other instruments that are interesting. This paper used one of the products for a science use case in particular: Precipitation Hazard Prediction. Even this style of plot, something like "latest observations for collocated instruments", could be very valuable. For weather radar, we could generate a report consisting of six images, one for each major radar moment, and simply have the latest completed VCP to refer to.
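As a sketch of how that six-image report could be produced from a single volume with Py-ART (the field names are assumptions and differ between CHILL, CASA, and NEXRAD files; "latest_volume.nc" is a placeholder):

```python
# Sketch of a six-panel "radar report" for the latest completed volume scan.
# Field names are assumptions -- they differ between CHILL, CASA, and NEXRAD files.
import matplotlib.pyplot as plt
import pyart

radar = pyart.io.read("latest_volume.nc")    # placeholder path to the latest VCP
display = pyart.graph.RadarDisplay(radar)

moments = ["reflectivity", "velocity", "differential_reflectivity",
           "cross_correlation_ratio", "differential_phase", "spectrum_width"]

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, moment in zip(axes.flat, moments):
    display.plot(moment, sweep=0, ax=ax)     # one panel per major moment
    ax.set_aspect("equal")

fig.tight_layout()
fig.savefig("radar_report.png")              # single image that could be shown via CHORDS
```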

ryangooch commented 5 years ago

See here for an example "radar plot report" that I put together for a scan from one radar (Midlothian) in the CASA DFW network.

Example "report"