Unidata / thredds

THREDDS Data Server v4.6
https://www.unidata.ucar.edu/software/tds/v4.6/index.html

Reliance on C libraries for NetCDF-4 #604

Open davidmoten opened 8 years ago

davidmoten commented 8 years ago

Use of the NetCDF-4 Java libraries is unfortunately limited at the moment by the dependency on installed C libraries.

This seems like a backward step for NetCDF. Are there any plans for full Java support (not to mention other languages)?

dblodgett-usgs commented 8 years ago

👍

cwardgar commented 8 years ago

NetCDF-4 is built atop the HDF5 data format, and we rely on the HDF5 native library that the HDF Group publishes in order to write NetCDF-4 files. We offer no pure-Java NetCDF-4 writer because the HDF Group offers no pure-Java HDF5 writer.

The HDF5 data format is incredibly complex. What's more, it's not really standardized, making the HDF Group's C implementation the only "correct" one. Any attempt at another implementation would have to copy it very closely.

Ultimately it comes down to resources. A pure-Java writer is a huge task, and given the current funding situation at Unidata, it is very unlikely that we will have the resources to do this. The only real possibility is if somebody comes up with half a million dollars (I just made that up, but it could be accurate) to fund the effort.

@davidmoten Would Docker or similar technology ease the deployment burden?
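
For what it's worth, Docker can reduce deployment of the native stack to building a single image. A minimal sketch of such a Dockerfile, assuming a Debian-based base image (the package name, library path, and `app.jar` are illustrative, not prescriptive):

```dockerfile
# Bundle the native netCDF-4 stack with a Java application so that
# deployment is one image rather than per-host library installs.
FROM openjdk:8-jre
RUN apt-get update && apt-get install -y --no-install-recommends \
        libnetcdf-dev \
    && rm -rf /var/lib/apt/lists/*
COPY app.jar /opt/app/app.jar
# netcdf-java locates libnetcdf through JNA; point JNA at the directory
# where the distro installed the shared library (path is an assumption).
CMD ["java", "-Djna.library.path=/usr/lib/x86_64-linux-gnu", \
     "-jar", "/opt/app/app.jar"]
```

This only helps where Linux containers are an option, which is exactly the limitation discussed below for Solaris targets.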

davidmoten commented 8 years ago

@cwardgar thanks for the response, that's interesting.

I can appreciate that the current design makes it a big task to support other platforms/languages.

HDF5 is clearly a complex yet powerful beast. Is the coupling of NetCDF-4 to HDF5 a strong one? I assume so.

Docker is a good idea when Linux is the target, and I would look to use it. My current targets include Solaris, and we scripted the deployment of netCDF-4 and its dependencies a while back.

My organisation (which performs Search and Rescue in Australia) is moving a lot of infrastructure to the cloud and looking to leverage highly available, fault-tolerant, scalable services. I'd like the community that exchanges data via NetCDF-4 to be aware that the current design precludes some cloud-based scalable development options, such as serverless architectures applied to data-processing pipelines.

In our case, the fact that the Australian Bureau of Meteorology publishes data in NetCDF-4 format means that our processing options are limited and ultimately cost us more (for development, deployment, update management, and runtime), even though the datasets themselves are not that large or complex (file by file). I might suggest alternative formats to BOM, but it would be nice to see these issues considered for NetCDF-5 (or 6, or 16, or whatever).

lesserwhirls commented 8 years ago

@davidmoten - how are you accessing the published data from BOM?

Let me also bring the lead netCDF-C developer into the conversation - @WardF, any thoughts regarding the cloud based scalable development options that @davidmoten mentions?

davidmoten commented 8 years ago

@lesserwhirls FTP download (a secure protocol would be nicer, but that's the current situation).

dopplershift commented 8 years ago

I'm curious to know what about netCDF4 precludes the use of a serverless architecture.

DennisHeimbigner commented 8 years ago

When I hear the term "serverless architecture", I interpret it to mean an event-based model where incoming events (typically requests) cause a small piece of code to execute to handle each specific event. Is this what you mean?

cwardgar commented 8 years ago

Serverless architecture makes me think of AWS Lambda (which does basically what Dennis described). You should be able to use libnetcdf there as long as you bundle the native library with your deployment: http://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html

randallwhitman commented 7 years ago

In our framework based on Apache Spark, we run JVM-only code on the Spark workers (no JNA/JNI), which is why we currently limit our NetCDF support to netCDF-3.
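
The netCDF-3 vs. netCDF-4 split is visible in the first few bytes of a file: classic netCDF-3 files start with `CDF` followed by a version byte, while netCDF-4 files carry the HDF5 signature. A small pure-Java sketch (`FormatSniffer` is a hypothetical helper for illustration, not part of netcdf-java) that classifies a file by its magic number:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/**
 * Sniffs a file's magic bytes to decide whether it is classic netCDF-3
 * (fully readable and writable in pure Java) or netCDF-4/HDF5 (whose
 * writing requires the native C library).
 */
public class FormatSniffer {

    // HDF5 (and therefore netCDF-4) files begin with this 8-byte signature.
    private static final byte[] HDF5_MAGIC =
            {(byte) 0x89, 'H', 'D', 'F', '\r', '\n', 0x1a, '\n'};

    public static String sniff(Path file) throws IOException {
        byte[] head = new byte[8];
        try (var in = Files.newInputStream(file)) {
            int n = in.readNBytes(head, 0, 8);
            // netCDF-3 classic: "CDF" + 0x01 (classic) or 0x02 (64-bit offset)
            if (n >= 4 && head[0] == 'C' && head[1] == 'D' && head[2] == 'F'
                    && (head[3] == 1 || head[3] == 2)) {
                return "netcdf-3";
            }
            if (n == 8 && Arrays.equals(head, HDF5_MAGIC)) {
                return "netcdf-4/hdf5";
            }
        }
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sample", ".nc");
        Files.write(tmp, new byte[]{'C', 'D', 'F', 1});
        System.out.println(sniff(tmp));  // prints "netcdf-3"
        Files.delete(tmp);
    }
}
```

A check like this is how a JVM-only pipeline can reject netCDF-4 inputs early instead of failing inside a reader.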

DennisHeimbigner commented 7 years ago

Let me interject a question. A number of years ago I created a netcdf-c library server: a program communicated with that server using remote procedure calls, so the use of the C code was isolated to that server. Would such an approach mitigate some of your concerns? [I hope what I am proposing is clear.]

randallwhitman commented 7 years ago

I'd expect a single-node NetCDF-C server to negate the parallelism of the distributed Spark workers, so we would not use one.

DennisHeimbigner commented 7 years ago

That will be (mostly) true even if you use a single JNA instance.

DennisHeimbigner commented 7 years ago

It occurs to me to add that a netcdf-c server can be as multi-threaded as any other server. The gotchas are that:

  1. access to the library operations must be serialized
  2. no two workers can read the same file (except under special circumstances)
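
Point 1 amounts to funnelling every library call through one global lock. A minimal sketch of that constraint (the `nativeOpen` stand-in is hypothetical; a real server would call the JNA binding to libnetcdf there):

```java
import java.util.concurrent.locks.ReentrantLock;

/**
 * Illustrates serializing access to a non-thread-safe native library:
 * the server may run many threads, but at most one is ever inside the
 * library at a time.
 */
public class SerializedNetcdf {

    private static final ReentrantLock LIB_LOCK = new ReentrantLock();

    // Stand-in for a JNA call into libnetcdf (hypothetical, for
    // illustration only); pretend the return value is a file handle.
    private static int nativeOpen(String path) {
        return path.hashCode();
    }

    /** Workers call the library only through serialized wrappers like this. */
    public static int open(String path) {
        LIB_LOCK.lock();   // gotcha 1: serialize access to library operations
        try {
            return nativeOpen(path);
        } finally {
            LIB_LOCK.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Two server threads; the lock guarantees the calls do not overlap.
        Thread a = new Thread(() -> open("a.nc"));
        Thread b = new Thread(() -> open("b.nc"));
        a.start(); b.start();
        a.join(); b.join();
        System.out.println("done");
    }
}
```

Gotcha 2 (no two workers reading the same file) would need additional per-file bookkeeping on top of this; the global lock alone does not enforce it.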