azavea / noaa-hydro-data

NOAA Phase 2 Hydrological Data Processing

Direct translation of NetCDF files into Parquet with Spark #96

Closed jpolchlo closed 2 years ago

jpolchlo commented 2 years ago

Overview

We've had some difficulty converting the entire retrospective NWM output into Parquet. The aim here was to accomplish that by building a wide table with Spark. The code submitted here should work, but some significant difficulties mean that this submission exists only as a record of another approach we tried that didn't work out.

Strictly speaking, I didn't try as hard as I possibly could have to make this work, but I did make a solid effort to run it. The issue is that the table is so wide that the schema produces very large Spark SQL encoders. (It appears to be a known limitation that Spark does not handle very wide data well.) These encoders take up substantial amounts of memory, which hoses the GC, and the overhead of transmitting the encoder structures leads to a long wait before a dataframe can even be populated, much less written.
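
For reference, the pattern in question boils down to something like the sketch below. This is not the code from this PR; the column count, column names, and output path are made up, and the real table is far wider. The point is just that the Catalyst encoder generated for the schema scales with the number of columns and gets shipped around with every task:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object WideTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("wide-table-sketch").getOrCreate()

    // Hypothetical width: one column per feature, plus a reference-time key.
    // The actual NWM table is much wider, which is where the encoder blow-up comes from.
    val nFeatures = 10000
    val schema = StructType(
      StructField("reference_time", StringType, nullable = false) +:
        (0 until nFeatures).map(i => StructField(s"feature_$i", DoubleType, nullable = true))
    )

    // A single dummy row; the Catalyst encoder built for this schema grows
    // with the column count and must be serialized to every executor.
    val rows = spark.sparkContext.parallelize(Seq(
      Row.fromSeq("2018030412" +: Seq.fill(nFeatures)(0.0))
    ))

    val df = spark.createDataFrame(rows, schema)
    df.write.mode("overwrite").parquet("/tmp/wide-sketch.parquet")

    spark.stop()
  }
}
```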

There is some indication that simply writing these data out with native Parquet libraries (not using Spark) could be quite straightforward, circumventing the problems encountered when using Spark. However, it's unclear whether Hadoop can be used to write these data to S3.
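
Something along these lines is what I have in mind, again as a sketch rather than code from this PR: the Avro schema, field names, and output path are placeholders, and whether swapping the local path for an s3a:// URI (plus hadoop-aws and credentials) actually works is exactly the open question above.

```scala
import org.apache.avro.SchemaBuilder
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.ParquetWriter

object NativeParquetSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical narrow schema; the real one would carry a field per NWM variable.
    val schema = SchemaBuilder
      .record("NwmRow").namespace("com.azavea.noaa")
      .fields()
      .requiredString("reference_time")
      .requiredLong("feature_id")
      .requiredDouble("streamflow")
      .endRecord()

    val conf = new Configuration()
    // For S3 this would become an s3a:// path; whether that works here is untested.
    val out = new Path("/tmp/test-native.parquet")

    val writer: ParquetWriter[GenericRecord] = AvroParquetWriter
      .builder[GenericRecord](out)
      .withSchema(schema)
      .withConf(conf)
      .build()

    try {
      val record = new GenericData.Record(schema)
      record.put("reference_time", "2018030412")
      record.put("feature_id", 101L)
      record.put("streamflow", 1.23)
      writer.write(record)
    } finally {
      writer.close()
    }
  }
}
```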

Checklist

Notes

This PR requires building the s3+hdfs branch of the thredds library. Because that branch relies on projects published to Bintray, which has been sunsetted, the process takes several steps:

  1. get the proper branch of thredds:
    git clone 'git@github.com:Unidata/thredds.git'
    cd thredds/
    git fetch origin 'feature/s3+hdfs:feature/s3+hdfs'
    git checkout 'feature/s3+hdfs'
  2. clone and publish to the local Maven repository the following repo: https://github.com/Unidata/gretty using ./gradlew assemble; ./gradlew publishToMavenLocal
  3. clone and publish https://github.com/Reading-eScience-Centre/edal-java (branch: edal-1.4.2) using mvn package; mvn install (the package subcommand may be redundant)
  4. remove the Bintray repository and add mavenLocal() to the list in build.gradle for thredds
  5. update the edal version to 1.4.2 in gradle/any/dependencies.gradle from 1.4.2-SNAPSHOT for thredds
  6. build and publish thredds:
    ./gradlew assemble
    ./gradlew publishToMavenLocal

Testing Instructions

I used

spark-submit --driver-memory 16G --class com.azavea.noaa.Main target/scala-2.12/noaa-nwm-assembly-0.1.0.jar -s 2018030412 -e 2018030413 -o /tmp/test.parquet

as a test. This took an extremely long time to run, and it still has not produced output for even two rows of data. It appears that 16 GB of driver memory is not enough, which is crazy.

Connects #84

jpolchlo commented 2 years ago

As mentioned, Spark does not handle wide data well, owing to the complexity of the Catalyst structures that must be built to represent the dataframe. Treat this as a historical artifact that may be of use at some later date.