azavea / noaa-hydro-data

NOAA Phase 2 Hydrological Data Processing

Direct translation of NetCDF files into Parquet with Spark #96

Closed jpolchlo closed 2 years ago

jpolchlo commented 2 years ago

Overview

We've had some difficulty converting the entire retrospective NWM output into Parquet. The aim here was to accomplish that by building a wide table with Spark. The code submitted here should work, but some significant difficulties mean that this submission exists only as a record of another approach we tried that didn't work out.

Strictly speaking, I didn't try as hard as I possibly could have to make this work, but I did make a solid effort to run it. The issue is that the table is so wide that the schema produces very large Spark SQL encoders. (It appears to be a known limitation that Spark does not handle very wide data well.) These encoders take up substantial amounts of memory, which hoses the GC, and the overhead of transmitting the encoder structures leads to a long wait before a dataframe can even be populated, much less written.
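
For reference, the pattern in question boils down to something like the sketch below. This is not the code from this PR; the column count, column names, and output path are made up, and the real table is far wider. The point is just that the Catalyst encoder generated for the schema scales with the number of columns and gets shipped around with every task:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object WideTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("wide-table-sketch").getOrCreate()

    // Hypothetical width: one column per feature, plus a reference-time key.
    // The actual NWM table is much wider, which is where the encoder blow-up comes from.
    val nFeatures = 10000
    val schema = StructType(
      StructField("reference_time", StringType, nullable = false) +:
        (0 until nFeatures).map(i => StructField(s"feature_$i", DoubleType, nullable = true))
    )

    // A single dummy row; the Catalyst encoder built for this schema grows
    // with the column count and must be serialized to every executor.
    val rows = spark.sparkContext.parallelize(Seq(
      Row.fromSeq("2018030412" +: Seq.fill(nFeatures)(0.0))
    ))

    val df = spark.createDataFrame(rows, schema)
    df.write.mode("overwrite").parquet("/tmp/wide-sketch.parquet")

    spark.stop()
  }
}
```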

There is some indication that simply writing these data out with native Parquet libraries (not using Spark) could be quite straightforward, circumventing the problems encountered when using Spark. However, it's unclear whether Hadoop can be used to write these data to S3.
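
Something along these lines is what I have in mind, again as a sketch rather than code from this PR: the Avro schema, field names, and output path are placeholders, and whether swapping the local path for an s3a:// URI (plus hadoop-aws and credentials) actually works is exactly the open question above.

```scala
import org.apache.avro.SchemaBuilder
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.ParquetWriter

object NativeParquetSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical narrow schema; the real one would carry a field per NWM variable.
    val schema = SchemaBuilder
      .record("NwmRow").namespace("com.azavea.noaa")
      .fields()
      .requiredString("reference_time")
      .requiredLong("feature_id")
      .requiredDouble("streamflow")
      .endRecord()

    val conf = new Configuration()
    // For S3 this would become an s3a:// path; whether that works here is untested.
    val out = new Path("/tmp/test-native.parquet")

    val writer: ParquetWriter[GenericRecord] = AvroParquetWriter
      .builder[GenericRecord](out)
      .withSchema(schema)
      .withConf(conf)
      .build()

    try {
      val record = new GenericData.Record(schema)
      record.put("reference_time", "2018030412")
      record.put("feature_id", 101L)
      record.put("streamflow", 1.23)
      writer.write(record)
    } finally {
      writer.close()
    }
  }
}
```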

Checklist

Notes

This PR requires building the s3+hdfs branch of the thredds library. Because that branch relies on projects published to Bintray, which has been sunsetted, the process takes several steps:

  1. get the proper branch of thredds:
    git clone 'git@github.com:Unidata/thredds.git'
    cd thredds/
    git fetch origin 'feature/s3+hdfs:feature/s3+hdfs'
    git checkout 'feature/s3+hdfs'
  2. clone and publish to the local Maven repository the following repo: https://github.com/Unidata/gretty using ./gradlew assemble; ./gradlew publishToMavenLocal
  3. clone and publish https://github.com/Reading-eScience-Centre/edal-java (branch: edal-1.4.2) using mvn package; mvn install (the package subcommand may be redundant)
  4. remove the Bintray repository and add mavenLocal() to the list in build.gradle for thredds
  5. update the edal version to 1.4.2 in gradle/any/dependencies.gradle from 1.4.2-SNAPSHOT for thredds
  6. build and publish thredds:
    ./gradlew assemble
    ./gradlew publishToMavenLocal

Testing Instructions

I used

spark-submit --driver-memory 16G --class com.azavea.noaa.Main target/scala-2.12/noaa-nwm-assembly-0.1.0.jar -s 2018030412 -e 2018030413 -o /tmp/test.parquet

as a test. This took an extremely long time to run, and it still has not produced output for even two rows of data. It appears that 16 GB of driver memory is not enough, which is crazy.

Connects #84

jpolchlo commented 2 years ago

As mentioned, Spark does not handle wide data well, owing to the complexity of the Catalyst structures that must be built to represent the dataframe. Treat this as a historical artifact that may be of use at some later date.