As mentioned, Spark does not handle wide data well, due to the complexity of the Catalyst structures that must be built to represent the dataframe. Treat this as a historical artifact that may be of use at some later date.
Overview
We've had some difficulty converting the entire retrospective NWM output into Parquet. The aim here was to accomplish that by using Spark to build a wide table. The code submitted here should work, but significant difficulties mean that this submission exists only as a record of another thing we tried that didn't work.
Strictly speaking, I didn't try as absolutely hard as I could have to make this work. But I did make a solid effort to run this. The issue is that the table is so wide that the schema results in Spark SQL encoders that are very large. (It appears to be known that Spark is not very effective at wide data.) These take up substantial amounts of memory that hose the GC, and the overhead of transmitting the encoder structures leads to a long wait time before being able to even populate a dataframe, much less write it.
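For reference, the shape of the problem looks roughly like the sketch below. This is not the code from this PR: the column names and the column count are hypothetical stand-ins for the NWM reach set, which is far larger. Simply declaring a dataframe with one column per reach is enough to force Spark to build and ship very large Catalyst/encoder structures before any data is touched.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, FloatType

spark = SparkSession.builder.appName("wide-nwm-sketch").getOrCreate()

# Hypothetical: one column per NWM feature id; the real reach count is much larger.
feature_ids = range(100_000)

schema = StructType(
    [StructField("reference_time", StringType(), False)]
    + [StructField(f"streamflow_{fid}", FloatType(), True) for fid in feature_ids]
)

# Even an empty dataframe with this schema requires Spark to construct and
# serialize enormous encoder structures, which is where the GC pressure and
# the long wait before any real work starts come from.
df = spark.createDataFrame([], schema)
print(len(df.schema.fields))
```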
There is some indication that simply writing these data out with native Parquet libraries (not using Spark) could be quite straightforward, circumventing the problems that were encountered when using Spark. However, it's unclear whether Hadoop can be used to write these data to S3.
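A rough sketch of that route, assuming pyarrow (the library choice, column names, counts, and S3 location are all hypothetical; the PR doesn't commit to any of them). Note that pyarrow can target S3 directly, which would sidestep the Hadoop-on-S3 question entirely.

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

n_features = 100_000   # hypothetical stand-in for the NWM reach count
n_timesteps = 2

# One column per reach, one row per timestep -- built in plain memory,
# with none of Spark's encoder machinery involved.
columns = {"reference_time": pa.array([f"t{i}" for i in range(n_timesteps)])}
for fid in range(n_features):
    columns[f"streamflow_{fid}"] = pa.array(np.zeros(n_timesteps, dtype=np.float32))

table = pa.table(columns)

# Local writes are trivial; the commented lines show the direct-to-S3 variant.
pq.write_table(table, "nwm_wide.parquet")
# s3 = fs.S3FileSystem(region="us-east-1")  # hypothetical bucket/region
# pq.write_table(table, "bucket/prefix/nwm_wide.parquet", filesystem=s3)
```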
Checklist

- Ran `nbautoexport export .` in `/opt/src/notebooks` and committed the generated scripts. This is to make reviewing notebooks easier. (Note the export will happen automatically after saving notebooks from the Jupyter web app.)

Notes
This PR requires building the s3+hdfs branch of the thredds library. Because that branch relies on projects published to Bintray, which has been sunset, the process takes several steps:

- Run `gradlew assemble; gradlew publishToMavenLocal`.
- Run `mvn package; mvn install` (the `package` subcommand may be redundant).
- In `build.gradle` for thredds, set the version to `1.4.2`; in `gradle/any/dependencies.gradle`, change it from `1.4.2-SNAPSHOT`.
Testing Instructions
I used
as a test. This took an extremely long time to run, and I still have not produced output for two rows of data. It appears that 16GB is not enough, which is crazy.
Connects #84