azavea / osmesa

OSMesa is an OpenStreetMap processing stack based on GeoTrellis and Apache Spark
Apache License 2.0

Introduce SAX parser for Change files #131

Closed by jpolchlo 5 years ago

jpolchlo commented 5 years ago

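Introduces a SAX parser-based ChangeSource, which addresses the problem of very large change files. We've encountered change files large enough that scala.xml.XML.loadString crashes; the SAX parser streams through the document instead of building the whole tree in memory, so it can handle such files. The SAX parser also provides a modest speedup, roughly 14% lower average time per operation in the JMH run below, although the 99.9% confidence intervals of the two benchmarks overlap: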

[info] Running (fork) org.openjdk.jmh.Main -t 1 -f 1 -wi 5 -i 5 SAXBench.*
[info] # JMH version: 1.19
[info] # VM version: JDK 1.8.0_152, VM 25.152-b16
[info] # VM invoker: /opt/oracle-jdk-bin-1.8.0.152/jre/bin/java
[info] # VM options: <none>
[info] # Warmup: 5 iterations, 1 s each
[info] # Measurement: 5 iterations, 1 s each
[info] # Timeout: 10 min per iteration
[info] # Threads: 1 thread, will synchronize iterations
[info] # Benchmark mode: Average time, time/op
[info] # Benchmark: osmesa.SAXBench.getSAXyGirl
[info] # Run progress: 0.00% complete, ETA 00:00:20
[info] # Fork: 1 of 1
[info] # Warmup Iteration   1: 125519.792 us/op
[info] # Warmup Iteration   2: 32339.142 us/op
[info] # Warmup Iteration   3: 27834.656 us/op
[info] # Warmup Iteration   4: 29749.730 us/op
[info] # Warmup Iteration   5: 26116.916 us/op
[info] Iteration   1: 28236.399 us/op
[info] Iteration   2: 25912.427 us/op
[info] Iteration   3: 26004.409 us/op
[info] Iteration   4: 25559.654 us/op
[info] Iteration   5: 25996.652 us/op
[info] Result "osmesa.SAXBench.getSAXyGirl":
[info]   26341.908 ±(99.9%) 4137.686 us/op [Average]
[info]   (min, avg, max) = (25559.654, 26341.908, 28236.399), stdev = 1074.544
[info]   CI (99.9%): [22204.222, 30479.594] (assumes normal distribution)
[info] # JMH version: 1.19
[info] # VM version: JDK 1.8.0_152, VM 25.152-b16
[info] # VM invoker: /opt/oracle-jdk-bin-1.8.0.152/jre/bin/java
[info] # VM options: <none>
[info] # Warmup: 5 iterations, 1 s each
[info] # Measurement: 5 iterations, 1 s each
[info] # Timeout: 10 min per iteration
[info] # Threads: 1 thread, will synchronize iterations
[info] # Benchmark mode: Average time, time/op
[info] # Benchmark: osmesa.SAXBench.useScala
[info] # Run progress: 50.00% complete, ETA 00:00:10
[info] # Fork: 1 of 1
[info] # Warmup Iteration   1: 176797.254 us/op
[info] # Warmup Iteration   2: 44653.817 us/op
[info] # Warmup Iteration   3: 32220.748 us/op
[info] # Warmup Iteration   4: 33895.272 us/op
[info] # Warmup Iteration   5: 30096.884 us/op
[info] Iteration   1: 31979.810 us/op
[info] Iteration   2: 30700.955 us/op
[info] Iteration   3: 29977.848 us/op
[info] Iteration   4: 30450.783 us/op
[info] Iteration   5: 29962.016 us/op
[info] Result "osmesa.SAXBench.useScala":
[info]   30614.282 ±(99.9%) 3180.811 us/op [Average]
[info]   (min, avg, max) = (29962.016, 30614.282, 31979.810), stdev = 826.046
[info]   CI (99.9%): [27433.472, 33795.093] (assumes normal distribution)
[info] # Run complete. Total time: 00:00:21
[info] Benchmark             Mode  Cnt      Score      Error  Units
[info] SAXBench.getSAXyGirl  avgt    5  26341.908 ± 4137.686  us/op
[info] SAXBench.useScala     avgt    5  30614.282 ± 3180.811  us/op

Closes #119
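
For readers unfamiliar with the approach, here is a minimal sketch of what a SAX-based reader for an osmChange document might look like. It is not the ChangeSource implementation from this PR: `ElementChange`, `ChangeHandler`, and `SaxChangeSketch` are hypothetical names chosen for illustration, and the sketch assumes every `node`, `way`, and `relation` element carries `id` and `version` attributes. It uses only the JDK's `javax.xml.parsers` and `org.xml.sax` classes.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler

// Hypothetical record type for illustration; not OSMesa's actual schema.
final case class ElementChange(action: String, kind: String, id: Long, version: Int)

// Receives SAX events and emits one record per changed element.
// Only the current <create>/<modify>/<delete> action is kept as state,
// so the handler never holds the whole XML tree.
class ChangeHandler(emit: ElementChange => Unit) extends DefaultHandler {
  private var action: String = ""

  override def startElement(uri: String, localName: String,
                            qName: String, attrs: Attributes): Unit =
    qName match {
      case "create" | "modify" | "delete" =>
        action = qName
      case "node" | "way" | "relation" =>
        // Assumption: id and version are always present on these elements.
        emit(ElementChange(action, qName,
          attrs.getValue("id").toLong, attrs.getValue("version").toInt))
      case _ => () // nd, tag, member, etc. are ignored in this sketch
    }

  override def endElement(uri: String, localName: String, qName: String): Unit =
    qName match {
      case "create" | "modify" | "delete" => action = ""
      case _ => ()
    }
}

object SaxChangeSketch {
  // Parses an osmChange stream. The results are collected into a Vector
  // here for convenience; a real consumer would process each record as
  // it is emitted to keep memory use flat.
  def parse(in: InputStream): Vector[ElementChange] = {
    val out = Vector.newBuilder[ElementChange]
    val parser = SAXParserFactory.newInstance().newSAXParser()
    parser.parse(in, new ChangeHandler(c => out += c))
    out.result()
  }

  def main(args: Array[String]): Unit = {
    val xml =
      """<osmChange version="0.6">
        |  <create><node id="1" version="1" lat="0.0" lon="0.0"/></create>
        |  <modify><way id="2" version="3"><nd ref="1"/></way></modify>
        |</osmChange>""".stripMargin
    val in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))
    parse(in).foreach(println)
  }
}
```

Because the handler sees one element at a time, peak memory depends on what the callback retains rather than on the size of the file, which is the property that matters for the very large change files described above.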

jpolchlo commented 5 years ago

This PR also needs testing in whichever setting you encountered the problem with the old XML parser, @mojodna. I saw some potential problems with this in the REPL (very long run time due to GC pressure, I think).