ORC IO - Githubissues

geotrellis / vectorpipe

Convert Vector data to VectorTiles with GeoTrellis.

https://geotrellis.github.io/vectorpipe/


ORC IO #18

Closed fosskers closed 7 years ago

fosskers commented 7 years ago

TODO

[x] fromLocalORC
~~fromRemoteORC ?~~ (fromLocalORC is now fromORC)
[x] Testing
[x] Docs

Closes #8.

mojodna commented 7 years ago

fromRemoteORC is probably fromLocalORC with URIs that point to remote sources (e.g. s3a://...)
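A minimal sketch of that idea, written against PySpark rather than vectorpipe's own Scala API. The helper names `source_kind` and `count_orc_rows` are hypothetical, and `spark` is assumed to be an already-created `SparkSession`. The point is only that the read call does not change between local and remote sources; the URI scheme decides which Hadoop filesystem connector does the work.

```python
from urllib.parse import urlparse


def source_kind(uri: str) -> str:
    """Classify an ORC location by its URI scheme (illustrative helper)."""
    scheme = urlparse(uri).scheme
    # No scheme, or file://, means the local filesystem. Anything else
    # (s3a://, hdfs://, ...) is resolved by a Hadoop filesystem connector.
    return "local" if scheme in ("", "file") else "remote"


def count_orc_rows(spark, uri: str) -> int:
    """Count rows in an ORC file.

    `spark` is assumed to be a pyspark.sql.SparkSession; this function is a
    hypothetical stand-in, not vectorpipe's fromORC.
    """
    # The same call serves both cases; only the URI differs.
    return spark.read.orc(uri).count()
```

With a real session, `count_orc_rows(spark, "/data/extract.orc")` and `count_orc_rows(spark, "s3a://some-bucket/extract.orc")` would go through the same code path (both paths are made up here). Reading from `s3a://` additionally assumes the `hadoop-aws` connector and credentials are available to Spark.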

fosskers commented 7 years ago

Probably! I haven't looked into the remote case at all yet. I was assuming there would be some AWS magic. I remember the word "Athena"?

mojodna commented 7 years ago

Nah, in this case you're just looking at Hadoop / HDFS / other filesystem "magic."

Athena is AWS's hosted Presto service.

fosskers commented 7 years ago

This is very good news.

fosskers commented 7 years ago

fromLocalORC timings in seconds, counting Elements on a 178 MB file:

Nodes: 32.3s to count 15.4 million nodes
Ways: 18.7s to count 1.49 million ways
Rels: 11.5s to count 86k relations

Interestingly, my streaming-osm lib can count that same number of nodes in ~30s with a single core. Spark overhead must be pretty heavy here, but the good sign is that our time doesn't grow linearly (or worse) with the element count. I.e., counting the rels might be "slow", but counting two orders of magnitude more nodes doesn't take 100 times as long.

This should speed up nicely with a big EMR cluster.
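The scaling claim can be checked with a little arithmetic on the figures quoted above. This is only a sketch: the element counts are the rounded values from the comment, and the variable names are made up for illustration.

```python
# (element count, seconds) as reported in the timings above; counts are rounded.
timings = {
    "nodes": (15_400_000, 32.3),
    "ways": (1_490_000, 18.7),
    "relations": (86_000, 11.5),
}

# Elements counted per second for each element type.
rates = {name: count / seconds for name, (count, seconds) in timings.items()}

# How many times more nodes than relations there are, and how many times
# longer the node count took.
count_ratio = timings["nodes"][0] / timings["relations"][0]
time_ratio = timings["nodes"][1] / timings["relations"][1]

for name, rate in rates.items():
    print(f"{name}: {rate:,.0f} elements/s")
print(f"{count_ratio:.0f}x the elements took {time_ratio:.1f}x the time")
```

Roughly 179 times as many elements took under 3 times as long, which is consistent with a large fixed cost per job rather than a cost proportional to the number of elements.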

lossyrob commented 7 years ago

> Interestingly, my streaming-osm lib can count that same number of nodes in ~30s with a single core.

How many cores is spark using? Making sure Spark is configured to utilize all cores and memory and increasing the partition count could potentially speed things up, even in the local case.
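As a rough sketch of the partition-count suggestion, again in PySpark rather than vectorpipe's Scala API: the helper names are hypothetical, `df` is assumed to be a Spark DataFrame, and the default of 3 tasks per core follows the general "2-3 tasks per CPU core" guidance in the Spark tuning guide rather than anything measured on this data.

```python
def suggested_partitions(cores: int, tasks_per_core: int = 3) -> int:
    """Rule-of-thumb partition count: a few tasks per available core."""
    return max(1, cores * tasks_per_core)


def count_repartitioned(df, num_partitions: int) -> int:
    """Redistribute `df` (assumed to be a pyspark DataFrame), then count it."""
    # repartition() triggers a shuffle, which has its own cost, so whether
    # this helps a simple count has to be measured rather than assumed.
    return df.repartition(num_partitions).count()
```

On a 4-core machine running with `local[*]` (one worker thread per logical core), `suggested_partitions(4)` gives 12. Comparing `df.count()` against `count_repartitioned(df, 12)` on the same file would show whether the extra parallelism outweighs the shuffle.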

fosskers commented 7 years ago

I'm using local[*] and I see it grinding away with all 4 cores on my machine. I always forget about partition count, so tweaking that might speed things up.

fromORC is painfully slow when dealing with large files on S3, but that's expected. It should be faster on EMR because of its special connection to S3.