Closed fosskers closed 7 years ago
fromRemoteORC is probably fromLocalORC with URIs that point to remote sources (e.g. s3a://...)
Probably! I haven't looked into the remote case at all yet. I was assuming there would be some AWS magic. I remember the word "Athena"?
Nah, in this case you're just looking at Hadoop / HDFS / other filesystem "magic."
Athena is AWS's hosted Presto service.
This is very good news.
fromLocalORC timings in seconds, counting Elements on a 178 MB file:
Nodes: 32.3s to count 15.4mil nodes
Ways: 18.7s to count 1.49mil ways
Rels: 11.5s to count 86k relations
Interestingly, my streaming-osm lib can count that same number of nodes in ~30s with a single core.
Spark overhead must be pretty heavy here, but the good sign is that our time doesn't grow linearly (or worse) with the number of elements, i.e. counting the rels might be "slow", but counting 2 orders of magnitude more nodes doesn't take 100 times as long.
This should speed up nicely with a big EMR cluster.
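The scaling observation above can be checked with quick arithmetic on the reported timings. This is only a sketch over the rounded figures quoted in this thread; the `throughput` helper is a hypothetical name introduced for illustration.

```python
# Timings reported above: (seconds, elements counted). Figures are rounded.
timings = {
    "nodes": (32.3, 15_400_000),
    "ways": (18.7, 1_490_000),
    "relations": (11.5, 86_000),
}


def throughput(seconds, count):
    """Elements counted per second."""
    return count / seconds


for kind, (seconds, count) in timings.items():
    print(f"{kind}: {throughput(seconds, count):,.0f} elements/s")

# Compare nodes against relations: how much more data vs. how much more time.
node_s, node_n = timings["nodes"]
rel_s, rel_n = timings["relations"]
size_ratio = node_n / rel_n  # how many times more elements
time_ratio = node_s / rel_s  # how many times more seconds
print(f"{size_ratio:.0f}x the elements took {time_ratio:.1f}x the time")
```

The element count grows far faster than the wall-clock time, which is consistent with (though it doesn't prove) a large fixed cost per run rather than a per-element cost.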
> Interestingly, my streaming-osm lib can count that same number of nodes in ~30s with a single core.
How many cores is Spark using? Making sure Spark is configured to utilize all cores and memory, and increasing the partition count, could potentially speed things up, even in the local case.
I'm using local[*] and I see it grinding away with all 4 cores on my machine. I always forget about partition count, so tweaking that might speed things up.
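For reference, here is roughly where those two knobs (the master setting and the partition count) sit when reading ORC with plain Spark. This is a PySpark sketch, not this project's API: `count_elements` and `suggested_partitions` are hypothetical helpers, the app name is made up, and the `type` column is an assumption about the ORC schema. The "2-3 tasks per core" figure is the rule of thumb from the Spark tuning guide.

```python
def suggested_partitions(cores, per_core=3):
    """Rule-of-thumb partition count: a few tasks per available core."""
    return cores * per_core


def count_elements(path, kind, partitions=None, master="local[*]"):
    """Count elements of one kind (e.g. "node") in an ORC file.

    Assumes the file has a string column named "type"; adjust the
    filter if the schema differs.
    """
    # Imported lazily so this module loads even without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = (
        SparkSession.builder
        .master(master)        # local[*] = one worker thread per core
        .appName("orc-count")  # hypothetical app name
        .getOrCreate()
    )
    df = spark.read.orc(path)
    if partitions is not None:
        # Explicit repartitioning triggers a shuffle, so measure before
        # and after; for a plain count it may not pay off.
        df = df.repartition(partitions)
    return df.filter(col("type") == kind).count()
```

On a 4-core machine, something like `count_elements("path/to/file.orc", "node", partitions=suggested_partitions(4))` would try 12 partitions; the path here is a placeholder.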
fromORC is ass-slow when dealing with large files on S3, but that's expected. Should magically be faster on EMR because of its special connection to S3.
TODO
fromLocalORC
(fromRemoteORC? fromLocalORC is now fromORC)
Closes #8.