bmwcarit / barefoot

Java map matching library for integrating the map into software and services with state-of-the-art online and offline map matching that can be used stand-alone and in the cloud.
Apache License 2.0

Hadoop friendly architecture / directly load OSM data #120

Open geoHeil opened 5 years ago

geoHeil commented 5 years ago

How hard do you think it would be to build an add-on that, instead of loading from PostGIS, loads the data directly from Parquet files stored in Hadoop?

https://github.com/adrianulbona/osm-parquetizer

Data in this format is published daily, already converted, at http://osm-data.skobbler.net.

oldrev commented 5 years ago

Hi, the answer is: not that hard. You could do it by implementing your own RoadReader interface.

Copying and modifying the PostGISReader class is a good start.
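For example, a rough skeleton of such a reader might look like the following. This is only a sketch: the class name, path handling, and column mapping are made up, and the RoadReader method signatures as well as the exact BaseRoad constructor should be checked against the barefoot sources (PostGISReader shows the precise field mapping).

```java
import java.util.HashSet;

import com.bmwcarit.barefoot.road.BaseRoad;
import com.bmwcarit.barefoot.road.RoadReader;
import com.esri.core.geometry.Polygon;

// Hypothetical reader that iterates pre-processed road rows stored as
// Parquet in HDFS and exposes them to barefoot as BaseRoad objects.
public class ParquetRoadReader implements RoadReader {
    private final String path; // e.g. an hdfs:// path to the Parquet data
    private boolean open = false;
    // ... handle to your Parquet/HDFS record iterator goes here ...

    public ParquetRoadReader(String path) {
        this.path = path;
    }

    @Override
    public boolean isOpen() {
        return open;
    }

    @Override
    public void open() {
        open(null, null);
    }

    @Override
    public void open(Polygon polygon, HashSet<Short> exclusions) {
        // Open the Parquet/HDFS source; optionally filter rows by the polygon
        // and by excluded road types, like PostGISReader does in its SQL query.
        open = true;
    }

    @Override
    public void close() {
        // Release the underlying file handle / iterator.
        open = false;
    }

    @Override
    public BaseRoad next() {
        // Map the next row's columns (road id, source/target node, one-way flag,
        // road type, max speed, length, geometry) to a BaseRoad, exactly as
        // PostGISReader does; return null when there are no rows left.
        return null; // placeholder
    }
}
```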

jongiddy commented 5 years ago

Barefoot does some transformation of the PBF files to create an efficient dataset for its use. It would be great to be able to do that conversion in a Hadoop cluster, but I don't think it is trivial.

However, once you have the correctly formatted files in the Hadoop cluster, it should be fairly easy to create a new Parquet-aware RoadReader.

I do the initial processing on a local VM, using the map/osm/import.sh script to import PBF data into PostgreSQL, then https://github.com/jongiddy/barefoot-map-db-file to export from PostgreSQL to a single .bfmap file, which I then upload to HDFS. My Spark jobs use https://github.com/jongiddy/barefoot-hdfs-reader to read the map data from HDFS.
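For anyone reproducing this, the loading step for the exported file looks roughly like the snippet below. The file path is just a placeholder; BfmapReader is the reader barefoot uses for .bfmap files, and barefoot-hdfs-reader fills the same role when the file lives in HDFS.

```java
import com.bmwcarit.barefoot.road.BfmapReader;
import com.bmwcarit.barefoot.roadmap.RoadMap;

public class LoadBfmap {
    public static void main(String[] args) {
        // Load the pre-processed roads from the exported .bfmap file and build
        // the in-memory routing graph that the matcher works on.
        RoadMap map = RoadMap.Load(new BfmapReader("/data/oberbayern.bfmap"));
        map.construct();

        // In a Spark job, barefoot-hdfs-reader provides the equivalent reader
        // for .bfmap data stored in HDFS.
    }
}
```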

geoHeil commented 5 years ago

Thanks

That is great news.

geoHeil commented 5 years ago

@jongiddy do I understand correctly that the map is always loaded completely into memory, and that this (especially for the whole world) requires a fairly large amount of RAM on the executors?

Also, when looking into the Hadoop-native file format: would the driver need to collect the whole Parquet file and then broadcast it?

smattheis commented 5 years ago

@jongiddy and @oldrev already pointed out the relevant aspects. (Thanks!) I have only one note to add: the pre-processing step is mostly a transformation of OSM roads into a routable format, which means splitting roads into the edges of a graph. In OSM, roads are often long and cross intersections, as e.g. at the intersection of https://www.openstreetmap.org/way/33954504 and https://www.openstreetmap.org/way/31662854, so a road must be split into multiple edges to represent the intersection and to allow turns. This pre-processing is done by the import scripts @jongiddy mentioned; a direct import into HDFS would need to implement that pre-processing step as well.

Further, with the road readers you can define a subregion to be loaded into RAM or saved to an HDFS file (see the sketch below). However, routing and map matching across subregions is not supported at the moment. This means it won't help if you want to have a large map and just want to organize it in tiles. It only helps if, for some use case, you need ONLY a subregion of the map data you initially imported into the map server.
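To illustrate the subregion case, a sketch along these lines should work: it reads only the roads inside a bounding polygon from any RoadReader (e.g. a PostGISReader pointed at the map server) and saves them as a .bfmap file. The polygon coordinates and output path are placeholders, and the exact RoadReader/RoadWriter method signatures should be verified against the sources.

```java
import java.util.HashSet;

import com.bmwcarit.barefoot.road.BaseRoad;
import com.bmwcarit.barefoot.road.BfmapWriter;
import com.bmwcarit.barefoot.road.RoadReader;
import com.bmwcarit.barefoot.road.RoadWriter;
import com.esri.core.geometry.Polygon;

public class SubregionExport {
    // Reads only the roads intersecting a bounding polygon from the given
    // reader and writes them to a .bfmap file that can be loaded later or
    // copied into HDFS.
    public static void export(RoadReader reader, String outputPath) {
        // Bounding polygon (lon/lat) of the subregion; these coordinates are
        // made-up placeholders.
        Polygon polygon = new Polygon();
        polygon.startPath(11.3, 48.0);
        polygon.lineTo(11.8, 48.0);
        polygon.lineTo(11.8, 48.3);
        polygon.lineTo(11.3, 48.3);
        polygon.closePathWithLine();

        RoadWriter writer = new BfmapWriter(outputPath);

        reader.open(polygon, new HashSet<Short>()); // no road-type exclusions
        writer.open();

        BaseRoad road;
        while ((road = reader.next()) != null) {
            writer.write(road);
        }

        writer.close();
        reader.close();
    }
}
```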