Project-OSRM / osrm-backend

Open Source Routing Machine - C++ backend
http://map.project-osrm.org
BSD 2-Clause "Simplified" License

Make the Normalized File Format a supported feature #825

Closed woodbri closed 7 years ago

woodbri commented 10 years ago

Currently osrm-extract extracts data from OSM into a Normalized File Format, that is documented on the Wiki. A few people, including myself, have made attempts to use these files to import non-OSM data into OSRM.

The problem is that the documentation is not updated when the code changes, which forces users of these files to reverse engineer them after every release. In addition, the UUIDs written in the file headers reflect the state of the OSRM library files that read/write these files, and they have changed in size from the last version to the current one. Unfortunately, these files are not packaged behind an API that others could use easily in code intended to work with OSRM.

In an ideal world, it would make sense to minimize the effort of supporting this valuable feature of OSRM by creating a small C++ class and library for reading and writing these files, which would only require documenting the class. Other parties could then link to that library, and OSRM would use the same one. A high-level API for writing the .osrm, .osrm.names and .osrm.restrictions files might be packaged as something like libOSRMio.h and libOSRMio.so, included in the existing libOSRM.so, or whatever makes sense.

A simple class might look something like:

OSRMio io(std::string base_filename);

io.writeNames(std::vector<std::string> names);
io.writeVertices(std::vector<Vertex> vertices);
io.writeEdges(std::vector<Edge> edges);
io.writeRestrictions(std::vector<Restriction> restrictions);

For very large graphs we might want to have an interface that is more iterative in nature, like:

io.writeNames(count_of_names);
for (int i = 0; i < count_of_names; i++) {
    io.writeName(names[i]);
}

io.writeVertices(count_of_vertices);
for (int i = 0; i < count_of_vertices; i++) {
    io.writeVertex(lat[i], lon[i], id[i], flags[i]);
}

etc. for edges and restrictions. In my use case I would not use an array, but rather records coming out of a database that need to get written.

The idea is to hide the details of file headers, internal file formats, etc. I'm assuming that osrm-extract and osrm-prepare would/could use this same API so everything stays in sync.
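Such a facade might be sketched as follows. The class name, method names, and record fields below are hypothetical illustrations, not OSRM's actual internals, and the on-disk layout shown is a placeholder:

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type; the real OSRM structures differ.
struct Vertex { double lat, lon; std::uint32_t id; };

// Sketch of the proposed I/O facade: callers hand over plain records
// and never see file headers, UUIDs, or the on-disk layout.
class OSRMio {
public:
    explicit OSRMio(std::string base_filename)
        : base_(std::move(base_filename)) {}

    void writeNames(const std::vector<std::string>& names) {
        std::ofstream out(base_ + ".names");
        for (const auto& n : names) out << n << '\n';
    }

    void writeVertices(const std::vector<Vertex>& vertices) {
        std::ofstream out(base_ + ".vertices");
        for (const auto& v : vertices)
            out << v.id << '\t' << v.lat << '\t' << v.lon << '\n';
    }

private:
    std::string base_;
};
```

Because the caller only sees record types, the library is free to change the header or binary layout between releases without breaking third-party tools.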

DennisOSRM commented 10 years ago

I like the idea, but reading/writing arrays of objects one by one carries a significant performance penalty compared to writing everything in one go.

woodbri commented 10 years ago

One of the benefits of this approach is that the API can be decoupled from the underlying implementation. The API just needs a reasonably stable interface, which might include some convenience functions for the caller, while the underlying implementation stays hidden. For example, PostgreSQL does this when you fetch query results: you have a simple iterator to access results row by row, or all at once, but you have no idea how the database may be caching results or processing the query.
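The decoupling idea can be sketched with an abstract cursor interface whose backing store is invisible to the caller. Everything here is an illustrative assumption, not an OSRM API:

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cursor interface: callers pull rows one at a time
// without knowing how the backend buffers or produces them.
struct NameCursor {
    virtual ~NameCursor() = default;
    virtual std::optional<std::string> next() = 0;  // nullopt when exhausted
};

// One possible backend: an in-memory vector. A database-backed
// implementation could stream rows instead; callers cannot tell.
class VectorCursor : public NameCursor {
public:
    explicit VectorCursor(std::vector<std::string> rows)
        : rows_(std::move(rows)) {}
    std::optional<std::string> next() override {
        if (pos_ == rows_.size()) return std::nullopt;
        return rows_[pos_++];
    }
private:
    std::vector<std::string> rows_;
    std::size_t pos_ = 0;
};
```

A batch convenience function (take a whole vector, loop internally) can sit on top of the same interface, so the one-by-one penalty mentioned above becomes an implementation detail rather than an API constraint.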

Anyway, regardless, anything that hides some of the current implementation will make this interface more stable from a third-party point of view, which will increase its usage and make OSRM more widely accessible. I'm willing to help with testing and documentation, and will release FOSS code for PostgreSQL-to-OSRM tools based on it.

DennisOSRM commented 10 years ago

I see the point. I'll keep this feature request in mind.

woodbri commented 10 years ago

Thanks. Let me throw out another alternative approach to supporting this: define a text file format, which could roughly follow the binary format, and then create a reader for the text files. Maybe something like:

# <project>.osrm.names.txt
<file_format_version>
NAMES <number_of_names>
<id><tab><name>
...
END

# <project>.osrm.txt
<file_format_version>
VERTICES <number_of_vertices>
<lat><tab><long><tab><vertex_id><tab><bollard><tab><stop_light>
...
END
EDGES <number_of_edges>
<source_id><tab><target_id><tab><length_meters><tab><direction><tab><weight><tab><type><tab><name_id><tab><roundabout><tab><ignoreGrid><tab><accessRestricted><tab><contraFlow>
...
END

# <project>.osrm.restrictions.txt
<file_format_version>
RESTRICTIONS <number_of_restrictions>
<to_way_id><tab><via_node><tab><from_node><tab><to_node><tab><forbidden>
...
END

I would ignore lines matching the regex /^\s*#|^\s*$/ (comments and blank lines). While text files are not space-efficient for large graphs, this provides a human-readable format that could be bzip2'd to reduce its size. Another benefit of this approach is that one would not need to link against the OSRM libraries, and the files could be generated using Perl, Python, PHP, etc. Regardless, it might be nice to be able to dump an osrm file set as text for debugging purposes.

I specifically added the section headers and 'END' to provide a means of verifying that the reader is reading the correct section, and to assist in validating the data on input, making it a little more user-friendly when it encounters an issue with the data.
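A reader for the proposed names file could look roughly like the following sketch, under the assumption of the layout above (tab-separated id/name rows, a NAMES count header, and an END sentinel; the function name and error handling are invented for illustration):

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Parse the proposed <project>.osrm.names.txt layout: comment and
// blank lines are skipped, the NAMES header gives the row count, and
// END must close the section or the input is rejected.
std::map<int, std::string> readNames(std::istream& in) {
    std::map<int, std::string> names;
    std::string line;
    auto nextLine = [&]() -> std::string& {
        while (std::getline(in, line)) {
            if (line.empty() || line[0] == '#') continue;  // skip comments/blanks
            return line;
        }
        throw std::runtime_error("unexpected end of file");
    };
    nextLine();  // <file_format_version>, ignored in this sketch
    std::istringstream header(nextLine());
    std::string keyword;
    int count = 0;
    header >> keyword >> count;
    if (keyword != "NAMES") throw std::runtime_error("expected NAMES section");
    for (int i = 0; i < count; ++i) {
        std::istringstream row(nextLine());
        int id;
        std::string name;
        row >> id;
        std::getline(row, name);  // rest of the line after the id
        names[id] = name.substr(name.find_first_not_of("\t "));
    }
    if (nextLine() != "END") throw std::runtime_error("missing END sentinel");
    return names;
}
```

The count-then-END redundancy is what makes truncated or corrupted files detectable rather than silently misread.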

DennisOSRM commented 10 years ago

ASCII/text based formats are even worse when it comes to performance. But it appears that an easy to use library interface would suffice, right?

woodbri commented 10 years ago

Yes, an easy-to-use interface would be fine.

DennisOSRM commented 10 years ago

OK, cool. Officially on the todo list now.

woodbri commented 10 years ago

@DennisOSRM I just started looking at updating my tools to the current master code, and it looks like another reverse engineering task to sync up. There have been a lot of changes in Extractor/ExtractionContainers.cpp since my last go at this. #984 is also dealing with this issue. I'm wondering if this is anywhere close to the top of your todo list?

woodbri commented 10 years ago

It has been brought to my attention that my osrm-tools pgr2osrm does not convey the polyline information about an edge. Most GIS programs represent an edge as a polyline, and the graph only needs the start and end node ids plus the edge id that represents that edge; using the edge id you can then reference the polyline. This seems to be handled differently in OSM, and I'm not sure I understand all the implications. It seems that GIS edges need to be passed to Project-OSRM as a collection of noded straight line segments. For example, if I have an edge with coordinates A-B-C, then I need to pass A-B and B-C, even though the graph only needs A-C, because there is no way to pass the additional shape vertices in the polyline.

Q: Is this correct? Assuming it is, it would be good for the API to have the ability to pass a polyline as an edge.

Q: Does OSRM do anything to simplify the graph by eliminating nodes that have only two edges connecting to them, or does this happen as a natural artifact of the contraction process?

Q: Would separating the edge shape information into a separate table, so the graph contains only the simplified graph structure, reduce memory requirements and speed up processing?
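Assuming the answer to the first question is yes, splitting a GIS polyline into noded straight segments is mechanical. A minimal sketch (types and function name are illustrative, not part of any OSRM API):

```cpp
#include <cstddef>
#include <vector>

struct Point { double lat, lon; };
struct Segment { Point from, to; };

// Split a polyline A-B-C-... into consecutive straight segments
// (A-B, B-C, ...), which is how an edge with intermediate shape
// points would be fed in if only node-to-node segments are accepted.
std::vector<Segment> splitPolyline(const std::vector<Point>& poly) {
    std::vector<Segment> segments;
    for (std::size_t i = 0; i + 1 < poly.size(); ++i)
        segments.push_back({poly[i], poly[i + 1]});
    return segments;
}
```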

emiltin commented 10 years ago

A way in OSM can have many nodes; it's a polyline with straight lines between nodes. I believe OSRM splits ways into individual pieces at every node. It also applies turn penalties at every node, as can be seen from the following example:

    Scenario: Winding roads
        Given the profile "turnbot"

        And the node map
            | a | b | c | d |
            |   |   |   |   |
            | e | f |   |   |
            |   | g | h |   |

        And the ways
            | nodes |
            | abcd  |
            | efgh  |

        When I route I should get
            | from | to | route | distance | time    |
            | a    | d  | abcd  | 300m +-1 | 30s +-1 |
            | e    | h  | efgh  | 300m +-1 | 50s +-1 |

The above scenario passes. The ways abcd and efgh have equal length, but efgh takes longer to traverse because it contains two bends.
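The arithmetic in the table can be reproduced with a toy cost model: travel time is distance at a fixed speed plus a flat penalty per bend. The 10 m/s speed and 10 s per-bend penalty here are assumptions chosen to match the table above, not the actual turnbot profile values:

```cpp
// Toy model of the scenario's timing: 300 m at 10 m/s gives 30 s for
// the straight way abcd; efgh adds two 10 s bend penalties for 50 s.
double routeSeconds(double meters, int bends,
                    double speed_mps = 10.0, double bend_penalty_s = 10.0) {
    return meters / speed_mps + bends * bend_penalty_s;
}
```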

woodbri commented 10 years ago

@DennisOSRM FYI, pgr2osrm does not work with OSRM v0.4 because the file formats have changed again, and I'm not going to have time to reverse engineer and debug the new code on every release. It would be really great if we could come up with a simple table-like API for loading data: a table of points, a table of edges, a table of restrictions. Most shapefile data would have an edge as a polyline with only a node_id for each end, i.e. each edge is properly noded but might contain multiple shape points.

So if you can provide a stable API, I will write the loading code for shapefiles and a PostgreSQL database, or look into getting GDAL/OGR to support the API.

DennisOSRM commented 10 years ago

As mentioned in the other issue: the way to go is to use a structured but efficient serialization protocol like protobuf. It's the same framework that is used to encode the pbf files for OpenStreetMap data. Protobuf has bindings for all major programming languages and is pretty easy to use/read/understand. Let's put this on the roadmap for the 0.4.x release series.

woodbri commented 10 years ago

For reference:

This is a C and Java implementation that supposedly reads and writes OSM pbf files: https://github.com/scrosby/OSM-binary. It includes a simple tool to read and print a structural overview/outline of a pbf's contents.

DennisOSRM commented 10 years ago

Protobuf is the general framework, while the OSM binary format is one application of it. We plan to use the framework with our own format spec.
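For illustration, such a format spec might start out something like the sketch below. The package, message, and field names here are invented, not an actual OSRM schema:

```proto
// Hypothetical schema sketch; field layout is illustrative only.
syntax = "proto2";

package osrm.io;

message Node {
  required sint64 id  = 1;
  required sint32 lat = 2;  // fixed-point, e.g. degrees * 1e6
  required sint32 lon = 3;
}

message Edge {
  required sint64 source  = 1;
  required sint64 target  = 2;
  required uint32 weight  = 3;
  optional uint32 name_id = 4;
}

message Restriction {
  required sint64 from = 1;
  required sint64 via  = 2;
  required sint64 to   = 3;
}
```

A schema like this would give third-party loaders (shapefile, PostgreSQL, GDAL/OGR) a versioned contract to code against, in any language with protobuf bindings.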

woodbri commented 10 years ago

Ok, that makes sense. I'll look forward to that and wait for you to publish more on it. Thank you.

sweco-semtne commented 8 years ago

Is there any update on this?

daniel-j-h commented 8 years ago

No progress on this front. In a different project I implemented a zero-copy, streaming, Protobuf-powered I/O layer for serializing and deserializing huge graph data structures (think North America), with an optional transparent compression/decompression stream in between.

The implementation is not too hard and has a lot of advantages: for example, you could pre-process graphs on a larger machine (think AWS EC2) and then move the portable serialized artifact to your laptop, or write tools against the formalized Protobuf schema, even from different languages.

The huge disadvantage, and the reason we do not yet have this for osrm-backend, is the following: the overhead of serializing and deserializing, that is, of parsing and constructing the Protobuf messages. Byte-wise dumping and reading of data structures --- as bad as it is (and we are aware of this: https://github.com/Project-OSRM/osrm-backend/issues/1682, https://github.com/Project-OSRM/osrm-backend/issues/1685) --- is orders of magnitude faster than Protobuf. This is somewhere Cap'n Proto or FlatBuffers might shine, but we have not yet explored those options.

If you consider this in the context of our recent Traffic effort, where we strive to shave every possible second off the subsequent pre-processing steps, it's clear that the current byte-wise dumping is the baseline any portable format has to fight against.

daniel-j-h commented 7 years ago

Superseded by https://github.com/Project-OSRM/osrm-backend/issues/2242.

@danpat @ghoshkaj and @duizendnegen specced this out and are currently working on it. The first sweep over the code base abstracted all the manual reads and writes behind types which can be re-used. See https://github.com/Project-OSRM/osrm-backend/issues/2242 for more details.