Project-OSRM / osrm-backend

Open Source Routing Machine - C++ backend
http://map.project-osrm.org
BSD 2-Clause "Simplified" License

New OSRM data format #2242

Closed. TheMarex closed this issue 6 years ago

TheMarex commented 8 years ago

Currently our output on a global planet extract looks like this:

  25G latest.osrm
    4 latest.osrm.core
    8 latest.osrm.datasource_indexes
   12 latest.osrm.datasource_names
 7.7G latest.osrm.ebg
 7.7G latest.osrm.edges
 980M latest.osrm.enw
  21G latest.osrm.fileIndex
  11G latest.osrm.geometry
  18G latest.osrm.hsgr
 980M latest.osrm.level
 119M latest.osrm.names
 8.8G latest.osrm.nodes
   12 latest.osrm.properties
 1.3G latest.osrm.ramIndex
  15M latest.osrm.restrictions
   20 latest.osrm.timestamp

Yes, these are 16 files. It's about time that we consolidate this. This issue should capture our requirements around a new data format. Off the top of my head:

We are kind of fortunate in the sense that we don't need to support complex nested data types. Almost all data we use is just a big array of something: usually 32-bit integers, no floating point and no pointers (thankfully). So we might get away with a much less complex solution (no alignment problems). What we want to make sure of is that we get the following right:

Existing solutions here:

Reading through Protobuf and Cap'n Proto, they don't strike me as immediate fits. Both rely on schema-based generators. I would expect we could capture the very tightly scoped functionality from above in a single header-only library.

References

/cc @daniel-j-h @danpat @miccolis

TheMarex commented 8 years ago

As per semantic versioning, I'm not sure if this would break the API. Technically all these formats are used internally only. If we name the output file output.osrm, it would not even break any paths.

daniel-j-h commented 8 years ago

Prior art:

Even with the monstrous list of checks in (*) we would still have to take care of endianness and other edge cases no one thinks about. If we really want to go that route, this technique is what we should aim for: https://rodgert.github.io/2014/09/09/type-driven-wire-protocols-with-boost-fusion-pt1/
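For illustration, the core of that Boost.Fusion technique looks roughly like the sketch below: adapt each on-disk record as a Fusion sequence and visit every member with one generic functor. The EdgeRecord type here is hypothetical, not an actual OSRM struct.

// Sketch only: a hypothetical fixed-width record adapted as a Fusion sequence,
// so a single generic visitor can byte-swap every member without per-field code.
#include <cstdint>
#include <boost/endian/conversion.hpp>           // boost::endian::endian_reverse
#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/fusion/include/for_each.hpp>

struct EdgeRecord // hypothetical on-disk record
{
    std::uint32_t source;
    std::uint32_t target;
    std::uint32_t weight;
};
BOOST_FUSION_ADAPT_STRUCT(EdgeRecord, source, target, weight)

struct ReverseEndianness
{
    template <typename Field> void operator()(Field &field) const
    {
        field = boost::endian::endian_reverse(field);
    }
};

// Call this only when the file's byte order differs from the host's.
inline void swapAllMembers(EdgeRecord &record)
{
    boost::fusion::for_each(record, ReverseEndianness{});
}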

Note there is also Bond, which I used in my little Distributed Search example (ref: schema, serialization, deserialization). Of all the serialization libs, it has the best stdlib integration (which may or may not help us, as we're using stxxl containers anyway).

Serialization with encoding / decoding (à la Protobuf) is too slow, we had that experience internally. Staying as close to a binary dump as possible is the way to go.

Kenton Varda's Cap'n Proto is such a format: it standardizes what is common nowadays anyway and provides accessor functions that e.g. swap endianness only on platforms where we have to.

Please read this highly informative thread about details and a comparison to other serialization formats: https://www.reddit.com/r/cpp/comments/3xmue9/quickly_loading_things_from_disk/cyaw5o3

Requirements: https://capnproto.org/install.html#supported-compilers. Windows support only lacks reflection (which we don't need).

daniel-j-h commented 8 years ago

More reading material:

(also search for kentonv's replies ^)

danpat commented 8 years ago

Without doing a massive re-write, I think we should consider two smaller steps first:

  1. Write files in little-endian byte order, always. For x86, this will be a no-op, and we can provide macros for other platforms.
  2. Ensure we're writing known-sized datatypes. Switch to uint32_t instead of unsigned, etc., and assert struct sizes when we're writing them to disk (see the sketch at the end of this comment).

If we do this across all the file reading/writing, it will get us a long way towards portability, and it should mostly just be a search-and-replace exercise rather than a large restructuring.

This will unlock some file portability for minimum effort.
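A minimal sketch of step 2 above; the struct and its expected size are purely illustrative, not an existing OSRM type:

// Hypothetical example: pin down the on-disk layout with fixed-width types
// and fail the build if a type change or padding alters the struct size.
#include <cstdint>

struct SegmentRecord            // illustrative only
{
    std::uint32_t node_id;      // was: unsigned
    std::uint32_t weight;       // was: int
    std::uint8_t flags;
    std::uint8_t padding[3];    // make padding explicit
};
static_assert(sizeof(SegmentRecord) == 12, "on-disk layout changed unexpectedly");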

danpat commented 8 years ago

I did a little bit of profiling while flying home tonight. I wanted to know the overhead of using ntohl/htonl on big datasets. TL;DR - not much. Looking at the gcc x86 assembly output on OSX it compiles to BSWAP, which only takes a single clock cycle.

Quick-n-dirty test program here: https://gist.github.com/danpat/dd2e16131d3a723d75cf727ac0957703

compiled with: g++ -O3 -I/usr/local/include -L/usr/local/lib -lboost_timer -ltbb -std=c++11 test.cc -o test

Results:

./test
Allocating 4000000000 bytes
Populating
 0.418140s wall, 1.470000s user + 0.010000s system = 1.480000s CPU (353.9%)
parallel_for_each
 0.417205s wall, 1.520000s user + 0.010000s system = 1.530000s CPU (366.7%)
parallel_for(blocked_range)
 0.450355s wall, 1.540000s user + 0.010000s system = 1.550000s CPU (344.2%)

So, it's pretty lightweight, almost a no-op (in fact, on big-endian machines, the macro expands to nothing and the code is removed).

As a first step, we should put all values in network byte order before writing to disk, and convert to host byte order when reading back. This would mean datafiles would work across different endian architectures.

Next step would be locking down primitive sizes (e.g. using fixed uint32_t instead of assuming unsigned is 32 bits).
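A minimal sketch of that first step for a vector of 32-bit values; the helper names are made up for illustration:

// Sketch: convert to network byte order just before writing, and back to host
// order after reading. On big-endian hosts htonl/ntohl are no-ops.
#include <arpa/inet.h> // htonl, ntohl (POSIX)
#include <cstdint>
#include <fstream>
#include <vector>

inline void writeNetworkOrder(std::ofstream &out, std::vector<std::uint32_t> data)
{
    // taken by value on purpose: we swap in place in the copy, not in the caller's data
    for (auto &value : data)
        value = htonl(value);
    out.write(reinterpret_cast<const char *>(data.data()),
              data.size() * sizeof(std::uint32_t));
}

inline void readHostOrder(std::ifstream &in, std::vector<std::uint32_t> &data)
{
    in.read(reinterpret_cast<char *>(data.data()),
            data.size() * sizeof(std::uint32_t));
    for (auto &value : data)
        value = ntohl(value);
}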

daniel-j-h commented 8 years ago

http://en.cppreference.com/w/cpp/language/types#Data_models

Ints are fine when constraining ourselves to 32/64-bit Linux; long, pointers, and size_t are the problematic ones.

A first step in your direction would be:

and then replace all I/O functionality with this globally.
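For example, a pair of helpers along these lines (names hypothetical) could become the single funnel for all raw reads and writes:

// Sketch of generic helpers that all file I/O could be routed through; the
// static_assert rejects any type that is not safe to dump byte-for-byte.
#include <cstddef>
#include <fstream>
#include <type_traits>

template <typename T>
void writeRaw(std::ofstream &out, const T *data, std::size_t count)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "only fixed-layout, trivially copyable types may be written to disk");
    out.write(reinterpret_cast<const char *>(data), count * sizeof(T));
}

template <typename T>
void readRaw(std::ifstream &in, T *data, std::size_t count)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "only fixed-layout, trivially copyable types may be read from disk");
    in.read(reinterpret_cast<char *>(data), count * sizeof(T));
}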

duizendnegen commented 7 years ago

I'm interested in helping push this forward. I'll do a survey of where reading/writing happens first, then see if I can wrap that in Cap'n Proto easily. It's been some time since I worked on C++ stuff, but I'm looking forward to playing with it again.

duizendnegen commented 7 years ago

I wrote a demo program that reads/writes int32 vectors, comparing the native save as seen in the current osrm-backend with Cap'n Proto. Inspired by @danpat's implementation (thanks!): https://gist.github.com/duizendnegen/875cfa14e95cb09f7cb6fab893b4ea81

On my machine:

Allocating 400000000 bytes
Populating
Writing as binary to disk
 1.051533s wall, 0.250000s user + 0.546875s system = 0.796875s CPU (75.8%)
Writing with cap'nproto to disk
 1.145941s wall, 0.312500s user + 0.671875s system = 0.984375s CPU (85.9%)
Reading as binary from disk
 0.465856s wall, 0.125000s user + 0.328125s system = 0.453125s CPU (97.3%)
Reading with cap'nproto from disk
 0.848753s wall, 0.437500s user + 0.406250s system = 0.843750s CPU (99.4%)

File size on disk is identical for my machine. Curious what kind of results you obtain.

daniel-j-h commented 7 years ago

Just a note here: your serialize functions make a full copy of the data

serialize(std::vector<std::uint32_t> data);

you should take a read-only view (const vec& v) instead. Maybe also assert the roundtrip actually works. Could you also upload the schema file?

(Also check out std::iota)
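For illustration, the suggested changes might look like this (serialize is the function from the Gist; the rest is hypothetical):

// Take a read-only view instead of copying the vector, and build test data with std::iota.
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

void serialize(const std::vector<std::uint32_t> &data); // no copy at the call site

std::vector<std::uint32_t> makeTestData(std::size_t count)
{
    std::vector<std::uint32_t> data(count);
    std::iota(data.begin(), data.end(), 0u); // fills 0, 1, 2, ...
    return data;
}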

duizendnegen commented 7 years ago

Yep, next steps were to calculate a checksum and verify integrity. I'll update the gist tonight, then send another ping, including the schema.

TheMarex commented 7 years ago

Your blocked serialization seems strange to me. Usually we just write the whole memory block with one syscall and let the OS deal with buffering the write. For a fair comparison, that should be replicated here.
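For clarity, a minimal sketch of writing the whole block in one go (assuming a vector of trivially copyable 32-bit entries; names are illustrative):

// Sketch: hand the entire contiguous block to the stream in one bulk write
// instead of looping over fixed-size chunks; the OS handles buffering.
#include <cstdint>
#include <fstream>
#include <vector>

inline void writeBlock(std::ofstream &out, const std::vector<std::uint32_t> &data)
{
    const std::uint64_t count = data.size();
    out.write(reinterpret_cast<const char *>(&count), sizeof(count));
    out.write(reinterpret_cast<const char *>(data.data()),
              count * sizeof(std::uint32_t));
}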

Before we launch into this, we should first reiterate what the design goals here are:

  1. Why would we use Cap'n Proto?
  2. What downsides are there?
  3. Why not use @danpat's suggestion of just normalizing byte order?

duizendnegen commented 7 years ago

I used the serialization behavior as found in osrm-backend's io.hpp. I'll gladly switch this out and let the OS do the buffering.

The goal as I understand it (or at least, my goal) is to be able to serialize platform-independently. Using a library for that would be my preferred approach (over having to maintain platform-specific normalization ourselves), but I don't want to introduce extra dependencies unnecessarily, so I agree that iterating on the software design is critical. Possibly we can talk on IRC (_1009 on #osrm); I'll also try to address some concerns in this issue by tonight.

daniel-j-h commented 7 years ago

After talking to @TheMarex and @danpat, the following is a plan to completely rework OSRM's I/O handling in an incremental manner. To reiterate, our priorities are to roughly keep the current performance but also to refactor the I/O code, which is currently sprinkled all over the code base.

duizendnegen commented 7 years ago

As mentioned on IRC, this sounds like a great roadmap - I hope my small benchmark program can be instrumental in comparing serialization for different data types.

I volunteered to scan for I/O operations across osrm-backend, I'll try and get to that soon-ish.

In the meantime, I updated my Gist with the (very basic) schema, some checksum checking, and std::iota (nice, didn't know that).

duizendnegen commented 7 years ago

I charted I/O usage in #3275, placing TODO notes wherever io.hpp, ifstream, ofstream, or fstream was used. A partial scan of data types can also be found there.

I'd be happy to proceed with creating a File type, but let's first discuss whether there's enough information available now to move forward. Input is welcome.

I must also add: data serialization with a schema will be a challenge (as there are lots of slightly different types in use), but if we decide to go this way, then creating a new data format and potentially wrapping all output in one big osrm file might be feasible.

duizendnegen commented 7 years ago

@daniel-j-h thanks for your thoughts in your notes about needing a File abstraction, or something even more abstract that also covers MemoryMap, ... Great to have that be part of the discussion in general. I updated #3275 to also collect the main documents up for consideration there.

With regards to moving ahead and implementing some kind of data writing/reading mechanism, we should also carefully consider its interface. If it exposes the ability to read/write raw bytes, then schema tools should be disqualified. This is because we would then use something like reinterpret_cast<const char *>(buf) to get at the bytes, and all the architecture problems surface again, unless we use something along the lines of @danpat's proposal for normalizing byte order.

I would argue there is an advantage to serializing only known types and returning type-safe results from the serialization layer (be it to file or to memory): type checking and the serialization error surface are reduced to a much smaller part of the code, and the complexity of juggling bytes is kept away from the actual logic.

The downside, however, is that we have to create interfaces for all the known types, and that there's a certain cost (in programming effort) to introducing new types. While this is not negligible, in my opinion it is surmountable.

This does mean that before abstracting some kind of File (or Serialization) layer, we need to decide which solution we eventually want to use for writing to disk (with fstream), writing to disk with other serialization tools, and writing to memory. That implies we need to benchmark the different serialization approaches before moving further on the interface for the Serialization layer.
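To make that trade-off concrete, a type-safe layer on top of a byte-level File could look roughly like this; all names here are hypothetical, not existing OSRM interfaces:

// Hypothetical interface sketch: callers never touch raw bytes; only this layer
// performs the reinterpret_cast and any byte-order normalization.
#include <cstdint>
#include <vector>

class File; // byte-level abstraction (raw buffer read/write), defined elsewhere

struct TurnRecord // example of a known, fixed-layout type
{
    std::uint32_t from;
    std::uint32_t via;
    std::uint32_t to;
};

namespace serialization
{
// one overload pair per known type
void write(File &file, const std::vector<std::uint32_t> &values);
void read(File &file, std::vector<std::uint32_t> &values);

void write(File &file, const std::vector<TurnRecord> &turns);
void read(File &file, std::vector<TurnRecord> &turns);
} // namespace serialization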

(cc @ghoshkaj)

ghoshkaj commented 7 years ago

Here's what I have so far. I haven't gotten all the way through all of the files. https://gist.github.com/ghoshkaj/52116c4a717c45b0386e890c42fcacd2

danpat commented 7 years ago

Also, just a heads up, I'm getting close to merging https://github.com/Project-OSRM/osrm-backend/pull/3165

This consolidates the Shared/InternalDataFacades to work basically the same; the only difference is that one uses a big block of IPC shared memory and the other uses process-internal memory.

This will simplify the refactoring work here as we'll only have one set of file loading procedures to deal with (for now).

ghoshkaj commented 7 years ago

So, from profiling and from talking with @danpat and @TheMarex, this is the conclusion that has been drawn (paraphrasing @danpat here): a lot of the data structures are vectors of structs or vectors of native types. So a good first step is to use something like a "deserializeVector" function (a rough sketch follows at the end of this comment). Any time we write a file it should contain:

  1. a fingerprint/header
  2. a "number of entries" std::uin64_t
  3. that number of structs
  4. checksum

More complicated cases to figure out after the simple cases are done: (a) querynode/originaledgedata, and (b) files that contain multiple vectors of data appended one after the other, like hsgr.
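A hedged sketch of the simple case above; the fingerprint and checksum handling are placeholders, not OSRM's actual implementation:

// Sketch of the simple case: fingerprint/header, entry count, entries, checksum.
#include <cstdint>
#include <fstream>
#include <vector>

template <typename T> std::vector<T> deserializeVector(std::ifstream &in)
{
    std::uint32_t fingerprint = 0;  // 1. fingerprint/header (placeholder size)
    in.read(reinterpret_cast<char *>(&fingerprint), sizeof(fingerprint));

    std::uint64_t count = 0;        // 2. number of entries
    in.read(reinterpret_cast<char *>(&count), sizeof(count));

    std::vector<T> entries(count);  // 3. that number of structs
    in.read(reinterpret_cast<char *>(entries.data()), count * sizeof(T));

    std::uint32_t checksum = 0;     // 4. checksum over the data
    in.read(reinterpret_cast<char *>(&checksum), sizeof(checksum));
    // e.g. compare against a recomputed checksum of `entries` and fail loudly on mismatch

    return entries;
}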

danpat commented 7 years ago

Re-reading all the earlier comments, this approach drifts a bit away from the original proposal of "one big file". However, we don't need to solve every problem right away. Let's start by untangling the current file reading/writing and push everything through a common framework.

I don't think we want the performance hit of any of the common serialization frameworks. At this stage, we're not proposing exporting these data files to any other systems, so we don't need quite the level of compatibility/portability those frameworks offer.

The following is a very simple file format proposal that standardizes what we're doing somewhat, but isn't a huge jump from where we are right now:

Header

Pseudocode:

struct OSRMFileHeader {
  std::uint8_t prefix[4] = {'O','S','R','M'};
  std::uint16_t creator_major_version = 1;  // version of whoever wrote the file
  std::uint16_t creator_minor_version = 1;
  std::uint16_t creator_patch_version = 1;
  std::uint16_t reserved;                   // unused
  std::uint64_t entry_count;                // how many entries
  std::uint64_t entry_size;                 // the size of 1 entry
  char entry_typename[32];                  // 32-byte (or null-terminated if shorter) name of the struct in this file
  std::uint32_t file_crc32;                 // CRC32 of all data except the CRC32 field
  std::uint8_t byte_order = 0;              // 0 = big endian, 1 = little endian
  // 65 bytes to this point
  std::uint8_t padding[65 % entry_size];    // padding to align the header (and thus the data section) for mmap
};

IMPORTANT the header includes a byte_order field. The writing process should write all values in host byte order, and save the endianness flag in the header. This avoids the need to do per-struct byte conversion, allowing bulk memory-to-disk writes.

Upon reading, the reader can bulk-read the data, then perform an in-memory byte-order conversion if necessary. When transporting datafiles between architectures of the same endianness (the common case), this avoids the byte-order conversion overhead entirely.


Reading

We will need void bigEndianToHostByteOrder<T>(T &obj) and void littleEndianToHostByteOrder<T>(T &obj) functions, with template specializations for each type we intend to read back from disk.
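For illustration, the per-type hooks could follow this pattern (EdgeEntry is a hypothetical struct; Boost.Endian is just one way to do the conversion):

// Sketch: one specialization per on-disk type; little_to_native is a no-op on
// little-endian hosts and a byte swap on big-endian ones.
#include <cstdint>
#include <boost/endian/conversion.hpp>

struct EdgeEntry // hypothetical on-disk struct
{
    std::uint32_t source;
    std::uint32_t target;
    std::uint32_t weight;
};

template <typename T> void littleEndianToHostByteOrder(T &obj); // specialize per type

template <> inline void littleEndianToHostByteOrder<EdgeEntry>(EdgeEntry &obj)
{
    obj.source = boost::endian::little_to_native(obj.source);
    obj.target = boost::endian::little_to_native(obj.target);
    obj.weight = boost::endian::little_to_native(obj.weight);
}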

Code reading these files should:

  1. Read the fixed-size OSRMFileHeader
  2. Check byte order. Call *ToHostByteOrder<OSRMFileHeader>(header) if necessary
  3. Verify the file format version is understood.
  4. Verify that the entry_size and typename values are as expected (compare entry_size to sizeof(typename))
  5. Adjust file position to skip any header padding
  6. Read entry_size x entry_count bytes from disk into the destination buffer
  7. Calculate and validate the CRC32 for all the data read so far
  8. If byte order swapping is necessary, reinterpret_cast the buffer, and call *ToHostByteOrder<TYPENAME>(buffer) for each struct read.

mmap()

If we decide to implement a low-memory *DataFacade, this file format should work well with mmap(). Byte-order swapping would need to happen on access, but that can be behaviour specific to that facade.


This is a proposal - please pick holes in it.

I appreciate the NIH aspects of this. However, having evaluated many serialization frameworks, I don't think our needs are complex enough to warrant the overhead they incur. Processing times for OSRM on large datasets are already long; we don't want to add more overhead unnecessarily.

TheMarex commented 7 years ago

Questions:

Proposals:

  1. Make sure the data section is aligned to multiples of entry_size (for mmap).
  2. Replace OSRMFileV1SubHeader::fingerprint with the OSRM version.
  3. Move the OSRM version into the primary header.

danpat commented 7 years ago

Based on discussions today, @daniel-j-h suggested an API similar to those offered in Microsoft Bond: https://microsoft.github.io/bond/manual/bond_cpp.html

@daniel-j-h Can you sketch out how you see the interfaces looking for file reading/writing?

@TheMarex Suggestions noted - I've modified the proposed file structure above.

danpat commented 7 years ago

@TheMarex What are your thoughts on the fingerprint? We're currently generating it based on MD5 sums of some source files. This works reasonably well during development: if we change anything in our I/O code, the fingerprint changes and we get notified, even if we forget to update the OSRM version number. If we moved to a simpler "compare the major/minor/patch fields" approach, we'd add the burden of keeping those up to date during development.

My inclination is to keep the fingerprint, but maybe adjust how it's being calculated - if we can consolidate the I/O code into a smaller set of files that only actually change when I/O structure changes, then it could continue to be useful during development, but be a lot more stable.

TheMarex commented 7 years ago

@danpat we already expose the version over cmake, so adding it would just mean using a constant somewhere that we need anyway for pkg-config.

I'm not sure we should optimize for development here; usually you know when something you changed breaks the data, and the tests catch it automatically anyway.

My main argument for removing the current code is that it is a lot of code for very little functionality. Removing it would decrease our surface area. (and not make us depend on weird boost modules for uuid generation etc.)

daniel-j-h commented 7 years ago

@daniel-j-h Can you sketch out how you see the interfaces looking for file reading/writing?

Sure, here is roughly what I had in mind:

Usage would look like this.

To implement both abstractions from above we

All other code in the Gist is 1) exception types we can throw in case of errors and 2) concept emulation to get nice one-line compile-time error messages.

duizendnegen commented 7 years ago

Good to read this discussion; I feel like we're moving toward a consensus about what's in scope for improvement and how to approach cross-platform / cross-architecture support.

Two questions / concerns from my side:

TheMarex commented 7 years ago

"mentioned we're moving away from everything in one big file to be included in the scope of this change"

I think this is okay. The main argument for having "all in one file" was usability, but we can also achieve that through other means (e.g. making it more obvious which files go where: osrm-contract, osrm-routed, both, ...). This can be handled in subsequent iterations.

duizendnegen commented 7 years ago

The File protocol is being addressed at #3321 (just putting this out here so new readers are aligned with what's happening where).

I found another type of I/O usage: the libosmium io module (e.g. https://github.com/Project-OSRM/osrm-backend/blob/master/src/extractor/extractor.cpp#L131). Would it make sense to go through the project and mark those occurrences too? I assume we would also need to refactor those into the same File and Serialization protocols.

danpat commented 7 years ago

@duizendnegen the libosmium I/O can probably remain as-is - it already wraps the PBF binary on-disk format, and we only use it at that one location. Adding error handling there could be useful though.

duizendnegen commented 7 years ago

Special care should be taken with the *.osrm.datasource_names file produced by osrm-contract, as it is plain text separated by newlines, where Windows generates \r\n and Unix only \n. (This seems to be the main exception to the "vectors of structs and vectors of simple types" rule.)

danpat commented 7 years ago

We could binarize this file like everything else - I think that might be preferable to maintaining annoying code for such a tiny feature.
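For illustration, a length-prefixed binary layout (hypothetical) would sidestep the newline issue entirely:

// Hypothetical length-prefixed layout for the name strings: no newline handling,
// so no \r\n versus \n differences between platforms.
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

inline void writeNames(std::ofstream &out, const std::vector<std::string> &names)
{
    const std::uint32_t count = names.size();
    out.write(reinterpret_cast<const char *>(&count), sizeof(count));
    for (const auto &name : names)
    {
        const std::uint32_t length = name.size();
        out.write(reinterpret_cast<const char *>(&length), sizeof(length));
        out.write(name.data(), length);
    }
}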

duizendnegen commented 7 years ago

#3496 was just merged; most I/O now goes through FileWriter / FileReader.

What remains:

Could we get the top of this issue updated with this (or an adapted version of this) to-do list, for easier tracking?

duizendnegen commented 7 years ago

@danpat @ghoshkaj @daniel-j-h - just wanted to ping you regarding the usage of FileReader. Reaching the goal of this issue (I/O normalization, I/O serialization) is much harder when new primitives for specific use cases are introduced in the normalized file access layer.

I'm referring to GetVectorMemorySize, ReadVectorSize, ReadLines, ReadLine, but also generally to the use of Skip from FileReader (https://github.com/Project-OSRM/osrm-backend/blob/master/include/storage/io.hpp). This makes e.g. #3848 much harder to accomplish, as there are now too many different cases in which the ElementCount can be affected (referring to Skip). The main risk I see is that by breaking encapsulation like this (we are just putting a small wrapper around I/O functions) no abstraction is actually accomplished, leading to unmaintainable software.

In my opinion, utility functions like GetVectorMemorySize (which at the moment is apparently unused) are the right way to go, but they belong in a different abstraction: specifically the Serialization abstraction, not the File abstraction. To be precise, my proposal would be to move these kinds of functions into their own layer, and with that also to start on the cross-platform I/O.

Welcoming your thoughts!

TheMarex commented 6 years ago

This has shipped with #4960; we now use binary-in-tar as data storage instead of opaque blocks. For pushing this further, to something like FlatBuffers inside the tar files, we should open a new issue when it becomes relevant.