Closed: TheMarex closed this issue 6 years ago.
As per semantic versioning, I'm not sure if this would break the API. Technically all these formats are used internally only. If we name the output file `output.osrm` it would not even break any paths.
Prior art:
Even with the monstrous list of checks in (*) we would still have to take care of endianness and other edge cases no one thinks about. If we really want to go that route, this technique is what we should aim for: https://rodgert.github.io/2014/09/09/type-driven-wire-protocols-with-boost-fusion-pt1/
Note there is also Bond, which I used in my little Distributed Search example (ref: schema, serialization, deserialization). Of all the serialization libs, it has the best stdlib integration (which may or may not help us, as we're using stxxl containers anyway).
Serialization with encoding / decoding (à la Protobuf) is too slow, we had that experience internally. Staying as close to a binary dump as possible is the way to go.
Kenton Varda's Cap'n Proto is such a format, standardizing what is common nowadays anyway, providing accessor functions that e.g. swap endianness only on platforms where we have to.
Please read this highly informative thread about details and a comparison to other serialization formats: https://www.reddit.com/r/cpp/comments/3xmue9/quickly_loading_things_from_disk/cyaw5o3
Requirements: https://capnproto.org/install.html#supported-compilers. Windows support only lacks reflection (which we don't need).
More reading material:
(also search for kentonv's replies ^)
Without doing a massive re-write, I think we should consider two smaller steps first:
Use fixed-width types like `uint32_t` instead of `unsigned`, etc., and assert struct sizes when we're writing them to disk. If we do this across all the file reading/writing, it will get us a long way towards portability, and it should mostly just be a search/replace exercise rather than a large re-structuring.
This will unlock some file portability for minimum effort.
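A minimal sketch of what such compile-time checks could look like (the `DiskEdge` struct here is purely illustrative, not one of OSRM's actual types):

```cpp
#include <cstdint>
#include <type_traits>

// Purely illustrative on-disk record; OSRM's real structs differ.
struct DiskEdge
{
    std::uint32_t source; // fixed-width instead of `unsigned`
    std::uint32_t target;
    std::uint32_t weight;
};

// Fail the build if the layout is not exactly what we expect to dump to disk.
static_assert(sizeof(DiskEdge) == 12, "unexpected padding in DiskEdge");
static_assert(std::is_trivially_copyable<DiskEdge>::value,
              "DiskEdge must be memcpy-able for binary I/O");
```

Because the checks are `static_assert`s, a layout change on any platform fails the build rather than silently producing incompatible data files.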
I did a little bit of profiling while flying home tonight. I wanted to know the overhead of using `ntohl`/`htonl` on big datasets. TL;DR: not much. Looking at the gcc x86 assembly output on OSX, it compiles to `BSWAP`, which only takes a single clock cycle.
Quick-n-dirty test program here: https://gist.github.com/danpat/dd2e16131d3a723d75cf727ac0957703
compiled with:

```shell
g++ -O3 -I/usr/local/include -L/usr/local/lib -lboost_timer -ltbb -std=c++11 test.cc -o test
```
Results:

```
./test
Allocating 4000000000 bytes
Populating
0.418140s wall, 1.470000s user + 0.010000s system = 1.480000s CPU (353.9%)
parallel_for_each
0.417205s wall, 1.520000s user + 0.010000s system = 1.530000s CPU (366.7%)
parallel_for(blocked_range)
0.450355s wall, 1.540000s user + 0.010000s system = 1.550000s CPU (344.2%)
```
So, it's pretty lightweight, almost a no-op (in fact, on big-endian machines, the macro expands to nothing and the code is removed).
As a first step, we should put all values in network byte order before writing to disk, and convert to host byte order when reading back. This would mean datafiles would work across different endian architectures.
Next step would be locking down primitive sizes (e.g. using a fixed `uint32_t` instead of assuming `unsigned` is 32 bits).
http://en.cppreference.com/w/cpp/language/types#Data_models
Ints are fine when constraining ourselves to 32/64-bit Linux; `long`, pointers, and `size_t` are the problematic ones.
A first step in your direction would be checking `is_trivially_copyable` (not defined in older libstdc++ versions) and then replacing all I/O functionality with this globally.
I'm interested in helping push this forward - I'll be doing a survey of where read/write is happening first, then seeing if I can wrap that in Cap'n Proto easily. It's been some time since I worked on C++ stuff, but I'm looking forward to playing with it again.
I wrote a demo program which reads/writes int32 vectors, comparing the native save as seen in the current osrm-backend with capnproto. Inspired by @danpat's implementation (thanks!): https://gist.github.com/duizendnegen/875cfa14e95cb09f7cb6fab893b4ea81
On my machine:

```
Allocating 400000000 bytes
Populating
Writing as binary to disk
1.051533s wall, 0.250000s user + 0.546875s system = 0.796875s CPU (75.8%)
Writing with cap'nproto to disk
1.145941s wall, 0.312500s user + 0.671875s system = 0.984375s CPU (85.9%)
Reading as binary from disk
0.465856s wall, 0.125000s user + 0.328125s system = 0.453125s CPU (97.3%)
Reading with cap'nproto from disk
0.848753s wall, 0.437500s user + 0.406250s system = 0.843750s CPU (99.4%)
```
File size on disk is identical for my machine. Curious what kind of results you obtain.
Just a note here: your serialize functions make a full copy of the data (`serialize(std::vector<std::uint32_t> data);`); you should take a read-only view (`const vec &v`) instead. Maybe also assert the roundtrip actually works. Could you also upload the schema file?
(Also check out `std::iota`.)
Yep, next steps were to calculate a checksum and verify integrity - I'll update the gist tonight, then will send another ping, including the schema.
Your blocked serialization seems strange to me. Usually we just write the whole memory block with one syscall and let the OS deal with buffering the write process. For a fair comparison that should be replicated here.
Before we launch into this, we should first reiterate what the design goals here are:
I used the serialization behavior as I found in osrm-backend's io.hpp. I'll gladly switch this out, letting the OS do buffering etc.
The goal as I understand it (or at least, my goal) is to be able to serialize platform-independently. Using a library for that would be my preferred approach (over having to support platform-specific normalization ourselves), but I don't want to introduce extra dependencies unnecessarily - so I agree that iterating over the software design is critical. Possibly we can talk on IRC (`_1009` on `#osrm`); I'll try and address some concerns also in this issue by tonight.
After talking to @TheMarex and @danpat, the following is a plan to completely rework osrm's I/O handling in an incremental manner. To reiterate: our priorities are to roughly keep the current performance but also to refactor the I/O code, which is currently sprinkled all over the code base.
- Most I/O goes through the `io.hpp` header you already found. But some I/O calls are littered throughout the code base; for those, you have to search for the `.read` / `.write` calls.
- Files are opened as `std::ifstream`s or `boost::filesystem::fstream`s, but we don't need their formatted features: we always open them in binary mode and only ever use their binary `.read` / `.write` features.

As mentioned on IRC, sounds like a great roadmap - I hope my small benchmark program can be instrumental in comparing serialization for different data types.
I volunteered to scan for I/O operations across `osrm-backend`; I'll try and get to that soon-ish.
In the mean time: I updated my Gist, w/ the (very basic) schema, some checksum checking & using `std::iota` (nice, didn't know that).
I charted I/O usage in #3275, placing TODO notes where either `io.hpp` was used, or `ifstream`, `ofstream` or `fstream` was used. Also a partial scan of data types can be found there.
I'd be happy to proceed with creating a `File` type, but let's first discuss if there's enough information available now to move forward. Input is welcomed.
I must also add: data serialization with a schema will be a challenge (as there are lots of slightly different types being used), but if we decide to go this way, then creating a new data format & potentially wrapping all output in one big `osrm` file might be feasible.
@daniel-j-h thanks for your thoughts in your notes about needing a `File` abstraction or something even more abstract, something that also covers a MemoryMap, ... Great to have that be part of the discussion in general.
I updated #3275 to also collect there the main documents up for consideration.
With regards to moving ahead and implementing some kind of data writing / reading mechanism, we should also carefully consider the interface for this. If it offers the ability to read/write raw bytes, then schema tools should be disqualified. This is because we're then using something like `reinterpret_cast<const char *>(buf)` to get to the bytes, and all the architecture problems surface again, unless we use something along the lines of @danpat's proposal for normalizing byte order.
I would argue there is an advantage to only serializing known types and returning type-safe values from the serialization layer (be it to file or to memory): type checking and the serialization error surface are reduced to a much smaller part of the code, and the complexity of juggling bytes is taken away from implementing the actual logic.
The downside is, however, that we have to create interfaces for all the known types, and that there's a certain cost (in terms of programming effort) of introducing new types. While this is not negligible, in my opinion this is surmountable.
This does mean that before abstracting some kind of `File` (or `Serialization` layer) we need to think about which solution we eventually want to use for writing to disk (with `fstream`), writing to disk with other serialization tools & writing to memory. That implies we need to benchmark the different serialization candidates before moving further on creating the interface for the `Serialization` layer.
(cc @ghoshkaj)
Here's what I have so far. I haven't gotten all the way through all of the files. https://gist.github.com/ghoshkaj/52116c4a717c45b0386e890c42fcacd2
Also, just a heads up, I'm getting close to merging https://github.com/Project-OSRM/osrm-backend/pull/3165
This consolidates the Shared/InternalDatafacades to work basically the same - only one uses a big block of IPC shared memory, and the other uses process-internal memory.
This will simplify the refactoring work here as we'll only have one set of file loading procedures to deal with (for now).
So from profiling, talking with @danpat and @TheMarex, this is the conclusion that has been drawn: (paraphrasing @danpat here) a lot of the data structures are vectors of structs or vectors of native types. So a good first step is to use something like a "deserializeVector" function. Any time we write a file it should contain:
More complicated cases to figure out after the simple cases are done:
a. querynode/originaledgedata to figure out
b. some files contain multiple vectors of data appended one after the other like hsgr
Re-reading all the earlier comments, this approach drifts a bit away from the original proposal of "one big file". However, we don't need to solve every problem right away. Let's start by untangling the current file reading/writing and push everything through a common framework.
I don't think we want the performance hit of any of the common serialization frameworks. At this stage, we're not proposing exporting these data files to any other systems, so we don't need quite the level of compatibility/portability those frameworks offer.
The following is a very simple file format proposal that standardizes what we're doing somewhat, but isn't a huge jump from where we are right now:
```cpp
// pseudocode
struct OSRMFileHeader
{
    std::uint8_t prefix[4] = {'O','S','R','M'};
    std::uint16_t creator_major_version = 1; // version of whoever wrote the file
    std::uint16_t creator_minor_version = 1;
    std::uint16_t creator_patch_version = 1;
    std::uint16_t reserved;                   // unused
    std::uint64_t entry_count;                // how many entries
    std::uint64_t entry_size;                 // the size of 1 entry
    char entry_typename[32];     // 32-byte (or null-terminated if shorter) name of the struct in this file
    std::uint32_t file_crc32;    // CRC32 of all data except the CRC32 field
    std::uint8_t byte_order = 0; // 0 = big endian, 1 = little endian
    // 65 bytes to this point
    std::uint8_t padding[65 % entry_size]; // padding to align file header for mmap
};
```
IMPORTANT: the header includes a `byte_order` field. The writing process should write all values in host byte order, and save the endianness flag in the header. This avoids the need to do per-struct byte conversion, allowing bulk memory-to-disk writes.
Upon reading, the reader can bulk-read the data, then perform an in-memory byte-order conversion if necessary. For transporting datafiles between common architectures, this avoids the overhead of byte order conversion.
We will need `void bigEndianToHostByteOrder<T>(T &obj)` and `void littleEndianToHostByteOrder<T>(T &obj)` methods, and template specializations for each type we intend to read back from disk.
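A sketch of how one such specialization could look; `Entry`, `byteSwap32`, and `isLittleEndianHost` are illustrative helpers invented here, not existing OSRM code:

```cpp
#include <cstdint>

// Portable 32-bit byte swap (no compiler intrinsics assumed).
inline std::uint32_t byteSwap32(std::uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) | ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Runtime endianness probe of the host.
inline bool isLittleEndianHost()
{
    const std::uint16_t probe = 1;
    return *reinterpret_cast<const std::uint8_t *>(&probe) == 1;
}

// Illustrative record type; OSRM's real structs differ.
struct Entry
{
    std::uint32_t id;
    std::uint32_t weight;
};

// The specialization pattern from the proposal: swap only if the host differs
// from the byte order recorded in the file header.
template <typename T> void bigEndianToHostByteOrder(T &obj);

template <> void bigEndianToHostByteOrder<Entry>(Entry &e)
{
    if (isLittleEndianHost())
    {
        e.id = byteSwap32(e.id);
        e.weight = byteSwap32(e.weight);
    }
}
```

On the common little-endian/little-endian path the `littleEndianToHostByteOrder` counterpart would be a no-op, which is what makes the bulk read essentially free.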
Code reading these files should:

1. Read the `OSRMFileHeader`.
2. Call `*ToHostByteOrder<OSRMFileHeader>(header)` if necessary.
3. Verify the `entry_size` and `typename` values are as expected (compare `entry_size` to `sizeof(typename)`).
4. Read `entry_size * entry_count` bytes from disk into the destination buffer.
5. `reinterpret_cast` the buffer, and call `*ToHostByteOrder<TYPENAME>(buffer)` for each struct read.

If we decide to implement a low-memory, `mmap()`-based *DataFacade, the file format should work well. Byte-order swapping would need to occur on-access, but that can be behaviour specific to that facade.
This is a proposal - please pick it full of holes.
I appreciate the NIH aspects of this - however, after having evaluated many serialization frameworks, I don't think our needs are complex enough to warrant the overhead they incur. Processing times for OSRM with large datasets are already slow - we don't want to add more overhead unnecessarily.
Questions:
- Should we allow multiple `OSRMFileV1SubHeader` headers per file?

Proposals:

- Align to `entry_size`. (for mmap)
- Replace `OSRMFileV1SubHeader::fingerprint` with the OSRM version.
with OSRM versionBased on discussions today, @daniel-j-h suggested an API similar to those offered in Microsoft Bond: https://microsoft.github.io/bond/manual/bond_cpp.html
@daniel-j-h Can you sketch out how you see the interfaces looking for file reading/writing?
@TheMarex Suggestions noted - I've modified the proposed file structure above.
@TheMarex What are your thoughts on the fingerprint? We're currently generating that based on MD5sums of some source files. This works reasonably well during development - if we change anything in our I/O code, the fingerprint changes and we get notified, even if we forget to update the OSRM version number. If we moved to a simpler "compare the major/minor/patch fields" approach, we'd add the burden of keeping that up-to-date during development.
My inclination is to keep the fingerprint, but maybe adjust how it's being calculated - if we can consolidate the I/O code into a smaller set of files that only actually change when I/O structure changes, then it could continue to be useful during development, but be a lot more stable.
@danpat we already expose the version over cmake, so adding it would just be using a constant somewhere that we need anyway for pkg-config.
I'm not sure we should optimize for development here, usually you know when stuff you changed breaks the data and the tests take care of that automatically anyway.
My main argument for removing the current code is that it is a lot of code for very little functionality. Removing it would decrease our surface area. (and not make us depend on weird boost modules for uuid generation etc.)
> @daniel-j-h Can you sketch out how you see the interfaces looking for file reading/writing?
Sure, here is roughly what I had in mind:
Usage would look like this.
To implement both abstractions from above we only need the `File` abstraction; all other code in the Gist is 1/ exception types we can throw in case of errors and 2/ emulating concepts to get nice one-line compile-time error messages.
Good to read this discussion, I feel like we're moving to a consensus about what's in scope for improvement & how to approach cross-platform / cross-architecture.
Two questions / concerns from my side:
- Merging `vector`s of `node`, `rev_weight` and `fwd_weight` into one vector, so that it can be serialized as one. Ideally a `struct { node: [], rev_weight: [], fwd_weight: [] }` would be passed directly to a `Serialization` interface. What's the stance on how to move forward with moving concerns about how data is structured out of individual components and into the `Serialization` interface?
interface?mentioned we're moving away from everything in one big file to be included in the scope of this change
Think this is okay. The main argument for having "all in one file" was usability, but we can also achieve this through other means (e.g. making it more obvious which files go where: osrm-contract, osrm-routed, both, ...). This can be handled in subsequent iterations.
The `File` protocol is being addressed in #3321 (just putting this out here so new readers are aligned with what's happening where).
I found another type of I/O usage: using `libosmium`'s `io` module (e.g. https://github.com/Project-OSRM/osrm-backend/blob/master/src/extractor/extractor.cpp#L131). Would it make sense to go through the project and mark those occurrences, too? I assume we would also need to refactor those into the same `File` and `Serialization` protocols.
@duizendnegen the `libosmium` I/O can probably remain as-is - it already wraps the PBF binary on-disk format, and we only use it at that one location. Adding error handling there could be useful though.
Special care should be taken with the `osrm-contract`-produced file `*.osrm.datasource_names`, as it is plain text, separated by new lines, where Windows generates `\r\n` and Unix only generates `\n`. (This seems to be the main exception to the vectors-of-structs and vectors-of-simple-types rule.)
We could binarize this file like everything else - I think that might be preferable to maintaining annoying code for such a tiny feature.
What remains:

- Converting `osrm-extract`, `osrm-contract` and `osrm-routed` to use this. This still requires consensus on which serialization tool is used.
- The remaining `osrm-*` tools.

Could we get the top of this issue to be updated with this (or an adapted version of this) to-do list, for easier tracking?
@danpat @ghoshkaj @daniel-j-h - just wanted to ping you regarding usage of `FileReader`. Reaching the goal of this issue (I/O normalization, I/O serialization) is much harder when new primitives for specific use cases are introduced in the normalized file-access layer.
I'm referring to `GetVectorMemorySize`, `ReadVectorSize`, `ReadLines` and `ReadLine`, but also generally to the use of `Skip` from `FileReader`: https://github.com/Project-OSRM/osrm-backend/blob/master/include/storage/io.hpp.
This makes e.g. #3848 much harder to accomplish, as now there are too many different cases in which the ElementCount can be affected (referring to `Skip`).
The main risk I see is that by breaking encapsulation like this (we are just putting a small wrapper around I/O functions) no abstraction is actually accomplished, leading to unmaintainable software.
In my opinion those utility functions like `GetVectorMemorySize` (which is at the moment apparently unused) are the right way to go, but in a different abstraction - specifically the `Serialization` abstraction, not the `File` abstraction.
To be precise, my proposal would be to move these types of functions to their own layer, and with that also start the cross-platform I/O work.
Welcoming your thoughts!
This has shipped with #4960; we now use binary-in-tar as data storage instead of opaque blocks. For pushing this further to using something like FlatBuffers inside tar files, we should open a new issue when it becomes relevant.
Currently our output on a global planet extract looks like this:
Yes, these are 16 files. It's about time that we reconcile this. This issue should capture our requirements around a new data format. Off the top of my head:
- … (`osrm-contract`)
- … (`std::fstream` calls all over the place)

Platform dependence
We are kind of fortunate in the sense that we don't need to support complex nested data types. Almost all data we use is just a big array of something: usually 32-bit integers, no floating points and no pointers (thankfully). So we might get away with a much less complex solution (no alignment problems). What we want to make sure of is that we get the following right:
- … (`std::uint64_t`)

Existing solutions here:
Reading through Protobuf and Cap'n Proto, they don't strike me as immediate fits: both rely on schema-based generators. I would expect we could capture the very tightly scoped functionality from above in a single header-only library.
References
/cc @daniel-j-h @danpat @miccolis