JanKrivanek opened 7 months ago
Yes, I've been thinking about redesigning the format as well. I thought we could write out each file individually like we write out individual strings, and no need for nested compression. Basically have a new kind of record "File" that's literally two strings, one for file path and the other is for the content (uncompressed). Maybe also the timestamp and attributes (or other metadata?). The outer stream would compress anyway. Need to measure how the performance of this approach would compare to the current one, and what the binlog file size would be. The upside of this approach is that if the build crashes or is interrupted in the middle, we still get all the records, strings and files embedded so far.
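A minimal sketch of what such a "File" record could look like on the wire, assuming a length-prefixed string encoding. The record-type tag, field layout, and function names below are illustrative assumptions, not the actual binlog format:

```python
import io
import struct

FILE_RECORD = 0x20  # assumed record-type tag, not the real binlog value

def write_string(stream, s):
    # length-prefixed UTF-8 string, similar in spirit to how binlog strings are stored
    data = s.encode("utf-8")
    stream.write(struct.pack("<I", len(data)))
    stream.write(data)

def write_file_record(stream, path, content, timestamp=0):
    # two strings (path + uncompressed content) plus optional metadata;
    # the outer compressed stream takes care of compression
    stream.write(struct.pack("<B", FILE_RECORD))
    write_string(stream, path)
    write_string(stream, content)
    stream.write(struct.pack("<q", timestamp))

buf = io.BytesIO()
write_file_record(buf, "src/app.csproj", "<Project />")
```

Because each record is self-delimiting, a reader scanning a truncated stream can still recover every complete record written before a crash.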
Another idea would be to unzip the file archive when opening the binlog, as usual, and then immediately re-zip each file individually, and store the compressed byte[] for each file. Unpack each file text on demand. This would spike memory and CPU when reading the binlog initially, but would settle to lower memory consumption of the viewer after it's open. Also seems like the cheapest approach.
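The "re-zip each file individually" idea could be sketched as follows; the class and method names are hypothetical, but the shape is: pay CPU once at load time, hold only compressed bytes, and inflate lazily on access:

```python
import zlib

class LazyFileStore:
    """Holds each embedded file as an individually compressed byte[]."""

    def __init__(self):
        self._compressed = {}  # path -> compressed bytes

    def add(self, path, text):
        # done once while opening the binlog: CPU/memory spike here
        self._compressed[path] = zlib.compress(text.encode("utf-8"), 6)

    def get_text(self, path):
        # inflate only when the viewer actually opens this file
        return zlib.decompress(self._compressed[path]).decode("utf-8")

store = LazyFileStore()
store.add("build.log", "warning MSB3277\n" * 1000)
```

After loading, steady-state memory is the sum of the compressed sizes rather than the decompressed texts, which matches the "settles to lower memory" expectation.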
Or we could store records, strings, name-value lists and files in four separate streams and concatenate them at the end. The downside is if the build fails in the middle, we lose everything.
I don't know the binlog file format, but since it's a gzipped stream I may suggest an idea: maybe you could use the fact that two concatenated gzip streams are a valid gzip stream (see RFC 1952, section 2.2).
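The RFC 1952 multi-member property is easy to verify; for example, in Python (whose `gzip.decompress` reads all members):

```python
import gzip

# Per RFC 1952 section 2.2, a gzip file may contain multiple members, so
# concatenating two complete gzip streams yields a valid gzip stream whose
# decompressed output is the concatenation of the parts.
part1 = gzip.compress(b"event records...")
part2 = gzip.compress(b"embedded file contents")
combined = part1 + part2

assert gzip.decompress(combined) == b"event records..." + b"embedded file contents"
```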
When writing the binlog, you end the current gzip stream right before writing the embedded file, then write the embedded file in its own gzip stream (use the `Z_FINISH` flush level of zlib).
When reading the binlog, you can remember the byte offset in the binlog file for each embedded file, then start decompressing from there when you'll have to access it.
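A rough sketch of that write/read scheme, again in Python for illustration: each embedded file becomes its own complete gzip member, its starting byte offset is recorded, and reading later decompresses just that one member from the offset (a single-member decompressor such as `zlib.decompressobj` in gzip mode stops at the member boundary):

```python
import gzip
import io
import zlib

binlog = io.BytesIO()
offsets = {}  # name -> byte offset of that member in the file

def append_member(name, payload):
    # finish the previous member, remember where this one starts,
    # then write it as a complete gzip stream (the Z_FINISH idea)
    offsets[name] = binlog.tell()
    binlog.write(gzip.compress(payload))

append_member("events", b"...event stream...")
append_member("a.csproj", b"<Project />")

def read_member(name):
    # decompress exactly one gzip member starting at the recorded offset;
    # wbits=31 means "expect a gzip header, single member"
    raw = binlog.getvalue()[offsets[name]:]
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    return d.decompress(raw)
```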
The main problem with this approach is that I don't think `GZipStream` will be helpful with this (I use my own stream classes which wrap zlib for stuff like this). You'd need a stream which would stop between concatenated gzip streams when calling `Read` in order to be able to retrieve the correct offset in the binlog (I'm not sure if `GZipStream` can do this). I suppose writing would be less of a problem: you could just create a new `GZipStream` instance for each part of the binlog. Also, `GZipStream` was unable to read concatenated gzip streams on the .NET Framework; it would just stop after the first part (this has been fixed in .NET Core).
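For reference, one way to recover per-member offsets without a custom stream class is to use a single-member decompressor and look at the bytes it did not consume; Python's `zlib.decompressobj` exposes these as `unused_data`, which gives the next member's offset. The .NET equivalent would need similar bookkeeping:

```python
import gzip
import zlib

# mtime=0 just makes the compressed bytes deterministic for this demo
combined = gzip.compress(b"first", mtime=0) + gzip.compress(b"second", mtime=0)

def split_members(data):
    """Return (offset, payload) for each gzip member in a concatenated stream."""
    parts, offset = [], 0
    while offset < len(data):
        d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # one gzip member
        payload = d.decompress(data[offset:])
        parts.append((offset, payload))
        # everything not in unused_data was consumed by this member
        offset += len(data) - offset - len(d.unused_data)
    return parts

members = split_members(combined)
```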
Here are my two cents, hoping they can be useful 🙂
I made a refactoring that at least moves the logic that reads embedded files to a background thread and reports the time it took in the bottom info bar after the binlog is loaded.
Context
Inspired by the discussion here: https://github.com/KirillOsenkov/MSBuildStructuredLog/pull/711#discussion_r1382701666 Embedded files are currently greedily read fully into memory while the binlog is being opened, even though they might never be accessed.
Gotchas
Embedded files are a zip archive nested within the zip stream of a binlog; accessing them later on would require one of these options:
Each of those options has significant downsides. The optimal approach would need to be tested.
Alternative
Redesigning the binlog format. E.g.: the compressed events stream and the files zip archive would be two independent streams within a single file. The file would have a few empty bytes preallocated at the beginning, which would then be overwritten as the binlog is written:
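A sketch of that layout, with an assumed fixed-size header of offset/length pairs (the field layout is illustrative, not a proposed spec): reserve the header up front, append the two independent compressed streams, then seek back and fill the header in:

```python
import gzip
import io
import struct

# assumed header: events offset/length, files offset/length (4 x uint64)
HEADER = struct.Struct("<QQQQ")

f = io.BytesIO()
f.write(b"\x00" * HEADER.size)  # preallocated empty bytes at the beginning

events_off = f.tell()
f.write(gzip.compress(b"...event records..."))
events_len = f.tell() - events_off

files_off = f.tell()
f.write(gzip.compress(b"...files zip archive..."))
files_len = f.tell() - files_off

f.seek(0)  # overwrite the placeholder once both streams are complete
f.write(HEADER.pack(events_off, events_len, files_off, files_len))
```

A reader can then jump straight to either stream; the trade-off noted above remains, since an interrupted build leaves the header zeroed.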
Other possible alterations to the format could be made at the same time to optimize the compression ratio, enable quicker redaction workflows, etc. E.g.:
FYI @rokonec - he was exploring some of these ideas.