KirillOsenkov / MSBuildStructuredLog

A logger for MSBuild that records a structured representation of executed targets, tasks, property and item values.
MIT License

[Feature, Perf] Audit and optimize the embedded files accessing code #714

Open JanKrivanek opened 7 months ago

JanKrivanek commented 7 months ago

Context

Inspired by the discussion here: https://github.com/KirillOsenkov/MSBuildStructuredLog/pull/711#discussion_r1382701666. Embedded files are currently greedily read fully into memory when the binlog is opened, even though they might never be accessed.

Gotchas

Embedded files are stored as a zip archive nested within the zip stream of a binlog; accessing them later on would require one of these options:

Each of those options has significant downsides. The optimal approach would need to be determined by testing.

Alternative

Redesigning the binlog format. E.g.: the compressed event stream and the files zip archive would be two independent streams within a single file. The file would have a few empty bytes preallocated at the beginning, and those would then be overwritten as the binlog is written.
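
A minimal sketch of that layout; the two stream writers (`WriteEventsStream`, `WriteFilesArchive`) are hypothetical placeholders, not existing binlog APIs:

```csharp
using System.IO;

class BinlogHeaderSketch
{
    // Hypothetical writers for the two independent streams; stand-ins only.
    static void WriteEventsStream(Stream output) { /* compressed event stream */ }
    static void WriteFilesArchive(Stream output) { /* zip archive of embedded files */ }

    static void Main()
    {
        using var file = new FileStream("msbuild.binlog", FileMode.Create, FileAccess.ReadWrite);
        using var writer = new BinaryWriter(file);

        writer.Write(0L);                 // preallocated placeholder: offset of the files archive
        WriteEventsStream(file);          // first independent stream: compressed events
        long filesStart = file.Position;
        WriteFilesArchive(file);          // second independent stream: embedded files

        file.Position = 0;                // seek back to the beginning...
        writer.Write(filesStart);         // ...and overwrite the placeholder with the real offset
    }
}
```

A reader could then seek straight to the files archive without decompressing the event stream first.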

Other possible alterations to the format could be made at the same time to optimize the compression ratio, enable quicker redaction workflows, etc. E.g.:

FYI @rokonec - he originally came up with some of these ideas

KirillOsenkov commented 7 months ago

Yes, I've been thinking about redesigning the format as well. I thought we could write out each file individually, like we write out individual strings, with no need for nested compression. Basically have a new kind of record, "File", that's literally two strings: one for the file path and the other for the content (uncompressed). Maybe also the timestamp and attributes (or other metadata?). The outer stream would compress everything anyway. We'd need to measure how the performance of this approach compares to the current one, and what the binlog file size would be. The upside of this approach is that if the build crashes or is interrupted in the middle, we still get all the records, strings and files embedded so far.
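
A rough sketch of writing such a record; the record id is an assumed placeholder, not an actual binlog record kind:

```csharp
using System.IO;

class FileRecordSketch
{
    const int FileRecordId = 42; // hypothetical record id, for illustration only

    // Writes a "File" record as two length-prefixed strings. The content goes in
    // uncompressed; the outer compressed stream that `writer` wraps (e.g. a
    // GZipStream) takes care of compression.
    static void WriteFileRecord(BinaryWriter writer, string path, string content)
    {
        writer.Write(FileRecordId);
        writer.Write(path);
        writer.Write(content);
    }
}
```

Timestamps and attributes would simply be extra fields appended to the same record.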

Another idea would be to unzip the file archive when opening the binlog, as usual, and then immediately re-zip each file individually, storing the compressed byte[] for each file. Unpack each file's text on demand. This would spike memory and CPU when reading the binlog initially, but would settle to lower memory consumption in the viewer after it's open. Also seems like the cheapest approach.
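
A sketch of what that per-file cache could look like, assuming GZipStream as the individual compressor (any per-file codec would do):

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

class CompressedFileCache
{
    private readonly byte[] compressed;

    // Re-compress a file's text individually right after the archive is unzipped.
    public CompressedFileCache(string text)
    {
        using var buffer = new MemoryStream();
        using (var gzip = new GZipStream(buffer, CompressionLevel.Fastest))
        using (var writer = new StreamWriter(gzip, Encoding.UTF8))
        {
            writer.Write(text);
        }
        compressed = buffer.ToArray(); // only this compressed byte[] stays resident
    }

    // Unpack the text on demand, e.g. when the user opens the file in the viewer.
    public string GetText()
    {
        using var gzip = new GZipStream(new MemoryStream(compressed), CompressionMode.Decompress);
        using var reader = new StreamReader(gzip, Encoding.UTF8);
        return reader.ReadToEnd();
    }
}
```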

Or we could store records, strings, name-value lists and files in four separate streams and concatenate them at the end. The downside is that if the build fails in the middle, we lose everything.
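
A minimal sketch of that multi-stream shape, assuming for simplicity that the four parts are buffered in memory (a real implementation might spill to temp files instead):

```csharp
using System.IO;

class MultiStreamSketch
{
    static void Main()
    {
        // Buffer each kind of data in its own stream during the build.
        var records = new MemoryStream();
        var strings = new MemoryStream();
        var nameValueLists = new MemoryStream();
        var files = new MemoryStream();

        // ... build runs, each stream is written independently ...

        // Concatenate the four parts into the final binlog at the end.
        // If the build dies before this point, nothing reaches the disk.
        using var output = File.Create("msbuild.binlog");
        foreach (var part in new[] { records, strings, nameValueLists, files })
        {
            part.Position = 0;
            part.CopyTo(output);
        }
    }
}
```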

ltrzesniewski commented 7 months ago

I don't know the binlog file format, but since it's a gzipped stream, may I suggest an idea: maybe you could use the fact that two concatenated gzip streams form a valid gzip stream (see RFC 1952, section 2.2).

The main problem with this approach is that I don't think GZipStream will be helpful here (I use my own stream classes which wrap zlib for stuff like this). You'd need a stream which stops between concatenated gzip members when calling Read, in order to be able to retrieve the correct offset in the binlog (I'm not sure GZipStream can do this). I suppose writing would be less of a problem: you could just create a new GZipStream instance for each part of the binlog. Also, GZipStream was unable to read concatenated gzip streams on the .NET Framework; it would just stop after the first part (this has been fixed in .NET Core).
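
For the writing side, a minimal sketch of that idea; each call finishes one gzip member, and the file offsets between members stay meaningful:

```csharp
using System.IO;
using System.IO.Compression;

class ConcatenatedGzipSketch
{
    // Per RFC 1952 section 2.2, concatenated gzip members form a valid gzip stream.
    // leaveOpen: true keeps the underlying file open between members.
    static void WritePart(Stream output, byte[] partData)
    {
        using var gzip = new GZipStream(output, CompressionLevel.Optimal, leaveOpen: true);
        gzip.Write(partData, 0, partData.Length);
    } // disposing the GZipStream finalizes this gzip member

    static void Main()
    {
        using var file = File.Create("multi-member.gz");
        WritePart(file, new byte[] { 1, 2, 3 });
        long secondMemberOffset = file.Position; // a seekable offset for the next part
        WritePart(file, new byte[] { 4, 5, 6 });
    }
}
```

Reading with offset tracking remains the hard part, as noted above: a plain GZipStream (on .NET Core) decompresses across member boundaries without reporting where one member ends.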

Here are my two cents, hoping they can be useful 🙂

KirillOsenkov commented 7 months ago

I made a refactoring that at least moves the logic that reads embedded files to a background thread and reports the time it took in the bottom info bar after the binlog is loaded.
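
Roughly this shape, with hypothetical method names standing in for the actual viewer code:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class BackgroundLoadSketch
{
    // Hypothetical stand-ins for the real viewer methods.
    static void ReadEmbeddedFiles() { /* unzip the embedded file archive */ }
    static void ReportStatus(string text) => Console.WriteLine(text);

    static async Task LoadEmbeddedFilesAsync()
    {
        var stopwatch = Stopwatch.StartNew();
        await Task.Run(ReadEmbeddedFiles);  // keep the UI thread free while reading
        stopwatch.Stop();
        ReportStatus($"Embedded files read in {stopwatch.Elapsed.TotalSeconds:F1}s");
    }
}
```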