KirillOsenkov / MSBuildStructuredLog

A logger for MSBuild that records a structured representation of executed targets, tasks, property and item values.
MIT License

[Feature, Perf] Audit and optimize the embedded files accessing code #714

Open JanKrivanek opened 7 months ago

JanKrivanek commented 7 months ago

Context

Inspired by the discussion here: https://github.com/KirillOsenkov/MSBuildStructuredLog/pull/711#discussion_r1382701666. Embedded files are currently greedily read fully into memory when the binlog is opened, even though they might never be accessed.

Gotchas

Embedded files are stored as a zip archive nested within the zip stream of a binlog; accessing them later on would require one of these options:

Each of those options has significant downsides. The optimal approach would need to be determined by testing.

Alternative

Redesigning the binlog format. E.g.: the compressed event stream and the files zip archive would be two independent streams within a single file. The file would have a few empty bytes preallocated at the beginning, and those would then be overwritten as the binlog is written.
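
A minimal sketch of that layout; the two stream writers (`WriteEventsStream`, `WriteFilesArchive`) are hypothetical placeholders, not existing binlog APIs:

```csharp
using System.IO;

class BinlogHeaderSketch
{
    // Hypothetical writers for the two independent streams; stand-ins only.
    static void WriteEventsStream(Stream output) { /* compressed event stream */ }
    static void WriteFilesArchive(Stream output) { /* zip archive of embedded files */ }

    static void Main()
    {
        using var file = new FileStream("msbuild.binlog", FileMode.Create, FileAccess.ReadWrite);
        using var writer = new BinaryWriter(file);

        writer.Write(0L);                 // preallocated placeholder: offset of the files archive
        WriteEventsStream(file);          // first independent stream: compressed events
        long filesStart = file.Position;
        WriteFilesArchive(file);          // second independent stream: embedded files

        file.Position = 0;                // seek back to the beginning...
        writer.Write(filesStart);         // ...and overwrite the placeholder with the real offset
    }
}
```

A reader could then seek straight to the files archive without decompressing the event stream first.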

Other possible alterations to the format could be made at the same time to optimize the compression ratio, enable quicker redaction workflows, etc. E.g.:

FYI @rokonec - he originally came up with some of these ideas

KirillOsenkov commented 7 months ago

Yes, I've been thinking about redesigning the format as well. I thought we could write out each file individually, like we write out individual strings, with no need for nested compression. Basically have a new kind of record, "File", that's literally two strings: one for the file path and the other for the content (uncompressed). Maybe also the timestamp and attributes (or other metadata?). The outer stream would compress everything anyway. We'd need to measure how the performance of this approach compares to the current one, and what the binlog file size would be. The upside of this approach is that if the build crashes or is interrupted in the middle, we still get all the records, strings and files embedded so far.
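
A rough sketch of writing such a record; the record id is an assumed placeholder, not an actual binlog record kind:

```csharp
using System.IO;

class FileRecordSketch
{
    const int FileRecordId = 42; // hypothetical record id, for illustration only

    // Writes a "File" record as two length-prefixed strings. The content goes in
    // uncompressed; the outer compressed stream that `writer` wraps (e.g. a
    // GZipStream) takes care of compression.
    static void WriteFileRecord(BinaryWriter writer, string path, string content)
    {
        writer.Write(FileRecordId);
        writer.Write(path);
        writer.Write(content);
    }
}
```

Timestamps and attributes would simply be extra fields appended to the same record.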

Another idea would be to unzip the file archive when opening the binlog, as usual, and then immediately re-zip each file individually, storing the compressed byte[] for each file. Unpack each file's text on demand. This would spike memory and CPU when reading the binlog initially, but would settle to lower memory consumption in the viewer after it's open. Also seems like the cheapest approach.
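
A sketch of what that per-file cache could look like, assuming GZipStream as the individual compressor (any per-file codec would do):

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

class CompressedFileCache
{
    private readonly byte[] compressed;

    // Re-compress a file's text individually right after the archive is unzipped.
    public CompressedFileCache(string text)
    {
        using var buffer = new MemoryStream();
        using (var gzip = new GZipStream(buffer, CompressionLevel.Fastest))
        using (var writer = new StreamWriter(gzip, Encoding.UTF8))
        {
            writer.Write(text);
        }
        compressed = buffer.ToArray(); // only this compressed byte[] stays resident
    }

    // Unpack the text on demand, e.g. when the user opens the file in the viewer.
    public string GetText()
    {
        using var gzip = new GZipStream(new MemoryStream(compressed), CompressionMode.Decompress);
        using var reader = new StreamReader(gzip, Encoding.UTF8);
        return reader.ReadToEnd();
    }
}
```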

Or we could store records, strings, name-value lists and files in four separate streams and concatenate them at the end. The downside is that if the build fails in the middle, we lose everything.
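
A minimal sketch of that multi-stream shape, assuming for simplicity that the four parts are buffered in memory (a real implementation might spill to temp files instead):

```csharp
using System.IO;

class MultiStreamSketch
{
    static void Main()
    {
        // Buffer each kind of data in its own stream during the build.
        var records = new MemoryStream();
        var strings = new MemoryStream();
        var nameValueLists = new MemoryStream();
        var files = new MemoryStream();

        // ... build runs, each stream is written independently ...

        // Concatenate the four parts into the final binlog at the end.
        // If the build dies before this point, nothing reaches the disk.
        using var output = File.Create("msbuild.binlog");
        foreach (var part in new[] { records, strings, nameValueLists, files })
        {
            part.Position = 0;
            part.CopyTo(output);
        }
    }
}
```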

ltrzesniewski commented 7 months ago

I don't know the binlog file format, but since it's a gzipped stream, may I suggest an idea: maybe you could use the fact that two concatenated gzip streams form a valid gzip stream (see RFC 1952, section 2.2).

The main problem with this approach is that I don't think GZipStream will be helpful here (I use my own stream classes which wrap zlib for stuff like this). You'd need a stream which stops between concatenated gzip members when calling Read, in order to be able to retrieve the correct offset in the binlog (I'm not sure GZipStream can do this). I suppose writing would be less of a problem: you could just create a new GZipStream instance for each part of the binlog. Also, GZipStream was unable to read concatenated gzip streams on the .NET Framework; it would just stop after the first part (this has been fixed in .NET Core).
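
For the writing side, a minimal sketch of that idea; each call finishes one gzip member, and the file offsets between members stay meaningful:

```csharp
using System.IO;
using System.IO.Compression;

class ConcatenatedGzipSketch
{
    // Per RFC 1952 section 2.2, concatenated gzip members form a valid gzip stream.
    // leaveOpen: true keeps the underlying file open between members.
    static void WritePart(Stream output, byte[] partData)
    {
        using var gzip = new GZipStream(output, CompressionLevel.Optimal, leaveOpen: true);
        gzip.Write(partData, 0, partData.Length);
    } // disposing the GZipStream finalizes this gzip member

    static void Main()
    {
        using var file = File.Create("multi-member.gz");
        WritePart(file, new byte[] { 1, 2, 3 });
        long secondMemberOffset = file.Position; // a seekable offset for the next part
        WritePart(file, new byte[] { 4, 5, 6 });
    }
}
```

Reading with offset tracking remains the hard part, as noted above: a plain GZipStream (on .NET Core) decompresses across member boundaries without reporting where one member ends.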

Here are my two cents, hoping they can be useful 🙂

KirillOsenkov commented 7 months ago

I made a refactoring that at least moves the logic that reads embedded files to a background thread and reports the time it took in the bottom info bar after the binlog is loaded.
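
Roughly this shape, with hypothetical method names standing in for the actual viewer code:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class BackgroundLoadSketch
{
    // Hypothetical stand-ins for the real viewer methods.
    static void ReadEmbeddedFiles() { /* unzip the embedded file archive */ }
    static void ReportStatus(string text) => Console.WriteLine(text);

    static async Task LoadEmbeddedFilesAsync()
    {
        var stopwatch = Stopwatch.StartNew();
        await Task.Run(ReadEmbeddedFiles);  // keep the UI thread free while reading
        stopwatch.Stop();
        ReportStatus($"Embedded files read in {stopwatch.Elapsed.TotalSeconds:F1}s");
    }
}
```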