LunarG / gfxreconstruct

Graphics API Capture and Replay Tools for Reconstructing Graphics Application Behavior
https://vulkan.lunarg.com/doc/sdk/latest/linux/capture_tools.html
MIT License

Convert output data into several files #896

Open locke-lunarg opened 1 year ago

locke-lunarg commented 1 year ago

Sometimes, converted files can be bigger than 10 GB. They take a long time to process, and the output files are too big to open. Maybe we could split the output into several files, for example one file per 1000 frames.

andrew-lunarg commented 1 year ago

I think it's a good idea, but it also seems possible to do the chunking into different files outside of Convert. I see several options:

  1. New file every k lines.
  2. New file every k frames.
  3. New file every k MB.
  4. New file every k seconds of Convert runtime.
  5. A combination of the above (have max values for any or all of them).

To cut by lines, the split tool can be used on macOS, Linux, and WSL on Windows. This chunks the output into 4000-line pieces (the trailing dot in the last argument is the output filename prefix, so the pieces come out as converted.jsonl.00, converted.jsonl.01, and so on):

    gfxrecon-convert --output stdout capture.gfxr | split -l 4000 -d - converted.jsonl.

split can also chunk by bytes, but I don't think it respects line boundaries when doing so.
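
As a possible refinement (not verified here), GNU split's -C/--line-bytes option chunks by an approximate byte size while still keeping whole lines:

    gfxrecon-convert --output stdout capture.gfxr | split -C 100M -d - converted.jsonl.

Whether -C is available on non-GNU split implementations (e.g. the macOS one) would need checking.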

A script in Python could do the chunking by k frames by matching on the present calls. This example doesn't quite do that, but it could be extended to: https://github.com/LunarG/gfxreconstruct/tree/dev/tools/convert#truncate-at-match. A rough sketch of the idea follows.
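
Here is an untested shell/awk sketch that starts a new output file every k frames, assuming each frame boundary shows up as a line mentioning vkQueuePresentKHR and that a converted.N.jsonl naming scheme is acceptable:

    # Untested sketch: one output file per 1000 frames, counted by present calls.
    gfxrecon-convert --output stdout capture.gfxr |
      awk -v k=1000 -v part=0 '
        { print > ("converted." part ".jsonl") }
        /vkQueuePresentKHR/ && ++frames % k == 0 {
          close("converted." part ".jsonl")
          part++
        }
      '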

For cases where you know roughly which part of the file you are interested in, you can use the line-oriented tools head and tail to cut the output down, or search for the frame boundaries and kill Convert after the right number of them with a simple script.
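
For example, a sketch of slicing a window out of the stream with tail and head (the line counts are placeholders):

    # Skip the first 2,000,000 lines, then keep the next 100,000.
    gfxrecon-convert --output stdout capture.gfxr | tail -n +2000001 | head -n 100000 > slice.jsonl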

andrew-lunarg commented 1 year ago

Another measure for large conversions is to pipe the output through a grep for the things you are interested in, e.g. a handful of function names and some resource handle IDs. Alternatively, pipe it through a series of grep -v stages to screen out lines that you know you aren't interested in for the debugging task at hand. Lines with massive arrays can be caught if they have distinctive structure member names that an inverse grep can match on.
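
A sketch of both approaches (the function names and the handle value are placeholders, not anything specific to the converted format):

    # Keep only lines mentioning calls or handles of interest.
    gfxrecon-convert --output stdout capture.gfxr |
      grep -E 'vkCmdDrawIndexed|vkQueueSubmit|4178' > filtered.jsonl

    # Or screen out calls that are just noise for the task at hand.
    gfxrecon-convert --output stdout capture.gfxr |
      grep -v -e vkGetPhysicalDeviceProperties -e vkCmdPushConstants > filtered.jsonl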

perim commented 1 year ago

Just compress the output? Lots of tools can operate directly on compressed files, e.g. zgrep. These files compress really well and become much more manageable.
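
For instance, assuming the conversion has been written out as converted.jsonl.gz:

    # Search the compressed output without unpacking it first.
    zgrep vkCmdDrawIndexed converted.jsonl.gz

    # Or page through it.
    zless converted.jsonl.gz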

andrew-lunarg commented 1 year ago

Thanks for the zgrep tip. If I ever knew about it, I'd forgotten.

Writing directly to a compressed file is viable. I just ran Convert on a 43 GB binary capture, piping the output through gzip before it hit the drive.
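
The pipeline was along these lines (a sketch, with the same --output stdout form used earlier in the thread):

    gfxrecon-convert --output stdout capture.gfxr | gzip > converted.jsonl.gz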

locke-lunarg commented 1 year ago

I just opened a 3 GB convert file (.jsonl) with VSCode and re-saved it. It became 6 MB. I don't know what VSCode did; possibly some kind of compression. I think compression could help here.

bradgrantham-lunarg commented 1 year ago

> I just opened a 3 GB convert file (.jsonl) with VSCode and re-saved it. It became 6 MB.

Loading and saving in VSCode here didn't do anything to the file (the saved file was identical to the original). I could believe 10x with compression, but 500x is starting to be unbelievable unless it's the exact same long Vulkan command over and over. Is it still readable in a text editor or with "head"? What does "file" say the file is? Can we maybe examine a copy of the original and the saved version from VSCode?

andrew-lunarg commented 1 year ago

I was just experimenting too, on a 1.5 GB JSON Lines file. I didn't want to do what it asked in order to finish the test. (Screenshot of the prompt omitted.)

vim opens it in about 5 seconds using 1.7 GB of memory.

locke-lunarg commented 1 year ago

Sorry, my bad. The data size didn't change; VSCode just took a while to save it. The new file's size increased slowly until it reached the original size.

andrew-lunarg commented 1 year ago

export has a --file-per-frame option that will work for Vulkan and D3D12 once integrated.

andrew-lunarg commented 1 year ago

@locke-lunarg #1050 has just been merged, adding an option for one frame per file. Is that enough for you, or do you still want k frames per file as well?