The-OpenROAD-Project / OpenROAD

OpenROAD's unified application implementing an RTL-to-GDS Flow. Documentation at https://openroad.readthedocs.io/en/latest/
https://theopenroadproject.org/
BSD 3-Clause "New" or "Revised" License

Reduce compressed size of files in results/ folder, especially .odb files #4241

Closed oharboe closed 2 months ago

oharboe commented 6 months ago

Description

When working with large, long-running designs, compressed .odb file sizes start to matter, especially when creating artifacts that are uploaded to and downloaded from artifact servers, such as with Bazel.

I take it as read that whatever artifact system is being used has compression built in, so it is not the raw .odb file size, but the compressed .odb file size that needs to be reduced.

As far as this feature request is concerned, the file size on disk is not a major concern; upload/download speed is.

Suggested Solution

Reduce compressed size of .odb files.

Additional Context

No response

QuantamHD commented 6 months ago

We could theoretically add a gz option to the writer.

oharboe commented 6 months ago

We could theoretically add a gz option to the writer.

Surely Bazel, or whatever artifact system is being used, has compression already...

QuantamHD commented 6 months ago

Looks like not by default: https://bazel.build/reference/command-line-reference#flag--remote_cache_compression. And there's zero chance that CMake and/or make are doing any kind of artifact compression.


https://www.buildbuddy.io/blog/how-bazel-5-0-makes-your-builds-faster/

oharboe commented 6 months ago

Google Cloud buckets can be configured to do compression, so this might not even be the job of Bazel.

Certainly, before changing ORFS/OpenROAD, we should try to find out if this problem is already solved elsewhere for the important use-cases.

A long time ago, I tried to reduce disk space for ORFS by using symbolic links instead of copying .odb files, but make doesn't play nice with symbolic links.

For my wafer-thin Bazel layer on top of ORFS, I only keep one .odb file per stage (floorplan, place, cts, route, final).

QuantamHD commented 6 months ago

My understanding of that feature is that it reduces at-rest storage costs. GCS decompresses on the fly over the wire, so you're paying the full network bandwidth cost. https://cloud.google.com/storage/docs/transcoding

What bottleneck are you actually running into?

More details would help narrow the solution space. There aren't a lot of easy size optimizations we could make in odb beyond adding compression.

oharboe commented 6 months ago

I read a bit about Bazel. It supports compression, but also de-duplication. So if .odb files were reworked to play well with deduplication and Bazel's built-in compression, there is significant savings potential.

It requires significant digging to figure out what this would mean... Split into multiple files so that data shared between .odb files can be detected as duplicate by artifact systems?

git, too, has an artifact system, which compresses by default.

QuantamHD commented 6 months ago

I think the reality is that would be a lot of work and complexity for gains on very specific systems. I don't think the juice is worth the squeeze.

What are you primarily trying to optimize? Storage cost or fetch speed?

I would also advise against storing odb files or other artifact-like things in git. There's no great solution for long-term caching of these types of files, but putting them in git will store them in your history forever and will quickly run you into platform repo limits on git hosting websites.

oharboe commented 6 months ago

I take it as a default assumption that it is unnecessary to add compression to ORFS and that the job of ORFS is to enable existing compression (wherever it comes from) to work effectively.

Regarding Google Cloud: gsutil has a -Z and a -J compression option. The first compresses on the server side; the second compresses on the client (over the wire). W.r.t. ORFS, I assume that Bazel has a way to exploit this.
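As an illustration only (the bucket name and file path below are made up), compressing over the wire with -J would look something like:

gsutil cp -J ./results/sky130hd/jpeg/base/5_route.odb gs://my-artifact-bucket/jpeg/5_route.odb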

QuantamHD commented 6 months ago

Yeah, I think Bazel is pretty agnostic to storage. I think the only way you get compression is with that flag I mentioned previously. Also, my proposal is not to add compression to ORFS, but inside OpenROAD, so that if you run

write_db mydb.odb.gz

it will be streamed into a gzip. That would let you interact with pretty much any build system.
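A minimal sketch of how the writer side could do that with Boost.Iostreams (the helper name is made up and the real write_db plumbing may differ; the idea is just to push a gzip filter in front of the file when the name ends in .gz):

#include <fstream>
#include <string>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/filtering_stream.hpp>

// Hypothetical helper: open the output for write_db, adding a gzip filter
// when the requested file name ends in ".gz".  The existing serialization
// code would then write into `out` instead of a raw std::ofstream.
void openDbOutput(const std::string& path,
                  std::ofstream& file,
                  boost::iostreams::filtering_ostream& out)
{
  file.open(path, std::ios::binary);
  if (path.size() > 3 && path.compare(path.size() - 3, 3, ".gz") == 0) {
    out.push(boost::iostreams::gzip_compressor());
  }
  out.push(file);
}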

oharboe commented 6 months ago

I think the reality is that would be a lot of work and complexity for gains on very specific systems. I don't think the juice is worth the squeeze.

What are you primarily trying to optimize? Storage cost or fetch speed?

I would also advise against storing odb files or other artifact-like things in git. There's no great solution for long-term caching of these types of files, but putting them in git will store them in your history forever and will quickly run you into platform repo limits on git hosting websites.

I don't propose to store .odb files in git. I was just using it as an example that one can expect artifact systems to have some sort of built in compression for transmission over the wire.

My main concern is transmission speeds.

Storage costs I would mainly manage with pruning old builds.

oharboe commented 6 months ago

Yeah, I think Bazel is pretty agnostic to storage. I think the only way you get compression is with that flag I mentioned previously.

Nice find! I will experiment with it; it can be added to the .bazelrc under version control in my project, alongside WORKSPACE.bazel.
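For the record, a minimal .bazelrc sketch (assuming Bazel 5.0+ and a gRPC remote cache that accepts compressed blobs):

# .bazelrc
# Compress cache blobs with zstd before uploading them to the remote cache.
build --remote_cache_compression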

Also, my proposal is not to add compression to ORFS, but inside OpenROAD, so that if you run

write_db mydb.odb.gz

it will be streamed into a gzip. That would let you interact with pretty much any build system.

Perhaps someone will benefit from write_db mydb.odb.gz, but it doesn't address my concern in this feature request, so I consider it a separate feature request.

rovinski commented 6 months ago

OR already has support for GZip on LEF, DEF, and SPEF. I don't think it would be too much to ask for ODB as well.

oharboe commented 6 months ago

OR already has support for GZip on LEF, DEF, and SPEF. I don't think it would be too much to ask for ODB as well.

Sure, but not relevant to this feature request. This feature request is about reducing the compressed size. Compression happens outside of OpenROAD already.

rovinski commented 6 months ago

Ah, I did not read the edited original post. I mean, I don't really know what to do there, other than turning up the compression effort on GZip, which of course has computation-time implications.

oharboe commented 6 months ago

Ah, I did not read the edited original post. I mean, I don't really know what to do there, other than turning up the compression effort on GZip, which of course has computation-time implications.

There was some talk of a scheme to store repeating structures (introduced in floorplanning, filling, etc.) more efficiently.

maliberty commented 6 months ago

A first step would be to have some reporting of the sizes of the various sections of the file. My guess is that detailed routing will dominate, but it would be good to verify that. PDN is another possibility.

QuantamHD commented 6 months ago

I did some experiments; we can reduce the odb size by ~50% if we move to a VLQ encoding for uint32_t in dbStream.h.

see

void dbOStream::writeUnsignedVlq(uint32_t c)
{
  if (c == 0) {
    _f.put(0);
    return;
  }

  // Each octet stores 7 payload bits plus a
  // continuation bit in the MSB slot.
  const char mask_7bit = 0b01111111;
  while (c != 0) {
    char vlq = c & mask_7bit;
    c >>= 7;
    if (c != 0) {
      vlq |= 0b10000000;  // more octets follow
    }
    _f.put(vlq);
  }
}
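For symmetry, a matching decoder could look roughly like this (a sketch only; readUnsignedVlq is not existing API and dbIStream's _f member is assumed to wrap a std::istream):

uint32_t dbIStream::readUnsignedVlq()
{
  uint32_t value = 0;
  int shift = 0;
  char octet = 0;
  do {
    _f.get(octet);
    // Low 7 bits are payload; the MSB says whether another octet follows.
    value |= static_cast<uint32_t>(octet & 0b01111111) << shift;
    shift += 7;
  } while (octet & 0b10000000);
  return value;
}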

On Jpeg sky130 the routed odb goes from 360 MB to 180 MB with VLQs. If you use ZSTD compression on the file it shrinks to about 37 MB.

oharboe commented 6 months ago

On Jpeg sky130 the routed odb goes from 360 MB to 180 MB. ZSTD compression is about 37 MB.

Nice!

What was the difference in compressed size?

QuantamHD commented 6 months ago

1-2MB

oharboe commented 6 months ago

1-2MB

I see...

Not to be a party pooper, but then it looks like it is better to just leave this to generic compression than to make the OpenROAD code more complicated?

Unless this is simple and fast, in which case it is a win for the uncompressed case, i.e. the normal ORFS flow.

QuantamHD commented 6 months ago

Yeah, I think adding a streaming zstd encoder by default makes the most sense. It's already in Boost, so it'll be an easy add. It's pretty fast: 500+ MB/s encode speed and 2500 MB/s read.
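A minimal sketch of the zstd variant with Boost.Iostreams (zstd_compressor lives in <boost/iostreams/filter/zstd.hpp> and needs Boost >= 1.72 built with zstd support; the helper name is made up):

#include <fstream>
#include <string>
#include <boost/iostreams/filter/zstd.hpp>
#include <boost/iostreams/filtering_stream.hpp>

// Same idea as the gzip sketch above, but always streaming through zstd.
void openZstdDbOutput(const std::string& path,
                      std::ofstream& file,
                      boost::iostreams::filtering_ostream& out)
{
  file.open(path, std::ios::binary);
  out.push(boost::iostreams::zstd_compressor());
  out.push(file);
}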

oharboe commented 6 months ago

Yeah, I think adding a streaming zstd encoder by default makes the most sense. It's already in Boost, so it'll be an easy add. It's pretty fast: 500+ MB/s encode speed and 2500 MB/s read.

I see. Point of order, to be nit-picking: this feature request is about reducing the compressed size.

maliberty commented 6 months ago

I looked a bit over the weekend and the most obvious place is the dbGCellGrid, which could be more efficient. It's hard to predict how much that will affect the compressed size, though, as the compression might already be getting those gains.

rovinski commented 6 months ago

Maybe there could be a compressed structure for arrays of fill cells? Something that just stores {x_origin, y_origin, x_pitch, y_pitch, x_count, y_count} plus some way of incrementing the instance name.
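As an illustration only, such a record might look something like this (the struct name and fields are hypothetical, following the tuple above):

#include <string>

// Hypothetical compact representation of a uniform array of fill cells.
struct dbFillCellArray
{
  int x_origin;   // lower-left corner of the array
  int y_origin;
  int x_pitch;    // spacing between adjacent fill instances
  int y_pitch;
  int x_count;    // number of columns in the array
  int y_count;    // number of rows in the array
  std::string name_prefix;  // instance names expand to <prefix>_<col>_<row>
};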

oharboe commented 6 months ago

Maybe there could be a compressed structure for arrays of fill cells? Something that just stores {x_origin, y_origin, x_pitch, y_pitch, x_count, y_count} plus some way of incrementing the instance name.

Would that be better than what zstd can do now?

rovinski commented 6 months ago

Only way to find out is to try 🤷‍♂️ Any manually coded scheme should beat a dictionary-based encoder, but it's a question of how well.

oharboe commented 5 months ago

Just as a data point: bsdiff is not practical; it took hours.

$ bsdiff 3_place.odb 4_cts.odb patchfile
$ zip xx patchfile 
updating: patchfile (deflated 0%)
$ unzip -lv xx.zip 
Archive:  xx.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
1820437880  Defl:N 203521160  89% 2023-12-31 15:14 d9f94a40  3_place.odb
1845729152  Defl:N 209114483  89% 2023-12-29 02:07 9859b31c  4_cts.odb
77743457  Defl:N 77694175   0% 2023-12-31 20:47 ffdfc498  patchfile
--------          -------  ---                            -------
3743910489         490329818  87%                            3 files

The difference in size between 3_place and 4_cts is 1845729152 - 1820437880 = 25291272 bytes ≈ 25 MB.

The bsdiff patch is 77743457 bytes ≈ 77 MB.

The compressed size of 4_cts is ca. 185 MB.

Preliminary conclusion: a bsdiff patch can roughly halve the compressed size to transfer for 4_cts (ca. 77 MB patch vs. ca. 185 MB compressed file).

Given the disadvantages and complications of a binary diff approach, this isn't particularly promising.

oharboe commented 4 months ago

@maliberty An idea for macro placement specifically: write out a placement.tcl that fully describes the result of the macro placement .odb file, and read it back in during the next step. This completely eliminates the need for an .odb file for macro placement and also makes it easier to see what is going on in the macro placement stage, as the placement.tcl doubles as a report.

This idea alone isn't particularly exciting unless it is more broadly applicable, though. Are there other stages that can similarly be described by a small .tcl file?

maliberty commented 4 months ago

Not really. Even macro placement currently assigns std cell locations as well so it wouldn't be that small.

oharboe commented 2 months ago

@maliberty Close?

I think we have covered the difference in significance between the compressed and uncompressed sizes of .odb files, and that this is now well understood: uncompressed size affects runtime, compressed size affects network speed, and disk size isn't particularly important.

There is no specific action or idea left here; compressed size can always be improved...

maliberty commented 2 months ago

I'm fine to close. The gcell data PR will help the uncompressed size.