Slow merging of polygons from width check edge pairs

tvt173 commented 1 year ago

Hi @klayoutmatthias ,

I was running into a performance issue on a large layout running a width check, and then using the resultant polygons formed from the edge pairs (essentially I want to check if other polygons are within those polygons or not).

At first, I thought this may be an issue with the width check itself, but after breaking it down, I found that the width check is actually very fast. Transforming the edge-pairs into polygons is also very fast. But merging the resulting polygons took a huge amount of RAM and took hours to execute.

My code in python currently looks like this:

    gaps = original_region.width_check(distance)
    gap_polygons = gaps.polygons()
    # though unnecessary, uncommenting the below line proves that the time is actually being spent on merge, not sizing
    # gap_polygons.merge()
    gap_polygons_buffered = gap_polygons.sized(buffer)

Do you have any suggestions on how to speed this up?

sebastian-goeldi commented 1 year ago

This can be a lot faster if you break it down to tiles. I think both manually doing it or using the tiling processor should be quite fast if it's only merge that takes a long time. (Here I am assuming that your check size is small compared to the region)

klayoutmatthias commented 1 year ago

@tvt173 code is one part, a GDS test case was helpful to discuss this topic. As @sebastian-goeldi pointed out, using tiling mode may speed it up (also because you can easily parallelize it).

When I debugged the GF180 DRC I found a similar case when people used a 15µm distance check to find empty spaces larger than this distance - essentially the idea was that everything where there is no such marker must be empty space of that kind.

I feel you're trying something similar.

In the GF180 case the typical distances of features are more in the order to 500nm. So 15µm will create markers from one feature to many many neighbors - for each feature there are hundreds if not thousands of markers. If you inspect the output you will see a plethora of polygons, overlapping a hundred times. No algorithm can be expected to sort out such a mess quickly. Even tiling mode will only help you to control the memory footprint, but not substantially bring down the processing times.

I don't know what problem you are trying to solve, but I'm pretty sure you can rethink your solution. Again, a test case and a description of the problem was helpful.

Matthias

tvt173 commented 1 year ago

Thanks @klayoutmatthias and @sebastian-goeldi for the feedback... Is it possible to run with the TilingProcessor directly within python, or must I use a ruby script for this?

A workaround I tried (which was much faster) was to substitute the width check with a shrink-grow + negate operation to remove sections less than a given width (and then get those again via negate). But results were slightly different and not quite as good for my needs... The width check results are perfect. The results I get have a few dozen "islands" of results, each with many overlapping edge-pair polygons. I'm thinking what may be a quicker way to do this would be to identify and group each of those islands first, create a new region from each one, and then merge each individual island. Will experiment a bit more...

sebastian-goeldi commented 1 year ago

@tvt173 yes that's possible. I have a more advanced example here https://github.com/gdsfactory/kfactory/blob/main/src/kfactory/utils/violations.py#L53 (sorry for linking kfactory, but it's more or less directly what you want ;) ). In this you can just treat c as a normal Cell of klayout.db and c.kcl is the equivalent of c.layout().

I thought this is now also in the dataprep of gdsfactory, but I can't find it @joamatab . He might also still have the pure klayout examples I wrote for him.

tvt173 commented 1 year ago

nice. thanks for sharing @sebastian-goeldi ! i'll give it a try... i think it might indeed help to speed up my problem!

tvt173 commented 1 year ago

i tried the TilingProcessor, but I didn't see a significant speedup (to put it more accurately, I killed the job after ~30 min, after it became clear it would not finish anytime soon 😄 ) ... it also still maxes out my RAM after a few minutes and does not seem to take advantage of multi cores, even when specified (and yes, I do have enough physical cores). see code below

def _size_with_tiling(region: Region, amount, tile_size=1000, overlap=20):
    tp = TilingProcessor()
    # tp.frame = region.bbox()
    tp.dbu = 0.001
    tp.tile_size(tile_size, tile_size)
    tp.tile_border(overlap, overlap)
    tp.input("reg", region)
    tp.threads = 8
    output = mp.Region()
    tp.output("reg_out", output)
    tp.queue(f"_output(reg_out, reg.sized({amount}))")
    tp.execute("Merging region")
    return output

(I originally wrote for just the merged() operation, but later tried doing the sizing here also...)

It is trying to merge ~336k shapes. I confirmed that trying to merge through the Klayout GUI also crashes klayout. Merging a region that looks to be ~1/20th of that finishes almost instantaneously. But trying to merge ~1/8th of the full GDS similarly crashes the application. (I also tried wiggling around the tiling parameters in the script above a bit, but to no avail...)

I tried with gdstk as a comparison. gdstk successfully merges the full layer within a few seconds (almost instantaneously, relatively...). Here is my script:

import gdstk

input_filename = "input.gds"
output_filename = "output.gds"
print('read gds...')
lib = gdstk.read_gds(input_filename)
top_cell = lib.top_level()[0]
polygons = top_cell.get_polygons(layer=20000, datatype=0)
print('merging...')
merged_polys = gdstk.boolean(polygons, [], operation="or", precision=0.001, layer=20001, datatype=0)
print('writing gds...')
top_cell.add(*merged_polys)
lib.write_gds(output_filename)

It seems like there is either a scaling issue or bug in the merge operation in klayout... It doesn't seem to make sense for it to be taking this long.

Sorry I don't yet have a sample GDS... preparing something shareable would also take some time...

sebastian-goeldi commented 1 year ago

@tvt173 I don't have a solution, but, be careful with tiles and regions. Regions are in dbu and tile size, border etc are in um (I suspect because it comes from drc and was then made availabe in python at a later point).

The fact that you say it's only using 1 core indicates to me that your tile size might be big compared to the are you are trying to tile. Without an example it is hard to help more

klayoutmatthias commented 1 year ago

@tvt173 I'm sorry, no test case, no support. Other users manage to prepare their testcases in a way they can share it and I am talking about top notch companies. You can also send me a layout file privately and I will keep it hidden, that's a deal others can live with.

I don't want to waste my time guessing whether that is a bug or not or what's wrong with the data or the measurement function or other features and I don't want to speculate what's wrong compared to gdstk. And if you think gdstk is better suited, then you're of course free to use that. I just think that's a wasted opportunity and contrary to the open source spirit.

Matthias

tvt173 commented 1 year ago

Thanks @sebastian-goeldi . Yes, I'm also suspicious that I messed something up in the parameters to the TilingProcessor... I get a bit confused sometimes which of these parameters are dbu/microns and keeping the right order of magnitude... Will try to play around a bit more later, maybe on a smaller test case.

Thank you @klayoutmatthias , I didn't mean to be disrespectful of your time. I'm trying to be as helpful as I can by giving the details of what I've tried as I try them, and grounding the results vs a comparable tool I think is an important part of that, to try to get a first sense if this appears to be a problem with Klayout or a difficult-to-handle case in general... But seeing that a comparable tool can do this fairly quickly, leads me to believe that this is a problem with Klayout, and hence I want to be sure to raise it to you, especially since merging is such a fundamental operation... It will be beneficial to all of us to make sure it is efficient as it can be.

Sorry, I will try to prepare a sharable GDS file. It might just take a bit of time.

klayoutmatthias commented 1 year ago

It should not be that way, but I am trying to reproduce the problem with an artificial test case. File is attached here:

issue-1366.zip

This is how I understand your problem (DRC script):

verbose

l1in = input(1, 0)
l2in = input(2, 0)

gaps = l1in.space(1.5.um).polygons
gaps.output(1000, 0)

gaps_merged = gaps.merged
gaps_merged.output(1001, 0)

gaps_sized = gaps_merged.sized(0.5)
gaps_sized.output(1002, 0)

l2_inside = l2in.not_interacting(gaps_sized)
l2_inside.output(1003, 0)

With the input file this gives me the following times consistent with your observation:

"input" in: issue_1366.lydrc:4
    Polygons (raw): 361408 (flat)  361408 (hierarchical)
    Elapsed: 0.010s  Memory: 4056.00M
"input" in: issue_1366.lydrc:5
    Polygons (raw): 9 (flat)  9 (hierarchical)
    Elapsed: 0.010s  Memory: 4056.00M
"space" in: issue_1366.lydrc:7
    Edge pairs: 2159306 (flat)  2159306 (hierarchical)
    Elapsed: 29.980s  Memory: 4293.00M
"polygons" in: issue_1366.lydrc:7
    Polygons (raw): 2159306 (flat)  2159306 (hierarchical)
    Elapsed: 0.970s  Memory: 4459.00M
"output" in: issue_1366.lydrc:8
    Polygons (raw): 2159306 (flat)  2159306 (hierarchical)
    Elapsed: 0.380s  Memory: 4564.00M
"merged" in: issue_1366.lydrc:10
    Polygons (raw): 1 (flat)  1 (hierarchical)
    Elapsed: 210.550s  Memory: 6611.00M
"output" in: issue_1366.lydrc:11
    Polygons (raw): 1 (flat)  1 (hierarchical)
    Elapsed: 0.130s  Memory: 6611.00M
"sized" in: issue_1366.lydrc:13
    Polygons (raw): 1 (flat)  1 (hierarchical)
    Elapsed: 22.070s  Memory: 6611.00M
"output" in: issue_1366.lydrc:14
    Polygons (raw): 1 (flat)  1 (hierarchical)
    Elapsed: 0.030s  Memory: 6611.00M
"not_interacting" in: issue_1366.lydrc:16
    Polygons (raw): 4 (flat)  4 (hierarchical)
    Elapsed: 0.070s  Memory: 6611.00M
"output" in: issue_1366.lydrc:17
    Polygons (raw): 4 (flat)  4 (hierarchical)
    Elapsed: 0.020s  Memory: 6611.00M
Total elapsed: 264.710s  Memory: 6611.00M

Layer 1000 looks like that:

Layer 1001 is a hug polygon with about 6M vertices any a million holes:

I assume the main performance killer is the hole formation. Point taken.

If my picture is correct, then it should be possible to optimize by turning the interaction check around:

verbose

l1in = input(1, 0)
l2in = input(2, 0)

gaps = l1in.space(1.5.um).polygons
gaps.output(1000, 0)

l2_inside = l2in.sized(0.5).not_interacting(gaps)
l2_inside.output(1003, 0)

This gives me:

"input" in: issue_1366b.lydrc:4
    Polygons (raw): 361408 (flat)  361408 (hierarchical)
    Elapsed: 0.010s  Memory: 3325.00M
"input" in: issue_1366b.lydrc:5
    Polygons (raw): 9 (flat)  9 (hierarchical)
    Elapsed: 0.010s  Memory: 3325.00M
"space" in: issue_1366b.lydrc:7
    Edge pairs: 2159306 (flat)  2159306 (hierarchical)
    Elapsed: 30.020s  Memory: 3580.00M
"polygons" in: issue_1366b.lydrc:7
    Polygons (raw): 2159306 (flat)  2159306 (hierarchical)
    Elapsed: 1.000s  Memory: 3746.00M
"output" in: issue_1366b.lydrc:8
    Polygons (raw): 2159306 (flat)  2159306 (hierarchical)
    Elapsed: 0.390s  Memory: 3861.00M
"sized" in: issue_1366b.lydrc:10
    Polygons (raw): 9 (flat)  9 (hierarchical)
    Elapsed: 0.070s  Memory: 3861.00M
"not_interacting" in: issue_1366b.lydrc:10
    Polygons (raw): 2 (flat)  2 (hierarchical)
    Elapsed: 1.470s  Memory: 3945.00M
"output" in: issue_1366b.lydrc:11
    Polygons (raw): 2 (flat)  2 (hierarchical)
    Elapsed: 0.040s  Memory: 3945.00M
Total elapsed: 33.370s  Memory: 3945.00M

The space check can be optimized by skipping shielding:

gaps = l1in.space(1.5.um, transparent).polygons

which reduces the "space" runtime to 6.7s and the total time to 10.14s.

So bottom line is that it's "merge" which generates a huge polygon. Even if the formation of that polygon can be optimized and made fast (I guess so), processing such a polygon further is painful. The key to optimization is rethinking the solution in terms of local operations. "size" isn't one for large polygons with many holes.

Matthias

tvt173 commented 1 year ago

Hi @klayoutmatthias ,

Sorry for the delay and the back and forth... Here is a stripped down test case. For me this took about 5 min and 3~4 GB of RAM to merge through the Klayout GUI (Edit -> Layer -> Merge).

In gdstk (the script I shared above), it finished in ~0.16s with negligible memory consumption.

The uncompressed file is only 44 KB ( I only compress here because Github will not allow me to share a gds file 😄 ). I haven't yet read thoroughly through your comment above, so I'm not yet sure if our two cases are similar, but hopefully this example file helps pinpoint the issue.

debug_merge_8.zip

Thanks, Troy

sebastian-goeldi commented 1 year ago

6M vertices :eyes:

Also, I think to truly make this problem scalable it would probably help to not only do the sizing in the TilingProcessor but the whole process, i.e. the width check and then the sizing operation (just set the border size of the tiles accordingly. Also, unless I am wrong (please correct me Matthias ;) ), gap_polygons_buffered = gap_polygons.sized(buffer) <- this will copy your whole region, so if one region is at the edge of what your memory can tolerate, your swap/pagefile will suffer here ;) (even if only for a short time). unless you really still need the unaltered region object, gap_polygons_buffered = gap_polygons.size(buffer) should be what you probably want.

And lastly: I noticed TilingProcessors are faster if you let them do the assembling of regions by only passing them the layer index & cell_index (this guy here TilingProcessor(string name,const Layout layout,unsigned int cell_index,unsigned int layer) (whenever possible ofc). I don't have evidence for this, but just something I noticed when writing some violation fixes.

tvt173 commented 1 year ago

To call out some similarities and differences between your test case and mine

Mine also produced by outputting width_check(...).polygons() of a structure
No holes in this example
Curvilinear edges/very high number of overlapping polygons is potentially problematic
Kissing corner in the middle appears to possibly be part of the problem. Other similar-looking cases without it ran quickly. But simply isolating and merging the ~3 polygons near the corner do not reproduce the issue. This was the smallest example I was able to isolate.

sebastian-goeldi commented 1 year ago

Interesting, I just ran it and can confirm.

Here's my little python script I used

import klayout.db as kdb

from time import time

l = kdb.Layout()
l.read("debug_merge_8.gds")

reg = kdb.Region(l.top_cell().begin_shapes_rec(l.layer(20000, 0)))

reg_merged = kdb.Region()

t1 = time()
reg_merged2 = reg.merged()
t2 = time()

for poly in reg.each():
    reg_merged += kdb.Region(poly)
    reg_merged.merge()
t3 = time()

print(f"{t2-t1=}, {t3-t2=}")
print(f"{(reg_merged == reg_merged2)=}")

print(reg_merged)
print(reg_merged2)

t2-t1 is the normal merge [sec] t3-t2 is the bootleg merge of adding one polygon at the time [sec]

The runtime is vastly different. Also of note the two regions aren't exactly the same (I assume some off-grid intersection points get rounded slightly different).

t2-t1=144.92284774780273, t3-t2=0.03259611129760742
(reg_merged == reg_merged2)=False

Additional info:

Using one of the latest master commits as the python module.
python 3.11

tvt173 commented 1 year ago

Thanks for confirming, Sebastian. I tried your "bootleg merge" as a potential workaround for the time being, but unfortunately that python loop struggles too in the real case with ~500k shapes to iterate over 😄 ...

However, memory consumption does not blow up in the bootleg case, as it does in the plain merge case. I note that there is functionality in this method to restrict the merge based on number of overlaps... I'm wondering if that is maybe what is blowing it up, in this case where there are so many overlaps, if perhaps it is something like an O(m * n) operation, where m is # overlaps and n is # shapes. If that is the case, maybe it would be worthwhile to have an optimized version in the case of a simple merge, where we aren't counting number of overlaps? (only speculating, but I'd be happy if you could educate me on what's actually happening)

klayoutmatthias commented 1 year ago

Thanks for the test case first of all.

As far as I see there is no particular thing that goes wrong. The booleans are simply not made for that kind of data.

Basically the scanline algorithm decomposes the area into smaller regions bounded by non-intersecting edges. In your case, the millionfold edge intersections are all going to be resolved which creates a huge number of temporary edges and wastes a lot of CPU only to find finally there is hardly anything do to.

The advantage of the current approach is to be generic and suitable for a large number of applications. One code fits all.

I'm sure something can be done, but that will need a redesign of the boolean engine.

Matthias

tvt173 commented 1 year ago

I see... Thanks for taking a look, Matthias. I understand that this case is probably somewhat atypical, in that there is such a high density of overlapping shapes, which seems to be causing this issue. But I think for this particular use case, of wanting to get the aggregate error shapes highlighted by a width or space check, the current behavior of the merge function should often be pretty non-ideal, especially for curvilinear geometries, no? For me, this is something I do pretty commonly. For example, even just for viewing the errors, you can see that these ~500 edge pairs actually just represent one small region. I'd rather merge them into polygons first before viewing them, such that I only have a handful of errors to browse through, rather than thousands.

I'm wondering if maybe there is a more efficient way already to do this in klayout (i.e. maybe we can merge the edge pairs first before converting to polygons). Or maybe a simpler approach like @sebastian-goeldi 's "bootleg merge" would be more efficient in this case, and it could be used in a new method to convert edge pairs to merged polygons. I'd also be interested to take a closer look at what gdstk is doing (I believe the clipper library actually does those booleans under the hood) for contrast.

klayoutmatthias commented 1 year ago

I'm up to something ... maybe there is actually something I can do.

I'll keep you updated.

Matthias

klayoutmatthias commented 1 year ago

With small code change I got the times down to 1.4s (i7-11800H @ 2.30GHz) with a memory foot print of 16MB for @tvt173 testcase. Originally the runtime was 302s at a memory footprint of 1.1G on my machine.

That's not the 0.16s yet from gdspy, but a lot closer.

Matthias

klayoutmatthias commented 1 year ago

Another update: Down to 0.45s, maintaining +16M memory footprint ... :)

I guess there is not much I can do beyond that without compromising the structure of the code.

tvt173 commented 1 year ago

that's awesome, @klayoutmatthias ! I'm curious, what was the change? and under which conditions should we expect a speedup vs before?

klayoutmatthias commented 1 year ago

Actually you were right in a way that scaling wasn't good. The scanline algorithm works on cells that are formed by sorting edges in y and x direction and focussing on that specific region. In your testcase with the manifold edge intersections at low angles, this region would cover many edges and specifically many pairs of edges are visited multiple times.

The first optimization was to consider each pair of edges only once and specifically inserting their intersection point only once into the list of intersections. This dramatically reduced the memory footprint and also gave the first performance boost.

The second fix was to somewhat optimize the way intersection points would consider neighboring edges: due to snapping the edge topology can change. To control this effect, intersection points can capture edges running close by and in order to find these edges another loop is needed within each scanline cell. This loop now considers each intersection point once which gave roughly a 3x performance improvement.

Overall the extreme effect is only seen in your case - the other case I have in my test suite show some effect, but a much less pronounced one. For example, the Gerber import also performs a merge between manifold overlapping layouts created by expanding the aperture paths. I see somewhat like a 20% performance effect.

Still it is good to have improved the performance in general as the situation is not entirely unknown. Yet I'd try to avoid it if possible.

Matthias

tvt173 commented 1 year ago

Awesome, thanks for the help and explanation!

tvt173 commented 1 year ago

Thanks again for your help here @klayoutmatthias . Just thought I'd give an update... After running 0.28.8 on the full GDS file, I see runtime has dropped to roughly 2 min. So it is still about an order of magnitude slower than gdstk (which finished in a few seconds), but it is much, much faster than before (now it is definitely a tolerable amount of time for the application and makes it possible to keep the whole thing in klayout). Thanks!

klayoutmatthias commented 1 year ago

That's basically good news :)

I can easily make things faster by dropping the step I mentioned above (attracting foreign edges running close to an intersection point). That step however is key to deterministic snapping behavior and topological correctness of the output. I think your case is not quite representative and in general the expected average performance gain IMHO does not justify sacrificing these benefits. I did not see a single failing boolean in the last couple of years and I mostly attribute that to that edge treatment (proof missing, but solid gut feeling).

Another plus is that I understand this code and I am able to maintain it. Which I can't claim for boost.polygon for example.

I will keep an eye on this though. At least there is a precious test case now :)

Matthias

KLayout / klayout

Slow merging of polygons from width check edge pairs #1366