flaport / inverse_design

https://flaport.github.io/inverse_design
Apache License 2.0

Slow execution of the generator #1

Closed jan-david-fischbach closed 1 year ago

jan-david-fischbach commented 1 year ago

Hey @flaport, quite an interesting project! Probably my configuration is not quite right, but when I try running the inverse design, executing the generator takes about 15 minutes (I am on an M1 MacBook Pro), which makes this quite unfeasible. Do you experience similar execution times, or is there something wrong with my setup?

jan-david-fischbach commented 1 year ago

For reference, I end up with 7 seconds for 128x128 when using a naive numpy and scipy.ndimage.convolve implementation.
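
(For reference, a sketch of the kind of convolution-based dilation such a naive implementation relies on; this is illustrative, not the exact notebook code.)

```python
import numpy as np
from scipy.ndimage import convolve

def circular_brush(diameter):
    """Boolean, roughly disk-shaped brush kernel."""
    r = diameter / 2
    y, x = np.mgrid[:diameter, :diameter]
    return (x - r + 0.5) ** 2 + (y - r + 0.5) ** 2 < r ** 2

def dilate(mask, brush):
    """Every pixel reachable by the brush from a True pixel becomes True."""
    return convolve(mask.astype(float), brush.astype(float), mode="constant") > 0
```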

flaport commented 1 year ago

Hey @Jan-David-Black , that's pretty awesome! It's been a while since I worked on this so I don't fully remember the details, but I do remember that the generator took too long (although I don't think it took 15 min for me). I was gonna give it another shot another day but clearly that didn't happen...

Would you mind if I replace my implementation with yours in this repo? If you agree, maybe you can add the notebook you linked as a PR for proper attribution?

Anyway, thanks for pointing to a better solution :)

jan-david-fischbach commented 1 year ago

@flaport I don't quite understand your code in 07_inverse_design.ipynb. You seem to define a forward function, but it is not used in the optimization scheme. When I try to naively "plug it in" where it seems to belong, I run into issues with autograd and jax not playing nicely together.

flaport commented 1 year ago

Yep, this repository is one of my many unfinished projects 😅

jan-david-fischbach commented 1 year ago

I seem to have gotten it to work: https://jan-david-black.github.io/inverse_design_strict_fabrication/notebooks/inverse_design_local.html It is still quite rough around the edges: because the outer regions are just masked away, the fabrication constraints are violated in some parts around the design-region border. I mostly struggled with differing versions of nbdev :/ The generator speed still impacts the overall runtime quite heavily, especially as it currently scales worse than linearly with pixel count (most parts do scale linearly already, just some bits are missing).

flaport commented 1 year ago

Yeah, this repository uses nbdev<2. The more recent version is incompatible with the old one...

Thanks for getting it to work! I'll check it out soon :)

Are you using your generator or mine? Maybe if I ever find some time I'll rewrite the generator in some faster language.

jan-david-fischbach commented 1 year ago

I am running my generator, which I have extended to use local dilations wherever easily possible. I believe Jax could be used to speed up the local dilations via GPU acceleration; I cannot test that at the moment, however, as I do not have a suitable GPU at hand. I am still struggling a bit with finding the resolving touches after the free touches have been applied (I think it makes sense to just calculate them once after all free touches, but I still have to implement finding the extent for the local dilation, as it depends on the positions of all free touches). Apart from that, I still have a crappy implementation of the heatmap for selecting the next touch (currently involving a convolution in every step). Here is some timing analysis of the current implementation:

The lower graph is summarized by its sum as "linear_ops" in the upper one. "Resolving" in the upper graph stands for calculating the resolving touches after the free touches; it coincides almost perfectly with the time needed for global dilations. A brush of diameter 9 is used.

(figures: total_time and linear_time timing graphs)
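
(For reference, a hypothetical sketch of the convolution-based touch selection mentioned above; names and details are illustrative, not my notebook's exact code.)

```python
import numpy as np
from scipy.ndimage import convolve

def select_next_touch(latent, brush, allowed):
    """Score each pixel by the summed latent values under the brush, pick the best allowed one."""
    heatmap = convolve(latent, brush.astype(float), mode="constant")
    heatmap = np.where(allowed, heatmap, -np.inf)  # exclude pixels that may not be touched
    return np.unravel_index(np.argmax(heatmap), heatmap.shape)
```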

lucasgrjn commented 1 year ago

Hi!

I took a look at the inverse_design repo this summer after @flaport shared it. Unfortunately, I was very busy with some nanofabrication. Now I am more available and determined to make it work, so @Jan-David-Black I hope to be able to help you.

I just finished setting up my env to have Jax-GPU enabled on my desktop, so if you want to do some testing, I can already do it. I may also have an idea to speed up the convolutions; I'll do some tests tomorrow.

Regards, Lucas :)

jan-david-fischbach commented 1 year ago

I have published a small package here: https://pypi.org/project/javiche/ that provides a wrapper decorator to easily use ceviche's autograd-differentiable functions with jax. I would like to make it a dependency going forward, to clean up the interaction between jax and ceviche in the inverse design notebooks a bit. Would that be ok with you @flaport?
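
(For context, a minimal sketch of the general idea behind such a wrapper, with assumed names rather than javiche's actual API: expose a HIPS-autograd-differentiable scalar loss, e.g. a ceviche simulation, to JAX via a host callback plus a custom VJP.)

```python
import autograd
import jax
import numpy as np

def wrap_autograd(fn):
    """Wrap a scalar-valued, autograd-differentiable `fn` so jax.grad works on it."""
    grad_fn = autograd.grad(fn)

    def _value(x):
        x = np.asarray(x)
        return np.asarray(fn(x), dtype=x.dtype)

    def _grad(x):
        x = np.asarray(x)
        return np.asarray(grad_fn(x), dtype=x.dtype)

    @jax.custom_vjp
    def wrapped(x):
        # run the autograd/numpy function on the host, outside of JAX tracing
        return jax.pure_callback(_value, jax.ShapeDtypeStruct((), x.dtype), x)

    def fwd(x):
        return wrapped(x), x

    def bwd(x, ct):
        # cotangent times the gradient computed by autograd on the host
        g = jax.pure_callback(_grad, jax.ShapeDtypeStruct(x.shape, x.dtype), x)
        return (ct * g,)

    wrapped.defvjp(fwd, bwd)
    return wrapped
```

A simulation function wrapped this way can then be dropped into a JAX optimization loop and differentiated with jax.grad or jax.value_and_grad.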

flaport commented 1 year ago

Sure! Go ahead :)

flaport commented 1 year ago

I went a little bit crazy today and rewrote the generator in Rust (using arrayfire). It should be a drop-in replacement (after compilation). It's about 10 times faster for a 30x30 grid and gets progressively faster (relatively speaking) the larger the grid becomes.

Code-wise it's still highly unoptimized, I think (I basically just translated my crappy Python code), but it's pretty cool to see an actual, substantial performance increase. Feel free to check it out... :)

jan-david-fischbach commented 1 year ago

Wow really cool!! Still need to try it out. As far as I can see you still use global dilations, correct? So we might speed it up further using local dilations?

jan-david-fischbach commented 1 year ago

I have been trying to get your Rust implementation to work on my M1 Mac. Somehow I can only get the arrayfire binaries to work from within an x86-compiled program, so I have to run it under Rosetta, which I believe is what kills the performance in this case: for an exemplary 128x128 map the Rust generator takes approximately 77 s. Could you maybe report the time it takes on your system?

flaport commented 1 year ago

Yeah, for me it took about 10 seconds, but let's be fair... that's still too long.

That's why today I chose to completely rewrite the generator from scratch (still in Rust, but without arrayfire). The new version is 'blazingly fast': 128x128 takes about 200 ms.

It's a new implementation that only works in the vicinity of the brush and does not use any convolutions; it therefore scales a lot better with the size of the grid.
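
(For illustration, a rough numpy sketch of what "staying in the vicinity of the brush" can look like; hypothetical names, not the actual Rust code: a touch just stamps the brush onto the local window instead of dilating the whole grid.)

```python
import numpy as np

def apply_touch(mask, brush, touch):
    """Stamp the boolean `brush` kernel onto the boolean `mask`, centered at `touch`."""
    bh, bw = brush.shape
    ty, tx = touch
    y0, x0 = ty - bh // 2, tx - bw // 2
    # clip the brush window to the grid boundaries
    gy0, gy1 = max(y0, 0), min(y0 + bh, mask.shape[0])
    gx0, gx1 = max(x0, 0), min(x0 + bw, mask.shape[1])
    by0, bx0 = gy0 - y0, gx0 - x0
    mask[gy0:gy1, gx0:gx1] |= brush[by0:by0 + (gy1 - gy0), bx0:bx0 + (gx1 - gx0)]
    return mask
```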

Moreover, I no longer use arrayfire and hence it should be easier to compile for you :)

The new version was force-pushed to master (sorry if you had already pulled!); the old version is still in the rust-v1 branch.

lucasgrjn commented 1 year ago

Awesome!

I actually have never used Rust. To give it a try, do I simply need to run a make build within the Rust part of the repo?

flaport commented 1 year ago

Yeah, install rust and cargo first with a tool like https://rustup.rs. Then you can build the repo with a 'cargo build' when you're inside the rust folder. That build command also creates the .so file needed for importing in Python (it should appear somewhere in the 'target' folder created by cargo).

But once you've confirmed cargo works, you can just run a 'make lib' in the root of the repo. That will move the .so file to a more convenient location.

That said, on Windows the file Rust builds might have a different extension, so we might have to make the make commands more cross-platform.

jan-david-fischbach commented 1 year ago

How about using maturin?

lucasgrjn commented 1 year ago

I took some time to really dig deeply into your code @flaport.

I just have a question about the Python part which can be a bottleneck: https://github.com/flaport/inverse_design/blob/5abf24d69953ffb918900cc9a3e2946425b317f9/inverse_design/design.py#L97-L109 Why do you not use a bool mask instead of this function ? I am asking this just in case but, since you were able to cut the time so drastically, I dont think it will be useful. (Except if you use a similar function in Rust ! (For this part, I also take a look. But... I am not really familiar with this language, so, I prefer not to comment this part.)

:)
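
(For what it's worth, a hypothetical illustration of the boolean-mask idea; the actual design.py function is not reproduced here.)

```python
import numpy as np

touches = np.zeros((128, 128), dtype=bool)
touches[40:50, 40:50] = True   # stand-in for the pixels selected by that function

design = np.full((128, 128), 0.5)
design[touches] = 1.0          # loop-free update of only the selected pixels
```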

flaport commented 1 year ago

> How about using maturin?

Cool! I will have to look into this :)

> Why do you not use a bool mask instead of this function?

Yeah... that function is horrible. I even forgot what I was doing there...

I think (apart from that function) the Python code is OK if you want to understand what's going on, but as an implementation it's pretty horrible: too many global convolutions (dilations) and re-calculations.

The new implementation is much better... it does not re-compute anything, and when it needs to dilate anything it stays close to the brush touch (I think that's what @Jan-David-Black meant by local dilations?).

That said, I understand it's a lot more difficult to understand what's going on in an unfamiliar language, but Rust is pretty cool... re-writing parts of existing Python code in Rust is my preferred way to learn it better :)

jan-david-fischbach commented 1 year ago

> The new implementation is much better... it does not re-compute anything, and when it needs to dilate anything it stays close to the brush touch (I think that's what @Jan-David-Black meant by local dilations?).

Yep, exactly. It is also what I had started with in my "local generator" in Python, which takes ~2 s for 128x128 even though I still use global dilations in some places...

I got your new rust-generator to work much more easily (no more arrayfire 🎉). I'll open a pull request to share my maturin configuration, which also makes adding paths and copying around files obsolete :)

flaport commented 1 year ago

> I got your new rust-generator to work much more easily (no more arrayfire 🎉). I'll open a pull request to share my maturin configuration, which also makes adding paths and copying around files obsolete :)

Awesome!

One caveat is that the current version of my algorithm scales pretty badly with brush size. But at least it's better than scaling with the grid size I think.

lucasgrjn commented 1 year ago

Maybe we could make a small benchmark to see how the runtime evolves as a function of size (design and brush?). TBH, we really needed speed improvements! You rock :) It took me less than 20 s for a latent design generation of Figure 6 size (640 px with a circular_brush of 10 px).

> That said, I understand it's a lot more difficult to understand what's going on in an unfamiliar language, but Rust is pretty cool... re-writing parts of existing Python code in Rust is my preferred way to learn it better :)

Learning Rust was on my TODO list, but you just showed me how cool it is!

jan-david-fischbach commented 1 year ago

I have also added a notebook that uses ceviche_challenges in the optimization loop here. So far the optimization is quite bad... Two things we might add to the generator:

* Initial touches: I use a set of solid and void touches at the start of the generation process to ensure the fabrication constraints are also obeyed at the borders and waveguide ports ([e.g. here](https://jan-david-black.github.io/inverse_design_strict_fabrication/notebooks/inverse_design_local.html))
* Symmetry constraints: I still need to figure out how to do this best

lucasgrjn commented 1 year ago

@flaport if you have some spare time: do you plan to add some comments to your Rust code to explain the main idea of the different parts? (I think you use a big brush and then a very big brush around the voxels of the big brush.) It is more to understand the algorithm and its possible caveats rather than the syntax.

flaport commented 1 year ago

Yes, I'll try to add some comments soon.

To answer your question related to the brushes:

* the very big brush (ideally the big brush convolved by the brush) is used as a search area for free and resolving touches.

jan-david-fischbach commented 1 year ago

I did it exactly the same way 👍

lucasgrjn commented 1 year ago

> I have also added a notebook that uses ceviche_challenges in the optimization loop here. So far the optimization is quite bad... Two things we might add to the generator:

I will try to make an example tomorrow also using ceviche_challenges.

> * Initial touches: I use a set of solid and void touches at the start of the generation process to ensure the fabrication constraints are also obeyed at the borders and waveguide ports ([e.g. here](https://jan-david-black.github.io/inverse_design_strict_fabrication/notebooks/inverse_design_local.html))

> * Symmetry constraints: I still need to figure out how to do this best

Cool to see you have already done what you explained to me on Friday! For the symmetries, it won't be easy...

  • The idea of forcing touches on the symmetry boundary won't work, since it will either avoid or force one type...
  • With a symmetry, a half brush size on the boundary would be allowed, but it won't be allowed in a half-domain optimization...

lucasgrjn commented 1 year ago

Thanks for the explanations!

> * the very big brush (ideally the big brush convolved by the brush) is used as a search area for free and resolving touches.

This is the point where I think it may be possible to make use of a mask; I will try to investigate.

flaport commented 1 year ago

Here is an initial performance graph

(image: performance graph)

I don't like how the algorithm scales with brush size at all, but I'm sure there are tons of optimizations possible still. Not to mention that everything works on a single thread currently. Maybe there are parallelizations possible too.

jan-david-fischbach commented 1 year ago

I currently have the following problem. Maybe you know a quick fix: My fork has diverged from the current state of this repo quite a bit. I would like to make a pull request for some small changes to incorporate maturin. Unfortunately I cannot just make a second clean fork... What do I do?

flaport commented 1 year ago

Yeah, that was my mistake. This is a public repo and I treated it as one of my private ones by force pushing something.

Maybe just make the PR anyway and tell me which files you're interested in. I might try to solve it that way.

flaport commented 1 year ago

> Symmetry constraints: I still need to figure out how to do this best

One way to enforce symmetry is to add your transformed latent matrix to a mirrored version of itself.

For example, using

```python
latent_t = latent_t + latent_t[::-1]
```

yields the following symmetric design:

(image: resulting symmetric design)

jan-david-fischbach commented 1 year ago

> One way to enforce symmetry is to add your transformed latent matrix to a mirrored version of itself.

I would assume that adding it to its transpose then leads to a symmetry along the diagonal?
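
(For concreteness, a quick numpy check of both symmetrization ideas, with a random stand-in for the transformed latent design.)

```python
import numpy as np

rng = np.random.default_rng(0)
latent_t = rng.normal(size=(64, 64))   # stand-in for the transformed latent design

latent_ud = latent_t + latent_t[::-1]  # mirror symmetry about the horizontal midline
latent_diag = latent_t + latent_t.T    # symmetry about the main diagonal (square grids only)

assert np.allclose(latent_ud, latent_ud[::-1])
assert np.allclose(latent_diag, latent_diag.T)
```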

lucasgrjn commented 1 year ago

> Symmetry constraints: I still need to figure out how to do this best

> One way to enforce symmetry is to add your transformed latent matrix to a mirrored version of itself.

I fully agree, but in this case we won't gain any time, unfortunately.

jan-david-fischbach commented 1 year ago

As we have a quite fast generator implementation by now, I will close this issue. Feel free to open other issues for the other suggestions made here.

flaport commented 1 year ago

I was able to improve the rust algorithm performance even more:

(image: updated performance graph)

It's nice to see the computation time scaling linearly with the number of pixels :)

lucasgrjn commented 1 year ago

Very nice improvements!