alandefreitas / clang-unformat

A simple tool to infer a .clang-format file from existing code
Boost Software License 1.0
58 stars 13 forks source link

Use Edlib to compute the edit distance #7

Closed Marzelpan closed 1 year ago

Marzelpan commented 1 year ago

When I applied clang-unformat to a code base consisting of 158 C++ files with 51k LOC, it took more than 10 hours on a Core i7 laptop to generate the configuration file, and on an Apple M1 Max it still took 3 h 41 min to complete (both with 8 threads in parallel). Profiling revealed that most of the time is spent calculating the Levenshtein distances. This patch replaces the original algorithm by a call to the corresponding function in Edlib (https://github.com/Martinsos/edlib).

On the Core i7 the patched version of clang-unformat with --parallel 8 takes 21m:41s to complete, and on the M1 the runtime is reduced to 4m:26s.

alandefreitas commented 1 year ago

When I applied clang-unformat to a code base consisting of 158 C++ files with 51k LOC, it took more than 10 hours on a Core i7 laptop to generate the configuration file, and on an Apple M1 Max it still took 3 h 41 min to complete (both with 8 threads in parallel).

Wow :open_mouth:

One of the reasons I wrote this small application is because the alternative I could find was an equivalent python script that was taking me many days to generate the configuration file.

The same project took me 30 minutes with clang-unformat and then 8 minutes with 4 threads.

I've never seen anything like 3h 41min but it's not hard to imagine a situation where this would happen.

Profiling revealed that most of the time is spent calculating the Levenshtein distances.

Yes. That's supposed to happen.

That's actually a good thing because it's a bottleneck that's easy to optimize.

If most of the time were spent on I/O then we would have a more serious problem.

This patch replaces the original algorithm by a call to the corresponding function in Edlib (https://github.com/Martinsos/edlib).

That's nice. How long does it take you now? Is it the same algorithm? Any conflicts with the license?

I guess other heuristics are also possible in large projects. In general, a reasonably large subset of the project should be enough to infer a valid configuration file.

Marzelpan commented 1 year ago

This patch replaces the original algorithm by a call to the corresponding function in Edlib (https://github.com/Martinsos/edlib).

That's nice. How long does it take you now? Is it the same algorithm? Any conflicts with the license?

With this patch it only takes 4.5 minutes on the M1. The algorithm (an extended version of Myers' algorithm, according to the paper linked on the Edlib page) calculates the Levenshtein distance, so that the resulting clang-format configuration is identical to the configuration generated by your algorithm, including the calculated distances. This is really a drop-in replacement. Edlib's license is the MIT license, so I do not see a problem in using it.

alandefreitas commented 1 year ago