arduino / arduino-builder

A command line tool for compiling Arduino sketches
GNU General Public License v2.0
458 stars 114 forks source link

Don't scan hidden directories (.git, etc) #328

Closed earlephilhower closed 5 years ago

earlephilhower commented 5 years ago

When checking for updated core files, don't scan inside UNIX hidden directories like .git. Scanning inside a GIT tree for a large project with many branches could take many minutes of runtime and multiple GB of RAM. This was seen in the Arduino core for the ESP8266, but is probably also an issue for other core development teams.

Fixes #327

Signed-off-by: Earle F. Philhower, III earlephilhower@yahoo.com

earlephilhower commented 5 years ago

Just did some proper, multi-run testing, and the PR is now properly filtering out .git from its scanning. Runtimes of 1:45 (on a 16-core Xeon server!) and memory usage of 4GB+ are now almost instantaneous and much, much lower RAM usage.

d-a-v commented 5 years ago

Thanks alot @earlephilhower

matthijskooijman commented 5 years ago

Sounds like a good change, but I have a few questions and remarks:

earlephilhower commented 5 years ago

Good notes. I think I can clear up most of them in reverse order of difficulty. :)

Where is this code actually used? I would expect in recursive scanning of scketches and libraries, but doesn't that only happen inside the src subdirectory? Do you have .git directories in there? I'm just trying to understand the problem here.

@d-a-v and I are maintainers of https://github.com/esp8266/Arduino/ , so we use ~/Arduino/hardware/esp8266com/esp8266 as the Arduino code and git repository we work from (branches, checking PRs, etc.) So there is a .git dir in the local hardware tree in our cases. It has something like 3000 commits and dozens of branches, so the .git dir has many, many files.

For normal users, though, your point is valid and they shouldn't have a .git in boards-manager installed tree. This change, for them, should be a no-op.

The actual code change itself is called from the wipeout_build_path... stage where arduino-builder is only trying to see if platform.txt has been changed (or other files with .txt extensions). Without this change, the findAllFilesInFolder does a massive amount of recursion and basically stats the entire GIT tree of 100K files or more, and stores that in memory only to be thrown away on return by the simple file.Ext()==".txt" iteration.

You're now ignoring all files starting with ., but IIRC there's already a list of files to ignore somewhere in the builder code (IIRC it's used to not warn about these files when they show up in a library or something). Perhaps that list could be reused instead?

The problem is any ignore list is only processed after the file list is in memory. It's not that there are some txt files in the git tree which are causing false rebuilds, it's that the builder is taking multiple minutes and GB scanning the GIT tree to build a really massive list of objects only to throw them out. We're trying to short-circuit that.

Making a low-level function like FindAllFilesInFolder skip hidden files might be a bit surprising for users of the function (it's not really returning all files anymore). Perhaps passing a list of files or patterns to ignore explicitly might be better?

You have a point, but I'm not sure it's a valid use case. FindAllFilesInFolder in the arduino-builder context is really looking for inputs to the build process (at multiple stages). I think it could be argued that if your build source files are contained in a hidden directory that you've got some issues, or at least the next guy to manage your code will have them! :)

I think it might be better to have a Bool flag for ignoring hidden directories (as opposed to an ignore list since there are also code coverage tools and others which make hidden dirs to store their artifacts), but the impact of such a change would be significantly larger in terms of code changes (but 0 logic changes, of course) since wipe... actually has 2 levels of indirection before this is called.

Alternatively, wipe... could be rewritten to be more intelligent about the parsing. Now it's doing the equivalent of select * from core-tree; and then dropping everything but .txt files. There should be a way to express select * from core-tree where extension='.txt';. I'm not sure my Go skills are up to that, but if that's what you'd like I could give it a shot.

earlephilhower commented 5 years ago

Looks like this patch was implemented in the replacement arduino-cli:

https://github.com/arduino/arduino-cli/blob/master/arduino/builder/sketch.go#L91

Closing since it's no longer relevant.