I'm looking into the problem of scaling SourceBrowser right now. Currently, running our entire codebase through the indexer takes ~28 hours on a pretty decent machine. I see two options:
1) Federation
2) Incremental build
The former has the problem that federated text search is messy. It also requires reasonable decisions about how to break up the codebase, plus ongoing maintenance of every individual instance. Given this, I'm focused on the second option.
My analysis is that there are two halves to this issue:
1) Which assemblies should be re-indexed? I see this as a separate question. Some other tool should either feed this information to SourceBrowser, or signal it by deleting indices that should be rebuilt.
2) How to rebuild an assembly's index? This is actually quite complex, because indices of assemblies modify each other's folders in ways that are not always easy to trace.
An intermediate goal I am considering, then, is to deepen the division between the two phases of SourceBrowser's indexer (Pass1-Generation and Pass2-Finalization) by introducing two invariants (sketched below):
1) The ProjectGenerator at the heart of Pass1, as well as anything it creates (DocumentGenerators, etc.), shall write ONLY to the relevant assembly directory.
2) The ProjectFinalizer at the heart of Pass2, as well as anything it creates, shall read from assembly directories, and write only to a new set of "output" directories, from which the website is actually run.
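A minimal sketch of what these invariants could look like in code; the real ProjectGenerator and ProjectFinalizer have richer APIs, and the names below only illustrate the intended directory discipline:

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch of the two invariants; names and signatures are
// illustrative, not SourceBrowser's real types.
class ProjectGenerator
{
    private readonly string assemblyDirectory; // e.g. index\mscorlib

    public ProjectGenerator(string assemblyDirectory)
    {
        this.assemblyDirectory = assemblyDirectory;
    }

    // Invariant 1: every write from Pass1 lands under this assembly's own directory.
    public void WriteIntermediate(string relativePath, string contents)
    {
        var path = Path.Combine(assemblyDirectory, relativePath);
        Directory.CreateDirectory(Path.GetDirectoryName(path));
        File.WriteAllText(path, contents);
    }
}

class ProjectFinalizer
{
    // Invariant 2: Pass2 reads from any assembly directory but writes only to "output".
    public void Finalize(IEnumerable<string> assemblyDirectories, string outputDirectory)
    {
        if (Directory.Exists(outputDirectory))
        {
            Directory.Delete(outputDirectory, recursive: true); // output is disposable by design
        }
        Directory.CreateDirectory(outputDirectory);
        foreach (var assemblyDirectory in assemblyDirectories)
        {
            // read intermediates from assemblyDirectory; emit HTML under outputDirectory only
        }
    }
}
```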
In my mind, this has a few benefits:
1) Deleting an assembly directory signals to Pass1 that regeneration is needed. Pass2 can be re-run each time, deleting and rebuilding its output directories as it does so (see the sketch after this list).
2) Intermediate files (i.e. the .txt files) can be left in place without intermingling with the HTML that makes up the website. This makes cleaning them up less pressing, and can make debugging easier.
3) Because the "index" and "website" structures are separate, each can be specialized for its own purpose.
4) Parallelism becomes simpler, because during Pass1 no two ProjectGenerators ever need to interact.
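For benefit 1, the selection logic stays trivial under these invariants; a minimal sketch, assuming one index directory per assembly (names are hypothetical):

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: an assembly needs Pass1 if and only if its index
// directory is absent; an external tool deletes directories to request rebuilds.
static class RegenerationCheck
{
    public static IEnumerable<string> AssembliesNeedingRegeneration(
        IEnumerable<string> assemblyNames, string indexRoot)
    {
        foreach (var assembly in assemblyNames)
        {
            if (!Directory.Exists(Path.Combine(indexRoot, assembly)))
            {
                yield return assembly; // run Pass1 for this one; Pass2 always runs over everything
            }
        }
    }
}
```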
Given that this is a fairly significant refactor, however, I wanted to gather feedback from others who have worked on the project. Am I missing some danger in this approach? Is there a better approach to gaining incremental builds? Is there a better approach to improving performance? Alternative ideas that might save me a lot of refactoring would be greatly appreciated.
Whatever you end up with, I won't take it back as a PR. Nothing personal, it's just that the project in this form is kind of done, and taking it back means I'm responsible for maintaining it and fixing problems in it. There are multiple approaches to refactoring the backend. But of course I'd welcome such a development in a fork, and maybe I can help with some general guidance. And if the fork is successful, I'd advertise it on the front page to let people know about it.
I suspect that the cheapest (hackiest?) way to speed things up might be indexing multiple projects in parallel. Right now we parallelize documents within a project, but the projects themselves are indexed serially, one after another. I'm curious whether there are quick wins there if you just do Parallel.ForEach and tweak the degree of parallelism. Of course the biggest challenge is that you need to throw in a whole bunch of locks to protect data structures that can be accessed concurrently. For instance, two projects, A and B, both use StringBuilder from mscorlib. When adding a reference from A to StringBuilder and from B to StringBuilder, you need to surround each addition with a lock so that they don't overwrite each other.
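A minimal sketch of that shape, assuming a hypothetical IndexProject entry point and a shared incoming-reference map (none of these are SourceBrowser's real types):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical sketch of project-level parallelism; IndexProject and
// referencesToSymbol stand in for whatever the indexer actually uses.
class ParallelIndexer
{
    // Shared map: symbol id (e.g. StringBuilder's) -> incoming reference entries.
    private readonly Dictionary<string, List<string>> referencesToSymbol =
        new Dictionary<string, List<string>>();
    private readonly object referencesLock = new object();

    public void IndexAllProjects(IEnumerable<string> projectPaths)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = 4 }; // tune per machine
        Parallel.ForEach(projectPaths, options, IndexProject);
    }

    private void IndexProject(string projectPath)
    {
        // Per-document indexing proceeds as today; only shared state needs guarding.
        AddReference(symbolId: "T:System.Text.StringBuilder", referenceEntry: projectPath);
    }

    private void AddReference(string symbolId, string referenceEntry)
    {
        // Without the lock, two projects adding references to the same symbol
        // can lose each other's writes.
        lock (referencesLock)
        {
            if (!referencesToSymbol.TryGetValue(symbolId, out var list))
            {
                list = new List<string>();
                referencesToSymbol[symbolId] = list;
            }
            list.Add(referenceEntry);
        }
    }
}
```

The degree of parallelism is worth tuning experimentally: indexing is a mix of CPU (Roslyn) and disk I/O, so the sweet spot may land above or below the core count.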
I'd also be curious to profile and see where we're spending most of the time: MSBuild? Roslyn? Writing HTML to disk? Writing references? Maybe buying fast SSDs, or hybrid drives that use RAM as storage, could help with indexing speed. Or maybe sacrifice some features, give up on MSBuild or TypeScript support?
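As a crude first cut before reaching for a real profiler, wrapping each suspected phase in a timer would at least rank the candidates; the phase names and wrapped calls below are guesses, not real SourceBrowser APIs:

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: time one phase of the pipeline and report it.
static class PhaseTimer
{
    public static T Time<T>(string phase, Func<T> work)
    {
        var stopwatch = Stopwatch.StartNew();
        var result = work();
        Console.WriteLine($"{phase}: {stopwatch.Elapsed}");
        return result;
    }
}

// Usage (illustrative; LoadSolution/Analyze/WriteHtml are placeholders):
// var solution = PhaseTimer.Time("MSBuild load", () => LoadSolution(path));
// var model    = PhaseTimer.Time("Roslyn analysis", () => Analyze(solution));
// PhaseTimer.Time("HTML write", () => { WriteHtml(model); return 0; });
```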
Another tip I can give is to store symbol reference information at the source, not at the destination. In other words, store "outgoing" references at the usage site, not "incoming" references at the definition. So instead of each assembly having a huge R folder where all incoming references pile up, each assembly should have a subfolder for each referenced assembly, and pile up the references (.txt files) per assembly. For example, your assembly A would have R\mscorlib... Basically each assembly would store all its outgoing references (all API used from all referenced assemblies) and carry them with it. This way you can delete and re-add the assembly without disturbing the reference graph. Of course the biggest problem then is finding references to StringBuilder - you no longer have a precompiled list of all incoming references, so you need to recompute it: basically visit all assemblies that reference mscorlib, ask each of them "do you have a reference to StringBuilder?", then concatenate all the results. For incrementality this may be really useful - only reindex the assemblies that changed, then do a pass that reassembles all references (much faster now, just a bunch of concats, no need to call Roslyn again).
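A sketch of what that storage scheme might look like, assuming a hypothetical indexRoot\assembly\R\referencedAssembly\symbolId.txt layout (the exact file naming is illustrative; real symbol ids would need to be hashed or sanitized into valid file names):

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical per-assembly outgoing-reference store. All names are illustrative.
static class ReferenceStore
{
    // Pass1: each assembly writes only into its own R\ subtree ("carry your own references").
    public static void AppendOutgoingReference(
        string indexRoot, string fromAssembly, string toAssembly,
        string symbolId, string referenceLine)
    {
        var directory = Path.Combine(indexRoot, fromAssembly, "R", toAssembly);
        Directory.CreateDirectory(directory);
        File.AppendAllText(Path.Combine(directory, symbolId + ".txt"), referenceLine + "\n");
    }

    // Reassembly pass: "find all references to StringBuilder" visits every assembly's
    // R\mscorlib folder and concatenates the per-symbol files - no Roslyn involved.
    public static IEnumerable<string> CollectIncomingReferences(
        string indexRoot, string targetAssembly, string symbolId)
    {
        foreach (var assemblyDirectory in Directory.EnumerateDirectories(indexRoot))
        {
            var file = Path.Combine(assemblyDirectory, "R", targetAssembly, symbolId + ".txt");
            if (File.Exists(file))
            {
                foreach (var line in File.ReadLines(file))
                {
                    yield return line;
                }
            }
        }
    }
}
```

The reassembly pass then becomes pure file concatenation, which is why changed assemblies can be deleted and reindexed in isolation without disturbing the rest of the graph.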
Are you sure that a 28-hour delay is completely unacceptable? Maybe if people expect that an index may be a day or two old they can just deal with it? And even if you bring the indexing time down from 28 hours to 5 hours, is it worth it? Is it that much of a difference? Maybe if you set up dual deployment right (staging/production/swap) you can ensure a relatively constant stream of index updates? Just have two sites flip-flopping: show one while the other is being reindexed.