Implement multiprocessor TLB shootdown on i686

On x86 processors, flushing the TLB on one CPU has no effect on other CPUs. This is a problem when one process has multiple threads running on different CPUs. More significantly, flushing a TLB entry in the system address space requires flushing it on every running CPU. To get around this, the CPU that changes a mapping has to send inter-processor interrupts (IPIs) to the other CPUs to perform a "TLB shootdown". Each target CPU receives the IPI, flushes the TLB entry, and then continues executing. Without shootdown, other CPUs can keep using stale translations, so it is necessary for stability on SMP and NUMA systems. Since the i686 microkernel can now start multiple CPUs and get them running in the kernel, the next logical step is providing a stable multiprocessor environment.

TLB shootdown using only IPIs is relatively simple to implement. Each process structure will have several fields related to TLB shootdown, and there will be global copies of these fields for global shootdowns (in the system address space). When a CPU flushes the TLB, it will first try to acquire the shootdown spinlock. If it fails to get the lock, it will wait for the in-progress shootdown to begin and then check whether that shootdown covers the same region. If it does, the thread will do nothing and let the shootdown IPI come to it. Otherwise, it will try to acquire the lock again, this time spinning on it with no timeout. Once the thread holds the lock, it will broadcast a shootdown IPI to all other CPUs. The other CPUs will receive the interrupt, read the range to invalidate, and invalidate it.
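The lock-then-check protocol above can be sketched in user-space C11 as follows. This is only an illustration under assumptions: `struct tlb_shootdown`, `tlb_shootdown_begin`, `tlb_shootdown_end`, and the `active` flag are hypothetical names, not the kernel's actual fields, and the IPI broadcast itself is left to the caller.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-process (or global) shootdown state. */
struct tlb_shootdown {
    atomic_flag lock;     /* the shootdown spinlock */
    atomic_bool active;   /* true once start/end have been published */
    uintptr_t start, end; /* range being invalidated: [start, end) */
};

enum shootdown_role { SHOOTDOWN_INITIATOR, SHOOTDOWN_COVERED };

/* Record the range and signal that the shootdown has begun. */
static void shootdown_publish(struct tlb_shootdown *sd,
                              uintptr_t start, uintptr_t end)
{
    sd->start = start;
    sd->end = end;
    atomic_store_explicit(&sd->active, true, memory_order_release);
}

/*
 * Returns SHOOTDOWN_INITIATOR if the caller now holds the lock and must
 * send the IPIs, or SHOOTDOWN_COVERED if a shootdown already in progress
 * includes the requested range.
 */
static enum shootdown_role tlb_shootdown_begin(struct tlb_shootdown *sd,
                                               uintptr_t start, uintptr_t end)
{
    /* First attempt: a single try-lock. */
    if (!atomic_flag_test_and_set_explicit(&sd->lock, memory_order_acquire)) {
        shootdown_publish(sd, start, end);
        return SHOOTDOWN_INITIATOR;
    }

    /* Lock is held: wait until the holder has published its range. */
    for (;;) {
        if (atomic_load_explicit(&sd->active, memory_order_acquire)) {
            /* Simplification: a real kernel must also guard this read
             * against the holder finishing concurrently. */
            if (sd->start <= start && end <= sd->end)
                return SHOOTDOWN_COVERED;
            break; /* different region: fall through and spin */
        }
        /* The holder may have released before we saw it; retry the lock. */
        if (!atomic_flag_test_and_set_explicit(&sd->lock, memory_order_acquire)) {
            shootdown_publish(sd, start, end);
            return SHOOTDOWN_INITIATOR;
        }
    }

    /* Second attempt: spin with no timeout. In a kernel, incoming
     * shootdown IPIs must still be serviced while spinning here. */
    while (atomic_flag_test_and_set_explicit(&sd->lock, memory_order_acquire))
        ;
    shootdown_publish(sd, start, end);
    return SHOOTDOWN_INITIATOR;
}

/* Called by the initiator once every target CPU has invalidated. */
static void tlb_shootdown_end(struct tlb_shootdown *sd)
{
    atomic_store_explicit(&sd->active, false, memory_order_release);
    atomic_flag_clear_explicit(&sd->lock, memory_order_release);
}
```

A caller that gets `SHOOTDOWN_INITIATOR` would broadcast the IPI, wait for the other CPUs to finish, and then call `tlb_shootdown_end`. How completion is detected (for example, a per-shootdown acknowledgement counter) is not specified above and is left out of the sketch.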

However, what's described above is only the non-optimized scheme. A few optimizations will increase scalability by keeping every CPU from being interrupted by IPIs all the time. The most important one is called lazy TLB invalidation. If a stale TLB entry can only cause a page fault, then no IPI is sent; instead, the page fault handler will check for a stale TLB entry and, if it finds one, invalidate it locally. This removes the need to send an IPI when a page changes from not present->present, read-only->read-write, or non-executable->executable, eliminating many of the IPIs that would otherwise be sent to every processor.
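The decision of when lazy invalidation applies can be sketched as a pure function over the old and new page table entries. This is a minimal sketch, assuming 64-bit PAE-style entries so that the NX bit is available; `pte_change_needs_shootdown` and the `PTE_*` macro names are hypothetical, and the frame-change check is an extra case added here that the description above does not list.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 PAE page table entry bits (macro names are illustrative). */
#define PTE_PRESENT (1ULL << 0)
#define PTE_WRITE   (1ULL << 1)
#define PTE_NX      (1ULL << 63)
#define PTE_FRAME   0x000FFFFFFFFFF000ULL /* physical frame address bits */

/*
 * Returns true when other CPUs could keep using a stale translation
 * without faulting, so an IPI is required. Returns false when a stale
 * entry can only cause a page fault, which the fault handler can fix.
 */
static bool pte_change_needs_shootdown(uint64_t old_pte, uint64_t new_pte)
{
    if (!(old_pte & PTE_PRESENT))
        return false; /* not present -> anything: stale state only faults */
    if (!(new_pte & PTE_PRESENT))
        return true;  /* unmapping a present page */
    if ((old_pte & PTE_FRAME) != (new_pte & PTE_FRAME))
        return true;  /* remapped to a different physical frame */
    if ((old_pte & PTE_WRITE) && !(new_pte & PTE_WRITE))
        return true;  /* read-write -> read-only */
    if (!(old_pte & PTE_NX) && (new_pte & PTE_NX))
        return true;  /* executable -> non-executable */
    return false;     /* permissions only increased: handle lazily */
}
```

Only changes that take something away (presence, the frame, write access, or execute access) return true; every permission increase falls through to the lazy path.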

The other optimization relies on the Local APIC's logical destination register, which allows IPIs and other APIC-based interrupts to be sent to only a subset of CPUs. An interrupt sent to a logical destination is delivered only to CPUs whose logical destination register shares at least one set bit with that destination. The logical APIC ID held in that register is 8 bits wide in xAPIC mode. Each process will have a field that holds the bitwise OR of the logical destinations of all CPUs running one of its threads. How this field works depends on the number of processors. On systems with 8 processors or fewer, each bit can represent one processor, so the field reflects exactly which processors are running a thread in the process. On systems with more processors, each Local APIC's logical destination register will be programmed with the single bit selected by its currently running process ID mod 8, so each bit stands for one of the eight possible results. Though not as precise as the scheme for 8 processors or fewer, it will still eliminate many unneeded IPIs.
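The two ways of assigning logical destination bits can be sketched as below. This is an illustration only: `cpu_logical_id`, `process_dest_mask`, and `ipi_reaches` are hypothetical helpers, CPU indices are assumed to be 0-based, and actually programming the Local APIC registers is omitted.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Logical destination bit for a CPU. With 8 CPUs or fewer, each CPU owns
 * one bit. With more, the bit is chosen by the ID of the process the CPU
 * is currently running, mod 8.
 */
static uint8_t cpu_logical_id(unsigned cpu, unsigned ncpus, unsigned pid)
{
    if (ncpus <= 8)
        return (uint8_t)(1u << cpu);   /* requires cpu < 8 */
    return (uint8_t)(1u << (pid % 8));
}

/*
 * Per-process destination mask: the bitwise OR of the logical IDs of
 * every CPU currently running one of the process's threads.
 */
static uint8_t process_dest_mask(const unsigned *cpus, size_t count,
                                 unsigned ncpus, unsigned pid)
{
    uint8_t mask = 0;
    for (size_t i = 0; i < count; i++)
        mask |= cpu_logical_id(cpus[i], ncpus, pid);
    return mask;
}

/* A logical-mode IPI is accepted when destination and logical ID share a bit. */
static int ipi_reaches(uint8_t dest, uint8_t logical_id)
{
    return (dest & logical_id) != 0;
}
```

In the more-than-8 case, every CPU running the process carries the same bit, so the mask has a single bit set. CPUs running an unrelated process whose ID is congruent mod 8 also receive the IPI, which is why that scheme is less precise.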

Implement multiprocessor TLB shootdown on i686 #1392