Fix performance bug in thread_local_var

thread_local_var performs poorly when a thread_local_var object is created in each test iteration. The poor performance is caused by two oversights:

thread_local_ctx::thread_local_alloc always grows the entries_ vector, even if an existing entry is unallocated (i.e. thread_local_free was previously called on it). The entries_ vector is not cleared between iterations either. This causes O(iterations) memory usage (the size of the entries_ vector) and adds O(iterations^2) time overhead in total (in iteration_begin).
thread_local_var::~thread_local_var never calls thread_local_ctx::thread_local_free. This means that, even if the above issue were fixed, the performance bug could persist, because entries would never be freed for reuse.

Squash both of these bugs:

Make thread_local_var::~thread_local_var call thread_local_ctx::thread_local_free if necessary.
Make thread_local_ctx::thread_local_alloc reuse freed entries.
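
The two fixes above can be sketched roughly as follows. This is a minimal, standalone illustration and not the actual Relacy code: `toy_tls_ctx`, `toy_tls_var`, `alloc_slot`, `free_slot`, and `slots_after` are hypothetical names, and the sketch assumes that each variable allocates its slot eagerly in its constructor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for thread_local_ctx; not Relacy's real class.
class toy_tls_ctx {
public:
    // Reuse the first freed entry; grow the vector only if none is free.
    std::size_t alloc_slot() {
        for (std::size_t i = 0; i != entries_.size(); ++i) {
            if (!entries_[i].alive) {
                entries_[i].alive = true;
                return i;
            }
        }
        entries_.push_back(entry{true});
        return entries_.size() - 1;
    }

    // Mark an entry as unallocated so a later alloc_slot() can reuse it.
    void free_slot(std::size_t index) {
        assert(index < entries_.size() && entries_[index].alive);
        entries_[index].alive = false;
    }

    std::size_t entry_count() const { return entries_.size(); }

private:
    struct entry { bool alive; };
    std::vector<entry> entries_;
};

// Hypothetical stand-in for thread_local_var: the destructor frees the slot.
class toy_tls_var {
public:
    explicit toy_tls_var(toy_tls_ctx& ctx) : ctx_(ctx), index_(ctx.alloc_slot()) {}
    ~toy_tls_var() { ctx_.free_slot(index_); }
    toy_tls_var(const toy_tls_var&) = delete;
    toy_tls_var& operator=(const toy_tls_var&) = delete;
    std::size_t index() const { return index_; }

private:
    toy_tls_ctx& ctx_;
    std::size_t index_;
};

// Simulates `iterations` test iterations, each creating one variable,
// and returns how many entries the context ended up holding.
std::size_t slots_after(std::size_t iterations) {
    toy_tls_ctx ctx;
    for (std::size_t i = 0; i != iterations; ++i) {
        toy_tls_var var(ctx);  // slot is freed when var goes out of scope
    }
    return ctx.entry_count();
}
```

In this sketch the entry count stays at one regardless of the iteration count, because every iteration frees its slot before the next one allocates. Without the destructor call, or without the reuse loop, the vector would gain one entry per iteration.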

Unfortunately, this commit has one negative consequence: Win32-style TLS use-after-free goes undetected in some cases.
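
One plausible way such a miss can arise is sketched below. It is an assumption made for illustration, not a description of Relacy's actual checks: the hypothetical `toy_slot_table` can only tell whether a slot is currently allocated, so a stale index stops looking invalid once its slot has been handed out again. All names in the sketch are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical slot table; a stand-in, not Relacy's thread_local_ctx.
class toy_slot_table {
public:
    // Reuse the first freed slot; grow only when every slot is in use.
    std::size_t alloc_slot() {
        for (std::size_t i = 0; i != alive_.size(); ++i) {
            if (!alive_[i]) {
                alive_[i] = true;
                return i;
            }
        }
        alive_.push_back(true);
        return alive_.size() - 1;
    }

    void free_slot(std::size_t index) {
        assert(index < alive_.size() && alive_[index]);
        alive_[index] = false;
    }

    // Assumed check: an index is accepted if its slot is currently allocated.
    bool looks_valid(std::size_t index) const {
        return index < alive_.size() && alive_[index];
    }

private:
    std::vector<bool> alive_;
};

struct stale_index_report {
    bool detected_before_reuse;
    bool detected_after_reuse;
};

// Frees a slot, keeps the old index around, and checks whether using that
// stale index would be flagged before and after the slot is reallocated.
stale_index_report probe_stale_index() {
    toy_slot_table table;
    std::size_t stale = table.alloc_slot();
    table.free_slot(stale);
    bool before = !table.looks_valid(stale);  // slot is free: misuse is visible

    std::size_t fresh = table.alloc_slot();   // reuses the same slot
    (void)fresh;
    bool after = !table.looks_valid(stale);   // slot is live again: misuse is hidden

    return stale_index_report{before, after};
}
```

Under these assumptions the stale index is flagged only while its slot remains free. A table that never reused slots would keep flagging it, which is the trade-off between detection and memory growth.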
