This implements both suggestions in #148:

- Each work-list entry now contains an offset indicating how much more of the given object remains to be traced. This allows large objects to be pushed onto the work stack in constant time and space.
- CC now performs a full DFS mark/unmark loop for each individual item pushed onto the stack. This means that the order in which objects are traced, as well as the amount of space required for CC, should be identical to 6c1af00c2f3bc970ba46b0a4f4b90e840308b550 and before.

In terms of benchmarks, `triangle-count` was previously using 30% more space due to the CC work list (as measured in the discussion here), but now appears to use the same amount of space as it used to. I am still measuring that `dedup-strings` consumes more space than it did under 6c1af00c2f3bc970ba46b0a4f4b90e840308b550, but only at large repeat counts. This suggests the difference is due to CC timing issues rather than the space consumption of CC itself. The space consumption of the other benchmarks I've tried doesn't seem to be much affected.