Use the degrees of freedom associated with each cell for load balancing

Instead of each processor owning the same number of cells, each processor owns the same number of dofs. There are two caveats:

1. The initial mesh is partitioned by cell count only, because partitioning happens before the DoFHandler is created. As a result, the load is poorly balanced until the first time the mesh is adaptively refined or material is added.
2. I measured a significant slowdown when refining the mesh. However, this was more than offset by the speedup in evaluate_thermal_physics.
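
To illustrate why weighting by dofs changes the partition, here is a minimal standalone sketch. It is not the code from this PR and does not use deal.II. It assumes that the cells are stored in a fixed linear order, that each rank receives a contiguous range of cells, and that some cells carry zero dofs (for instance, cells that contain no material yet). The names `partition_by_weight` and `weight_per_rank` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Assign each cell, taken in a fixed linear order, to a rank so that the
// summed weight per rank is roughly equal. Each rank gets a contiguous range.
// Illustrative sketch only: this is not how the PR or deal.II implements it.
std::vector<unsigned int>
partition_by_weight(std::vector<unsigned int> const &weights,
                    unsigned int n_ranks)
{
  assert(n_ranks > 0);
  std::uint64_t const total =
      std::accumulate(weights.begin(), weights.end(), std::uint64_t{0});
  std::vector<unsigned int> owner(weights.size(), 0);
  if (total == 0)
    return owner; // no weight at all: everything stays on rank 0

  std::uint64_t prefix = 0; // summed weight of all cells before cell i
  for (std::size_t i = 0; i < weights.size(); ++i)
  {
    // The rank is chosen from the position of the cell's starting offset in
    // the cumulative weight, clamped so trailing zero-weight cells stay valid.
    std::uint64_t const rank = std::min<std::uint64_t>(
        n_ranks - 1, prefix * n_ranks / total);
    owner[i] = static_cast<unsigned int>(rank);
    prefix += weights[i];
  }
  return owner;
}

// Sum the weights (here: dofs) that end up on each rank.
std::vector<std::uint64_t>
weight_per_rank(std::vector<unsigned int> const &weights,
                std::vector<unsigned int> const &owner, unsigned int n_ranks)
{
  std::vector<std::uint64_t> sum(n_ranks, 0);
  for (std::size_t i = 0; i < weights.size(); ++i)
    sum[owner[i]] += weights[i];
  return sum;
}
```

With eight cells whose dof counts are `{0, 0, 0, 0, 8, 8, 8, 8}` and two ranks, passing a weight of 1 per cell splits the cells 4/4, which leaves rank 0 with 0 dofs and rank 1 with 32. Passing the dof counts as weights splits the cells 6/2 and gives each rank 16 dofs.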
