ACEsuit / mace

MACE - Fast and accurate machine learning interatomic potentials with higher order equivariant message passing.

Update compile test case to use larger test system #310

Closed hatemhelal closed 6 months ago

hatemhelal commented 8 months ago

This PR follows up on the earlier torch.compile support in #300 and makes the input test a bit more realistic by using a system of 64 carbon atoms. It also adds test cases that use the pytest-benchmark plugin to collect timings for the different compilation options.
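For context, one way to construct a 64-atom carbon test system is to repeat the 8-atom conventional diamond cell 2x2x2. This is a hypothetical sketch in plain NumPy (the PR's actual fixture may build the structure differently, e.g. with ASE):

```python
import numpy as np

# Hypothetical construction of a 64-atom diamond-carbon system, analogous
# to the larger test fixture this PR introduces.
a = 3.567  # diamond lattice constant in Angstrom
basis = np.array([  # 8-atom conventional diamond cell (fractional coords)
    [0.00, 0.00, 0.00], [0.50, 0.50, 0.00],
    [0.50, 0.00, 0.50], [0.00, 0.50, 0.50],
    [0.25, 0.25, 0.25], [0.75, 0.75, 0.25],
    [0.75, 0.25, 0.75], [0.25, 0.75, 0.75],
])
# Repeat the cell 2x2x2 to get 8 * 8 = 64 atoms.
shifts = np.array([[i, j, k] for i in range(2)
                   for j in range(2) for k in range(2)], dtype=float)
frac = (basis[None, :, :] + shifts[:, None, :]).reshape(-1, 3) / 2.0
positions = frac * (2 * a)  # Cartesian coordinates in the 2x2x2 supercell
print(positions.shape)  # (64, 3)
```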

One subtle (and possibly controversial) change is that the correctness test (test_mace) now uses torch.testing.assert_allclose, since this applies more permissive, dtype-aware comparison tolerances than asserting torch.allclose directly.
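To illustrate the tolerance difference: torch.allclose uses fixed defaults regardless of dtype, while the torch.testing helpers pick dtype-aware defaults (assert_allclose is the legacy alias; current PyTorch recommends torch.testing.assert_close, used below):

```python
import torch

# torch.allclose uses fixed defaults (rtol=1e-5, atol=1e-8) for every dtype.
a = torch.tensor([0.0])
b = torch.tensor([1e-6])
print(torch.allclose(a, b))  # False: 1e-6 exceeds the tiny default atol

# torch.testing.assert_close (successor of assert_allclose) picks dtype-aware
# defaults -- for float32, rtol=1.3e-6 and atol=1e-5 -- so this passes.
torch.testing.assert_close(a, b)
```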

Measuring the inference time on an A10G:

| Configuration | Time (ms) | Speedup vs eager fp64 |
| --- | --- | --- |
| Eager fp64 | 65.1 | 1.0 |
| Eager fp32 | 23.4 | 2.8 |
| compile default fp32 | 11.17 | 5.8 |
| reduce-overhead fp32 | 9.75 | 6.7 |
| compile default mixed precision | 8.84 | 7.4 |
| max-autotune fp32 | 6.75 | 9.6 |
| reduce-overhead mixed precision | 4.81 | 13.5 |
| max-autotune mixed precision | 4.25 | 15.3 |
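The measurement methodology can be sketched as a warmup-then-average timing loop. This is a hypothetical stand-in (the PR uses pytest-benchmark and the real MACE model; here a small MLP keeps the example self-contained and CPU-runnable):

```python
import time
import torch

def bench(fn, x, warmup=3, iters=10):
    # Warm up first so one-time costs (allocation, compilation) are excluded,
    # then report the mean wall-clock time per call.
    for _ in range(warmup):
        fn(x)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - t0) / iters

# Stand-in model, NOT the actual MACE architecture.
def make_model(dtype):
    return torch.nn.Sequential(
        torch.nn.Linear(64, 64), torch.nn.SiLU(), torch.nn.Linear(64, 64)
    ).to(dtype)

x64 = torch.randn(1024, 64, dtype=torch.float64)
t_fp64 = bench(make_model(torch.float64), x64)
t_fp32 = bench(make_model(torch.float32), x64.float())
print(f"fp64: {t_fp64 * 1e3:.3f} ms, fp32: {t_fp32 * 1e3:.3f} ms, "
      f"speedup: {t_fp64 / t_fp32:.1f}x")
```

The same harness applies unchanged to a compiled model, e.g. `torch.compile(model, mode="max-autotune")`, which is how the compile rows above differ from the eager rows.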
hatemhelal commented 8 months ago

As a quick experiment, I tried torch.autocast for mixed-precision fp16/fp32 inference and measured 4.28 ms, which corresponds to a ~15x speedup over eager mode with fp64.
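A minimal sketch of the autocast pattern follows. The experiment above ran fp16/fp32 on an A10G GPU; this version uses the CPU autocast backend with bfloat16 so it runs anywhere, and a plain Linear layer stands in for the model:

```python
import torch

# Hypothetical mixed-precision inference sketch. On a GPU this would be
# torch.autocast(device_type="cuda", dtype=torch.float16); on CPU the
# supported low-precision dtype is bfloat16.
linear = torch.nn.Linear(8, 4)
x = torch.randn(2, 8)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = linear(x)

# Matmul-heavy ops run in the lower precision under autocast.
print(out.dtype)  # torch.bfloat16
```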