Replace temporary workaround in MFU computation

In #215, get_theoretical_flops_per_token was introduced with a temporary workaround that runs the computation only when GPUs are present, because the underlying function get_total_number_of_trainable_parameters requires GPUs. In principle, however, get_theoretical_flops_per_token depends only on the model architecture.

The if statements could be removed if the CPU tests mocked get_total_number_of_trainable_parameters.
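
A minimal sketch of what such a CPU test might look like, using unittest.mock from the standard library. Both functions below are stand-ins written for this example: the signatures, the FLOPs formula, and the parameter names (n_layer, n_embd, sequence_length) are assumptions, not the actual Modalities code. The only point is that once the parameter count is mocked, the FLOPs computation runs without GPUs.

```python
from unittest import mock


def get_total_number_of_trainable_parameters(model):
    # Stand-in for the real helper, which (per this issue) requires GPUs.
    raise RuntimeError("requires GPUs")


def get_theoretical_flops_per_token(model, n_layer, n_embd, sequence_length):
    # Hypothetical signature and formula (6*N + 12*L*d*T), for illustration only.
    num_params = get_total_number_of_trainable_parameters(model)
    return 6 * num_params + 12 * n_layer * n_embd * sequence_length


def test_flops_per_token_on_cpu():
    # Replace the GPU-dependent helper with a mock returning a fixed count.
    fake_count = mock.Mock(return_value=1_000)
    with mock.patch.dict(
        globals(), {"get_total_number_of_trainable_parameters": fake_count}
    ):
        flops = get_theoretical_flops_per_token(
            model=None, n_layer=2, n_embd=8, sequence_length=16
        )
    # The mocked helper was used instead of the GPU-dependent one.
    fake_count.assert_called_once_with(None)
    assert flops == 6 * 1_000 + 12 * 2 * 8 * 16
    return flops


test_flops_per_token_on_cpu()
```

In the real test suite, mock.patch (or pytest's monkeypatch) would presumably target the module in which get_theoretical_flops_per_token looks up the helper; patching globals() here only keeps the sketch self-contained.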

Replace temporary workaround in MFU computation #238