NVE / avalanche_ml

Applying machine learning in the Norwegian Avalanche Warning Service

Redefine F1 and RMSE scoring #23

Closed by widforss 3 years ago

widforss commented 3 years ago

This makes for a more robust measurement of F1, precision, recall and RMSE.

The F1 DataFrame should no longer contain empty rows, as it could before; such rows are now simply omitted.
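A minimal sketch of that idea (the function and metric column names here are illustrative, not the repository's actual API): attributes without any scorable data are skipped entirely instead of producing an empty row.

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

def scores_per_attribute(labels: pd.DataFrame, preds: pd.DataFrame) -> pd.DataFrame:
    """Compute F1/precision/recall per attribute, omitting attributes
    that have no scorable data instead of emitting empty rows."""
    rows = {}
    for column in labels.columns:
        y_true = labels[column].dropna()
        y_pred = preds[column].reindex(y_true.index).dropna()
        y_true = y_true.loc[y_pred.index]
        if y_true.empty:
            continue  # omit the row entirely rather than keep a NaN row
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average="weighted", zero_division=0
        )
        rows[column] = {"f1": f1, "precision": precision, "recall": recall}
    return pd.DataFrame.from_dict(rows, orient="index")
```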

The largest difference, though, is in how scores for the features of avalanche problems are calculated. Previously, the scores were computed over the whole column, which meant they were heavily influenced by how accurate the avalanche problem columns "problem_1", "problem_2" and "problem_3" were. Now the scores for these attributes are computed only for rows where the given avalanche problem exists in both the label and the prediction, as sketched below.
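A hedged sketch of how such a mask could look, assuming an avalanche problem "exists" in a row when it appears in any of the "problem_1" to "problem_3" columns; the helper name is hypothetical.

```python
from typing import Optional

import pandas as pd
from sklearn.metrics import f1_score

PROBLEM_COLS = ["problem_1", "problem_2", "problem_3"]

def attribute_f1(labels: pd.DataFrame, preds: pd.DataFrame,
                 problem: str, attribute: str) -> Optional[float]:
    """Weighted F1 for one attribute of one avalanche problem, restricted
    to rows where the problem exists in both the label and the prediction."""
    in_label = labels[PROBLEM_COLS].eq(problem).any(axis=1)
    in_pred = preds[PROBLEM_COLS].eq(problem).any(axis=1)
    mask = in_label & in_pred
    if not mask.any():
        return None  # no overlapping rows: the attribute is omitted
    return f1_score(labels.loc[mask, attribute], preds.loc[mask, attribute],
                    average="weighted", zero_division=0)
```

With a mask like this, the attribute score reflects only how well the attribute is filled in when the problem itself is correctly predicted, which is the "more representative" behaviour described below.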

This has some implications. We will get a more representative score of how the model performs when an avalanche problem is correctly predicted. However, we will have no indication of how the internal attributes perform when the avalanche problem is not predicted. This shouldn't be a problem, since there's not an obvious way to reason about such data. The previous calculation didn't give any indication about this either, as it was mostly a function of the primary avalanche problem prediction.

When combining F1 DataFrames in a k-fold loop, missing columns must be handled, so the demos now contain more complex logic for this; see the sketch below.
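One possible way to combine the per-fold score tables, assuming each fold produces a DataFrame indexed by attribute; the helper name is hypothetical.

```python
import pandas as pd

def combine_fold_scores(fold_scores):
    """Average per-fold score DataFrames that may have differing rows/columns."""
    # concat aligns the frames on the union of columns; scores missing
    # from a fold (because the rows were omitted) become NaN.
    stacked = pd.concat(fold_scores)
    # Group the duplicate index entries (one per fold) and take the mean;
    # NaN is skipped, so each score averages only the folds that produced it.
    return stacked.groupby(level=0).mean()
```

Called as `combine_fold_scores([fold_1_scores, fold_2_scores, ...])`, this yields one table with every attribute that appeared in at least one fold.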