Schindler-EPFL-Lab / thermo-nerf

ThermoNeRF (Thermographic NeRF)
https://malcolmmielle.github.io/publication/thermonerf

mlflow.exceptions.MlflowException: Got invalid value 21.12470817565918 for metric 'psnr' (timestamp=1722997810318). Please specify value as a valid double (64-bit floating point) #7

Open Neal2020GitHub opened 2 months ago

Neal2020GitHub commented 2 months ago

Hi authors!

I encountered this error when running python scripts/train_eval_script.py. Any idea how to solve it? Thank you.

mlflow.exceptions.MlflowException: Got invalid value 21.12470817565918 for metric 'psnr' (timestamp=1722997810318). Please specify value as a valid double (64-bit floating point)

MalcolmMielle commented 2 months ago

Thanks for letting us know! We will have a look in the upcoming weeks

MalcolmMielle commented 2 months ago

Are you training concat-Nerf?

If so, could you try changing line 225 of concat_nerfacto_model.py to metrics_dict["psnr"] = float(self.psnr(predicted_rgb, gt_rgb).item()) instead of metrics_dict["psnr"] = self.psnr(predicted_rgb, gt_rgb)?

Just a suspicion, it might not be the problem.
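
For anyone who wants to apply the same idea to every metric rather than only psnr, here is a minimal sketch. It assumes the metrics are collected in a dict before being logged; to_plain_float and FakeScalarTensor are hypothetical names used for illustration and are not part of thermo-nerf, nerfstudio, or mlflow. The stand-in class only mimics the .item() method of a single-element torch.Tensor so that the snippet runs without torch installed.

```python
def to_plain_float(value):
    """Return value as a built-in float.

    Objects exposing .item() (for example a single-element torch.Tensor or a
    NumPy scalar) are unwrapped first. A tensor with more than one element
    makes .item() raise, which is the desired behaviour for a scalar metric.
    """
    if hasattr(value, "item"):
        value = value.item()
    return float(value)


class FakeScalarTensor:
    """Hypothetical stand-in for a 0-dim torch.Tensor (demo only)."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


# Hypothetical metrics dict: one tensor-like entry, one plain float.
metrics_dict = {"psnr": FakeScalarTensor(21.12470817565918), "step_time": 0.5}

# Convert every entry before handing the dict to the logger.
clean = {key: to_plain_float(value) for key, value in metrics_dict.items()}
print(clean)
```

Converting at the point where the metric is produced keeps the installed mlflow package untouched, so the change survives a reinstall or a rebuild of the Docker image.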

YunSeok-Kang commented 2 months ago

Hi @Neal2020GitHub,

I also encountered the same issue and was able to resolve it using the method I'll describe below. I'm not entirely sure if it's the best solution, but it worked for me, so I thought I'd share it with you.

The error you're seeing seems to be related to how mlflow handles metric values, particularly when dealing with tensors in PyTorch. I made a modification to the _validate_metric function in the mlflow/utils/validation.py file. The updated code includes a check to see if the value is a PyTorch tensor, and if so, it converts the tensor to a float using the .item() method. Here's the code I used:

def _validate_metric(key, value, timestamp, step):
    """
    Check that a metric with the specified key, value, timestamp, and step is valid and raise an
    exception if it isn't.
    """
    _validate_metric_name(key)
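    # Added check: unwrap a PyTorch tensor into a plain Python number.
    # The type is matched by name so that torch does not need to be imported here.
    # Note that .item() only works for tensors holding a single element.
    if type(value).__module__ == 'torch' and type(value).__name__ == 'Tensor':
        value = value.item()

    # value must be a Number
    # since bool is an instance of Number check for bool additionally
    if not _is_numeric(value):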
        raise MlflowException(
            f"Got invalid value {value} for metric '{key}' (timestamp={timestamp}). "
            "Please specify value as a valid double (64-bit floating point)",
            INVALID_PARAMETER_VALUE,
        )

    if not isinstance(timestamp, numbers.Number) or timestamp < 0:
        raise MlflowException(
            f"Got invalid timestamp {timestamp} for metric '{key}' (value={value}). "
            "Timestamp must be a nonnegative long (64-bit integer) ",
            INVALID_PARAMETER_VALUE,
        )

    if not isinstance(step, numbers.Number):
        raise MlflowException(
            f"Got invalid step {step} for metric '{key}' (value={value}). "
            "Step must be a valid long (64-bit integer).",
            INVALID_PARAMETER_VALUE,
        )

This change helped me resolve the issue by ensuring that the value is correctly processed as a float, which prevented the MlflowException you mentioned. I hope this solution works for you as well, but please let me know if you find a better approach or if you encounter any further issues!

MalcolmMielle commented 2 months ago

@YunSeok-Kang nice hack! I think, though, that the root cause is probably in ThermoNerf, so maybe we can track it down together. I'm under the impression that somewhere a tensor is passed to mlflow instead of a float.

Does the problem happen when running ThermoNerf or ConcatNerf? I suspect ConcatNerf.
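
To narrow down which metric is the culprit, a small diagnostic like the one below could be run on the metrics dict just before it is logged. This is only a sketch: is_numeric re-implements the rule described by the comments in the _validate_metric code quoted above (a numbers.Number that is not a bool), and find_non_numeric and the Tensor stand-in class are hypothetical names for illustration. As far as I know, a 0-dim torch.Tensor formats like a plain number inside an f-string, which would explain why the error message shows 21.12470817565918 even though the value is rejected.

```python
import numbers


def is_numeric(value):
    """Rule taken from the comments in the quoted code: Number, but not bool."""
    return isinstance(value, numbers.Number) and not isinstance(value, bool)


def find_non_numeric(metrics):
    """Map each offending metric name to the type name of its value."""
    return {
        name: type(value).__name__
        for name, value in metrics.items()
        if not is_numeric(value)
    }


class Tensor:
    """Hypothetical stand-in for torch.Tensor so the demo runs without torch."""


# Hypothetical metrics dict mixing valid and invalid entries.
metrics = {"psnr": Tensor(), "loss": 0.1, "converged": True}
offenders = find_non_numeric(metrics)
print(offenders)
```

Any name that shows up in the result points at the place in the model code where a conversion to float is missing.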

MalcolmMielle commented 1 month ago

We didn't manage to reproduce this issue. Are you using a newer version of mlflow than the one in the Dockerfile?
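
A quick way to report the installed versions, for comparison with whatever the Dockerfile pins, is sketched below. It uses only the standard library; installed_version is a hypothetical helper name, and the package list is just a guess at what is relevant here.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages assumed to be relevant to this issue.
for name in ("mlflow", "torch"):
    print(name, installed_version(name) or "not installed")
```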