Deeplite / neutrino

Public documentation of Deeplite Neutrino™ engine.
Creative Commons Zero v1.0 Universal

Optimisation throws an empty error message #16

Closed BhandarkarPawan closed 2 years ago

BhandarkarPawan commented 2 years ago

I am trying to optimise my custom PyTorch U-Net segmentation model, trained on the Severstal Steel dataset. The model has been trained and its state dict is stored.

I tried to optimise it using the same method as the Unet example. When I run it with the --dry flag, I get the following error:

2022-02-21 07:00:57 - INFO: Starting job with ID: 839134C5
2022-02-21 07:00:57 - INFO: Args: -w, ../working_folder, -m, model.bin, -d, data.pkl, --dry
2022-02-21 07:00:57 - ERROR: Ooops something went wrong, job id 839134C5
2022-02-21 07:00:57 - ERROR:
2022-02-21 07:00:57 - INFO: Job with ID 839134C5 finished
2022-02-21 07:00:57 - INFO: Total execution time: 0:00:00 (d, hh:mm:ss)
2022-02-21 07:00:57 - INFO: Log has been exported to: /root/.neutrino/logs/839134C5-2022-02-21.elog

The error message seems to be empty. When I try to view the log using cat, it just displays gibberish:

Z0FBQUFBQmlFemlscGMyZ0E3NGRDUWZwR3Nvbm5DRWs5SGpQRDZoNWhJYXhkaG1VV3Nhall4RzM4azh4dDVKc0hkbEI4VWNSTVY1WFczbko3S3p2NVpyVFNLV1lCZlVsR0NBMGtlU2t0QnJ6bGNCM04yLWtzdmZaNDlYV051Rm9ZelExRGRWN1VTQ0JDX3QtVjNoTjFnTjV4Z1U1ZUxSU0I1YnQxRS1mVG1TaFJ6SmlmNVVMSF8yd1h4cFNiVDBvNnIteGhxSXMxUzZkN3NabTh0M1l4OWU2UFA4cFRGUm8xYVU0SFN3aG9IdmFnMU5rRXdHVTJUWT0=
Z0FBQUFBQmlFemlsVl9SUmRjcVVyYXBlTzhrZWhBZkpyZmxKWDBfdW5FVGJXUHJXYmlQUktHZnFTaUFDaVJSam4wN3BQekR2VEI1TTUwWnZQX19jdTBlQ3RPYVpaY3ctTC02ZDB2REw2QVhMQ0tzdmZXbUFRZWN2UFoyYmxIcnB0enRjV0xKYnFoMmdtanJPLVhQTFVNVXN5YmRoTElGcHZLam0tNWlJYmtkMnA0TUIxUzh0d01jN0FmdjRRbmtCRllwakMybmRoNGJG
Z0FBQUFBQmlFemlsc1ZIUGlNNHg2S2c1MlhPVHdKd1J0Z0plUG9CazZhUWUtYjJZYVVGWC1TeE1uYW51UWNodTJzY3Y0bzE5NlZ2NjhZNXN4d3JkUUVwMTlaLUk3c05VS3NJeXpMQW1GdGpxVUNuNTkybi1sUlM3OEtXOHFFamZXd3B1ZFAwOER6aHpWX19YdTdvUVV5S2kwNWktZzZ3Y2pvM3NTLWdjSk1fOV9WSC1tUkpteEprV2x2UktCeEt6TU5wdlZUSlB6bVRj
Z0FBQUFBQmlFemlwMklLUWJjTTYxS1VqWkM4YUlhMl94aks1YXpveEctUkdCWnJiR2NKSklUajNDR0JVeTFVNm9ZZFpVQzlQSDhPWjZiaGFobjNKWUxnVXdpQy1Bdm1mX2Zqb1dEVW5QTFZjRlEyLXc1aE1ZNVBaYVBJakZ2WkVWMVZfM2FwUzkwX0lUNzF5cm55S3BwcUxTdjBoak1TVGlybUpVbWgwX0lkRjVyNEpqcG5ELVE4PQ==
Z0FBQUFBQmlFemlwempFUTRJN0t6WVhrSGJ4Y2Uwb2pJYnR2X2JoVnlTaDZ4LTB4d21RclNMSE1OVUFMbHFCLVFrY0d6Und1OTBnOG9ZSWdtdVJMSG1jWGRmMy1tUHNZd1JPd1FBa0oxWWdwU05oZU5Pc1h5ZUtrajY2M0E4MUpCUFFQdmQ3Z1hydXNVZUpteldKcE9sVGFtaTdTMWR6Y0NMNDF2V1BuWDB0WVc5eDQzc09Ob0laajM5NlVYTmxybXlqRnNBLWZxZVBLQVl2cHBjVm5aekFEeTZmMmhzTEkwdz09
Z0FBQUFBQmlFemlwUU5FWjJWUnZUWXBFdEV2c1N6M1R0ekxTT01lcldNRGtQdTR2N3VKQXFrbVpMYlp5c25iNzVuOElUbUUybUNrN0dKdzdiOGptSFFIcmExbVFPTzRvTF9lSjdPV2hucVBONTZ5LWlGYWZsSm1ZcmdIZlM4UFpCOUpwZnlSWEZMVF9YUGFleE53QVZBLUFIV1hmazBfbHRuWjdLdk9MN2RBNVNqaVIzTkItWFVnUUFnalNFQ0MxTERPMW9kX2FyNzlEX0oxRUs4d2FQRF94NjB1Z0FWN2VmTTNyblFKOXk2Zl9BMFppQ0dMZlIySGhuVm1oMkxQM05wWVhmMDFjUk1oS05aSTNxbWNQbEFJUExOM3BnOVlYNGpWdm5uY1ZGXzlJSkJYVy1tbVdDX2s9
Z0FBQUFBQmlFemlwZUVreDJoYUtmTUhicmNOZXM2cjRMWklGU1JnTUdIdzd0bUItSDdQSGdVNkJDVDJpSkp1QXEzWkY4b1dUNWFMRTgzMTk4RjlJQTVhd0dtYlRrRVU2OG5yX0J6ektoanhFSXEzNEcwU2M1X0pYUHdXZ3p1YjhuTUJidEJqZUczWldOV0lXUXNKOTVlTEFhd1dGcmdPam1nWXc2YmVEN1RVRTRyd0c5WEhtMlVwV1gzU3I0aGZfVlNuQXlERDlYMUNj
Z0FBQUFBQmlFemlwdURUanEzeE5zQ3d6dFpJa0Z1Wm1aX3ZVVkRiNzJPa0w2Ty02ekpuWDI5eHZXOVlVaDlSdU5LQ1ZXTkkxRjhzaW1KTWNvSG1QZlNWUjFYaWJody04MXhjN3lIMy1PTllSanBqUF9MSjQwck1tai1ka09WX2dwRUpEQTBjSUswWGM=
Z0FBQUFBQmlFemlwYllBb3I0N2NZazdqUFpmekx5Y0xRcnZhb29qX09jMEtEVWlZSVpnZ0ZSTmxFTjUxNVBmQ0pUZ3lwVEJiQjJSM0dKSjFCR2Y4eTJpWkxIYnBpSDg4ZUc5MWp3cmU2OUhhbjJOSi1RMzd3VHpZdmtRNWVpcGVBejc0UFB6MVAzZ3NXV2JCZlpnazZLQzBvcVhUSUlESlRmN1NKaVNLa3p3VVZEdkNtNmVyQlpGSzZxTGpfNjVYclhkODEzLUpocmVhbDFHOXlmWVVuS3dIVDZlTTctdDFvWTdVSkQ5R2hadUlOR2lZWjB5RUhJVnlsLVJFSTdiVFRsRVZZbUdoZHVQcDIwN2FjQmpWc0tmV2psejJlM2ZOTUZfSlcwR2NSN2FELTRRclZ0elctQ2syMHNMRzZRZzd6VGlfcVdCZjJXaHhlNTE2LTl3d0Vxb2ZuU1dxUDdqMEVmMHI0blZQZDN1dHNreXhzLVFKa1FXOFVwR2t0ZE9UT2M3aFd4cUl1dUFNYkg2ZmtoTXItNkJrbmw3VUhZTG9QRjd0QmpTZ1E0OXR1eXFTWFR3dV93VDhLcmNtclZjWlNVVlpsbG5vdkw1M3lidlctcDhkWjNNUjlXd0h2RHhCdkJ6U0ZMQ3ZtaDhBaFdPZmdXUFNmdzg5MFpEY01pWlNGZ2ZidmFQTVA4aThsRmdxZmhCelNxdi1HeG5WdFNmZ3lLVzV1WVBVaWNrU2tPeFpzTDJic09wNG5pT0NGVEwyNDBEZHlHS0tDdGEzTzJXbzhyWXBtdWp2UEZkNVlRQWxKUWF2U01DLVM4dVV1ZGI5N2l0OTh0OG0xYUc5S1huMEpsMEFwTDdEWHpmQno1QThmMHlsZXZvazlSMjZ0cl93NER3UkF4Yk44RUs5ODlnUkNJQ05ISGtqUXQ2NlpzaTNYaGd5N2sxa1RuR1RGN2twT1FzVGNERFJ4amxqbXBvcjMzWDM0NWQzQkxkbGZKZUJ2Wjd3d1lYa1JNNUFoSmhaNjFtZVhnNmxEc3FMZGVwb1ltZWd0ckRrdWVJc3Zhc3ViVkI4YVBsRTd5ZUpsV2RZQW4wUWZlSV9TeFQzXzVETHRkcldna3FLYmVCUW40R3VlREV2Wl96aGhPSWJsVkVVMDU1eFY3dElzRTVOYVVjUXhaTFFFUmVtY09VRm1lcE9hTjBQOTdKbU91RnFfS2tHOHNFNXBJUl9IekMtSFFNSGwwa0JJVHJuVmVfYzdsMzBjaF80aGZXUk4tLUhiYktrajI3RDJ3aG15LVpZazBnUXdqNVBKR0hzeHJsSGtkWGhNMWdPRzJXSUFWaHBibVhiME5scFlvYnRkSlJIb2RSeGNDdEEtZGU5Qm9hb1dkcThJMV9veU9sVWlEdmw0blVUbDduMTJ3bEZNMHFKcG4yQ0U3bXZrbmlwY2Y1SUtEWUZFTnlsOGhGUjVSRklCUEIyVWZXMzRMVV8weVZKSGhoVlBWQlBJYUxkMjhYVnBqVFhvWFNHZ2RKYWswOVZoUThRWXJKc0Q3WUl0c1VMbEEyQmNaQkdJYWhUUUFseXRTSERvSkFhLWxWM2NScXZjdFBia05aUFh3U2Eza2V4eWFhQlJ2RTdTTkpfVy1ET0FucWh3M0Y1RHNGVnd0TmRKbEVoZFE9PQ==
Z0FBQUFBQmlFemlwdTJlOUFJZ0tmQzZ4ZmF1ZTdIMnp5Z0M2U1YwLXU4Q3hRY1U5dFFMcXBMMXkwMUhSaU9pLTFfRDQzZFIzWllPTDdWZElLZ052b0J0Mm02bDBmLUlDbm1vdDBVeW15UlZTd3plSllxb09YdFh4WDJlYnE1UENPbjkyYmxldUxrVHZ0eFBPLWZBQ3ZZUTBWSWVvdE9uYmRfVDdLOVJaSEtBN3dyN0JNazdfejl3PQ==
Z0FBQUFBQmlFemlwT01VWEd3UU9kZnNnRmJxdGdoN1Y2U1pYWDNqa0lOU0NvSUZNQU9fbUdGNDlxX1NNSkxJQTNoWm4zZ2x6cnN5N1A2YmZJWkpHRzNKZXRKckI3c3FuLTRiRXQwT3VEdzd5QktVd3JfT0dvUTlNX01IdzNIeGQ1c0xwOEg2TExMRERyeDRKLW9xeU9DeWtNeEhmRUsxZWNIOU82UFZ6N1VXajdrZndOVkNVMFhCMGNrVGh4SXlCdkVtSURVUUNxQ0FX
Z0FBQUFBQmlFemlwR3NyZmRiN3liV21fMUxiT0xvMFVMV20zeFkwNkEwOWI3bk94U0J0MEI0TWhZTERpU0FYZXBvSDlEdEhnVGZxZXlIVms0SFdtUDRMdzVsczVreTg0eVZ5bG90blJxSk5NWkVPLVJGaGR1YjQ4WVozYzN4RVRWc2dWcVpjRmFqaW8wNHJfZ2EzRTlxUFZTNjhKT2dxMTduMWljSHJrQ0VUQUhlQWdWdjVuOWw1UDgwUXptU29tVGlhXzRtZ09uVWVOMUNod3NTNGdqb1pNdlQ5NFQ1Njc1ejRnNWdOM2Y3LUlZUWJRMnRDR205cz0=
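As a sanity check, these lines decode as valid base64, and the decoded bytes begin with gAAAA, the characteristic prefix of a Fernet token; this suggests the .elog file is encrypted rather than corrupted. A minimal check on a 4-character-aligned prefix of the first line:

```python
import base64

# First 16 characters of the first .elog line above
prefix = "Z0FBQUFBQmlFemls"
decoded = base64.b64decode(prefix)
print(decoded)  # b'gAAAAABiEzil' -- 'gAAAA...' is the Fernet token prefix
```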

My dataset class:

import os

import numpy as np
from PIL import Image
from torch.utils.data import Dataset

# make_mask is a helper from the training code that returns
# (image_id, mask) for a given row of the dataframe

class SteelDataset(Dataset):
    def __init__(self, df, data_folder, transforms):
        self.df = df
        self.root = data_folder
        self.transforms = transforms
        self.fnames = self.df.index.tolist()

    def __getitem__(self, idx):
        image_id, mask = make_mask(idx, self.df)
        image_path = os.path.join(self.root, image_id)

        img = Image.open(image_path)
        img = np.array(img)
        img = img.astype(np.float32)

        augmented = self.transforms(image=img, mask=mask)
        img = augmented["image"]
        mask = augmented["mask"]  # 256x1600x4
        mask = mask.permute(2, 0, 1)  # 4x256x1600
        return img, mask

    def __len__(self):
        return len(self.fnames)
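The channel reordering in __getitem__ can be checked in isolation; a numpy sketch with a dummy mask (the real code calls torch.Tensor.permute on the albumentations output, which does the same thing):

```python
import numpy as np

# Dummy mask in H x W x C layout, as albumentations returns it
mask = np.zeros((256, 1600, 4), dtype=np.float32)

# Move channels first (C x H x W), the layout PyTorch losses expect;
# equivalent to tensor.permute(2, 0, 1) on a torch.Tensor
chw = mask.transpose(2, 0, 1)
print(chw.shape)  # (4, 256, 1600)
```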

Model Class:

import segmentation_models_pytorch as smp
import torch.nn as nn

class UnetSegmenter(nn.Module):
    def __init__(self, num_classes: int):
        super(UnetSegmenter, self).__init__()
        self.unet = smp.Unet(
            encoder_name="resnet34",
            encoder_weights="imagenet",
            classes=num_classes,
        )

    def forward(self, inputs):
        outputs = self.unet(inputs)
        return outputs

Optimise function:

import torch

# Neutrino, TorchFramework, TorchForwardPass, UNetEval and UNetLoss are
# imported as in the neutrino unet example
def optimize_model(config, model_path: str, dataloaders, num_classes, dry=False):
    state_dict = torch.load(model_path)
    model = UnetSegmenter(num_classes=num_classes)
    model.load_state_dict(state_dict)

    print(f"Model loaded successfully with dry = {dry}")
    print("Data Loaders: ", dataloaders)
    print("Config: ", config)

    model_name = "unet"
    device_map = {"CPU": "cpu", "GPU": "cuda"}
    fp = TorchForwardPass(model_input_pattern=(0, "_", "_"))
    eval_fn = UNetEval(model_name)
    loss_cls = UNetLoss
    loss_kwargs = {"net": model_name, "device": device_map["CPU"]}

    opt_model = Neutrino(
        framework=TorchFramework(),
        data=dataloaders,
        model=model,
        config=config,
        eval_func=eval_fn,
        forward_pass=fp,
        loss_function_cls=loss_cls,
        loss_function_kwargs=loss_kwargs,
    ).run(dryrun=dry)

    return opt_model
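For context, the config argument above is the standard Neutrino config dict. A hypothetical example follows; the keys are taken from the public Deeplite documentation and the values here are placeholders, not the ones used in this run:

```python
# Hypothetical Neutrino config; keys follow the public docs, values are placeholders
config = {
    "deepsearch": False,  # whether to use the deeper search strategy
    "delta": 0.05,        # acceptable drop in the evaluation metric
    "device": "GPU",      # or "CPU"
    "level": 1,           # optimisation level
}
```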

In the above, UNetLoss and UNetEval are copied directly from the neutrino unet example. The optimisation process does not run, and I am not sure how to debug this error, since no information is provided and the logs are unreadable. Please help me debug this.

yasseridris commented 2 years ago

File "/usr/local/lib/python3.8/site-packages/pkg_resources/__init__.py", line 777, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (numpy 1.21.4 (/usr/local/lib/python3.8/site-packages), Requirement.parse('numpy==1.18.5'), {'neutrino-engine'})

It looks like you installed neutrino directly on your system and there is a conflict with numpy. Please reinstall neutrino in a separate virtualenv. You can only see user errors related to bad implementations of the APIs, torch-related errors, etc. Otherwise, all logs are encrypted.
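Before reinstalling, the clash can be confirmed directly. A small hypothetical helper, using the same pkg_resources machinery the traceback above comes from:

```python
import pkg_resources

def find_conflicts(pins):
    """Return the pinned requirements that the active environment violates."""
    conflicts = []
    for pin in pins:
        try:
            pkg_resources.require(pin)
        except pkg_resources.VersionConflict as exc:
            conflicts.append(f"{pin}: have {exc.dist}")
        except pkg_resources.DistributionNotFound:
            conflicts.append(f"{pin}: not installed")
    return conflicts

# neutrino-engine pins numpy==1.18.5 per the traceback above
print(find_conflicts(["numpy==1.18.5"]))
```

An empty list means the pin is satisfied; any entry reproduces the kind of conflict neutrino hit at startup.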