deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

Is it possible to train a PyTorch SSD model on an M1 Mac - or is this not yet implemented? PtNDArrayEx.multiBoxPrior(PtNDArrayEx.java:697) UnsupportedOperationException: Not implemented #2693

Open juliangamble opened 1 year ago

juliangamble commented 1 year ago

Description

When running TrainPikachuTest on an M1 Mac I get the error UnsupportedOperationException: Not implemented

Expected Behavior

The TrainPikachuTest runs as expected and a model is produced.

Error Message

Exception in thread "main" java.lang.UnsupportedOperationException: Not implemented
    at ai.djl.pytorch.engine.PtNDArrayEx.multiBoxPrior(PtNDArrayEx.java:697)
    at ai.djl.modality.cv.MultiBoxPrior.generateAnchorBoxes(MultiBoxPrior.java:68)
    at ai.djl.basicmodelzoo.cv.object_detection.ssd.SingleShotDetection.forwardInternal(SingleShotDetection.java:84)
    at ai.djl.nn.AbstractBaseBlock.forwardInternal(AbstractBaseBlock.java:128)
    at ai.djl.nn.AbstractBaseBlock.forward(AbstractBaseBlock.java:93)
    at ai.djl.training.Trainer.forward(Trainer.java:189)
    at ai.djl.training.EasyTrain.trainSplit(EasyTrain.java:122)
    at ai.djl.training.EasyTrain.trainBatch(EasyTrain.java:110)
    at ai.djl.training.EasyTrain.fit(EasyTrain.java:58)
    at ai.djl.examples.training.TrainPikachu.runExample(TrainPikachu.java:93)
    at ai.djl.examples.training.TrainPikachuTest.testDetection(TrainPikachuTest.java:52)
    at ai.djl.examples.training.TrainPikachuTest.main(TrainPikachuTest.java:30)

How to Reproduce?

Run the class TrainPikachuTest on an M1 Mac

Steps to reproduce


  1. Run the TrainPikachuTest class with DJL_DEFAULT_ENGINE=PyTorch

What have you tried to solve it?

  1. Debugging through the code - and looking at the implementation of the class.
  2. Looking for other examples that train a SingleShotDetection model (didn't find any).

Environment Info

DJL_DEFAULT_ENGINE=PyTorch
JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-11.jdk/Contents/Home

zachgk commented 1 year ago

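MXNet has several helper operators specific to SSD, and the DJL SSD model you are using relies on them. Unfortunately, MXNet doesn't support M1, and the model doesn't run on PyTorch because PtNDArrayEx does not implement those operators (multiBoxPrior is the one throwing in your stack trace).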

If you are interested in contributing here, you could build an implementation of SSD that does not rely on those operators or you could add the missing implementations as part of PtNDArrayEx.
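As a rough illustration of what the missing multiBoxPrior operator has to compute, here is a minimal plain-Java sketch of anchor-box generation. It is not DJL or MXNet code: the class name MultiBoxPriorSketch and the anchors method are hypothetical, and the layout (all sizes paired with the first ratio, then the remaining ratios paired with the first size, default step of 1/height and 1/width, default offset of 0.5, no clipping) is an assumption based on a reading of MXNet's multibox_prior.cc that should be checked against that source before relying on it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of SSD anchor generation; not the DJL or MXNet implementation. */
public class MultiBoxPriorSketch {

    /**
     * Returns one {xmin, ymin, xmax, ymax} row per anchor, in relative coordinates.
     * Assumes sizes and ratios each contain at least one element.
     */
    public static float[][] anchors(int height, int width, float[] sizes, float[] ratios) {
        List<float[]> boxes = new ArrayList<>();
        float stepY = 1f / height; // assumed default step
        float stepX = 1f / width;
        for (int r = 0; r < height; r++) {
            float cy = (r + 0.5f) * stepY; // assumed default offset of 0.5
            for (int c = 0; c < width; c++) {
                float cx = (c + 0.5f) * stepX;
                // Every size combined with the first ratio.
                float firstRatio = (float) Math.sqrt(ratios[0]);
                for (float size : sizes) {
                    boxes.add(box(cx, cy, size, firstRatio, height, width));
                }
                // Every remaining ratio combined with the first size.
                for (int j = 1; j < ratios.length; j++) {
                    float sqrtRatio = (float) Math.sqrt(ratios[j]);
                    boxes.add(box(cx, cy, sizes[0], sqrtRatio, height, width));
                }
            }
        }
        return boxes.toArray(new float[0][]);
    }

    private static float[] box(
            float cx, float cy, float size, float sqrtRatio, int height, int width) {
        // Half-extents; the height/width factor keeps boxes square on non-square maps.
        float halfW = size * height / width * sqrtRatio / 2f;
        float halfH = size / sqrtRatio / 2f;
        return new float[] {cx - halfW, cy - halfH, cx + halfW, cy + halfH};
    }
}
```

Under these assumptions each feature-map cell yields sizes.length + ratios.length - 1 anchors, so a 2x2 map with two sizes and three ratios produces 16 boxes. A real implementation in PtNDArrayEx would need to return an NDArray with the shape the SSD block expects, which this sketch does not attempt.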

juliangamble commented 1 year ago

@zachgk thanks for getting back to me. Thanks for creating an opportunity to contribute.

I'm sizing it up, working out a specification and a way to measure whether it is working. For the specification, the closest thing I can find is this source file: https://github.com/apache/mxnet/blob/master/src/operator/contrib/multibox_prior.cc Please let me know if you know of a better one.

For measuring whether it is working, I looked through the tests here and could not find one that covers this operator: https://github.com/apache/mxnet/tree/master/tests/cpp/operator

Can you help me out with how you would measure a working implementation?

zachgk commented 1 year ago

Probably the easiest way to test whether it is working is to use a hard-coded value for inputs and outputs. We have some examples in OptimizerTest.

So, find some known sample data, and then you can put it into the integration suite so that it runs on all engines. This ensures that all engines have matching behavior (including between the MXNet version and your new implementation). It also ensures that the behavior won't change unnoticed, because any change would also require changing the values in the test.
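A hard-coded comparison of this kind usually needs a tolerance, since float results can differ slightly between engines. The following is a minimal plain-Java sketch of such a check; GoldenValueCheck and firstMismatch are hypothetical names, not part of DJL or its test utilities, and the tolerance formula (atol + rtol * |expected|) is an assumed convention.

```java
/** Hypothetical helper for comparing engine output with hard-coded expected values. */
public class GoldenValueCheck {

    /**
     * Returns the index of the first element that differs from the expected value
     * by more than atol + rtol * |expected|, or -1 if every element is within tolerance.
     */
    public static int firstMismatch(float[] actual, float[] expected, float rtol, float atol) {
        if (actual.length != expected.length) {
            throw new IllegalArgumentException(
                    "length mismatch: " + actual.length + " vs " + expected.length);
        }
        for (int i = 0; i < actual.length; i++) {
            float tolerance = atol + rtol * Math.abs(expected[i]);
            float diff = Math.abs(actual[i] - expected[i]);
            // Written as a negated <= so that a NaN difference counts as a mismatch.
            if (!(diff <= tolerance)) {
                return i;
            }
        }
        return -1;
    }
}
```

Reporting the index of the first mismatch, rather than only pass or fail, makes it easier to see whether two implementations disagree everywhere or only in particular rows, such as anchors that use a non-default ratio.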

juliangamble commented 1 year ago

I'll get back to you - I'm writing a test.

juliangamble commented 1 year ago

I've opened a pull request for this: https://github.com/deepjavalibrary/djl/pull/2715 The two unit tests nearly match up, but not quite, so I'm asking for some help with this.