bigmlcom / sensenet


Prediction Difference Between Sensenet and bigml.com #38

Open charleslparker opened 1 year ago

charleslparker commented 1 year ago

Sensenet produces predictions for image models that can differ from what you'd get remotely from bigml.com. This is due to a variety of differences between the production and local settings, enumerated in rough order of importance below.

1.) The BigML prod environment resizes the image to a maximum of 512 x 512 before storing it, using bicubic interpolation. If clients do not do this resizing, or do it using something other than bicubic interpolation, the image will be different (see the resizing sketch after this list).

2.) JPEG compression is applied (quality 90) to the source when it is stored. When used to make a prediction, the source is decompressed. Because JPEG compression is lossy, the pixel values are bound to be different (see the round-trip sketch after this list).

3.) The JPEG standard is underspecified, so the same image decompressed by two different software packages, or even two different versions of libjpeg, might have small differences (https://photo.stackexchange.com/questions/83745/is-a-jpg-guaranteed-to-produce-the-same-pixels#:~:text=https%3A//photo.stackexchange.com/a/83892). The decoder used by Tensorflow, for example, does not by default match the output of the one used by Java, and requires special options to be set (https://stackoverflow.com/questions/44514897/difference-between-the-return-values-of-pil-image-open-and-tf-image-decode-jpeg/60079341#60079341). Pillow's output is different again. So even aside from the rescaling/recompression issues above, a JPEG input is unlikely to decode to exactly the same pixels locally as it does in production, and in my tests the difference is enough to shift the results of a classification model by 1% (see the decoder comparison after this list).

4.) Tensorflow running on different hardware can give different results (https://github.com/tensorflow/tensorflow/issues/19200#issuecomment-388972596). This is not just a CPU vs. GPU problem; it can also occur with different builds of Tensorflow. The central problem is that a deep neural network involves so many operations that even errors in the least significant bit accumulate into something significant, especially because we're only using 32 bits for the math (see the accumulation example after this list). Our test suites have examples where the same test does not give the same output on the same TF version on mac and linux.
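
Regarding (1): a minimal sketch of the kind of resizing a client would need to do to approximate the prod behavior, assuming Pillow. The exact server-side rules (e.g. whether images already within 512 x 512 are left untouched) are an assumption here.

```python
from PIL import Image

def resize_like_prod(path, max_side=512):
    """Shrink an image so neither dimension exceeds max_side, using bicubic
    interpolation, to approximate the prod environment's resizing step."""
    img = Image.open(path)
    width, height = img.size
    scale = max_side / max(width, height)
    if scale < 1.0:  # assumption: smaller images are never upscaled
        new_size = (round(width * scale), round(height * scale))
        img = img.resize(new_size, resample=Image.BICUBIC)
    return img
```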
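
Regarding (2): a sketch of the store-then-predict round trip, again assuming Pillow, that measures how far the pixel values drift after a single quality-90 compression cycle. The input file name is hypothetical.

```python
import io

import numpy as np
from PIL import Image

def jpeg_round_trip(img, quality=90):
    """Encode an image as JPEG at the given quality and decode it again,
    mimicking the compression applied when the source is stored."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

original = Image.open("example.png").convert("RGB")  # hypothetical input
restored = jpeg_round_trip(original)
diff = np.abs(np.asarray(original, dtype=np.int16) -
              np.asarray(restored, dtype=np.int16))
print("mean absolute pixel difference:", diff.mean())
```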
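
Regarding (3): the second link above points at the `dct_method` option of Tensorflow's JPEG decoder. A sketch comparing the default decode, the "INTEGER_ACCURATE" decode, and Pillow's decode of the same bytes; the file name is hypothetical and the exact deltas will depend on the libjpeg builds involved.

```python
import numpy as np
import tensorflow as tf
from PIL import Image

path = "example.jpg"  # hypothetical input

raw = tf.io.read_file(path)
tf_default = tf.io.decode_jpeg(raw).numpy()  # default fast integer DCT
tf_accurate = tf.io.decode_jpeg(raw, dct_method="INTEGER_ACCURATE").numpy()
pil_pixels = np.asarray(Image.open(path).convert("RGB"))

def max_abs_diff(a, b):
    return np.abs(a.astype(np.int16) - b.astype(np.int16)).max()

print("TF default  vs Pillow:", max_abs_diff(tf_default, pil_pixels))
print("TF accurate vs Pillow:", max_abs_diff(tf_accurate, pil_pixels))
```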
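
Regarding (4): a toy illustration, not Tensorflow-specific, of why this happens. Summing the same float32 values in two different orders already disagrees in the low-order bits; a deep network performs vastly more such operations, with the order determined by whichever kernels the particular hardware and build select.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(100_000).astype(np.float32)

# Sequential accumulation in array order vs. NumPy's pairwise summation:
# float32 addition is not associative, so the two orders disagree slightly.
sequential = np.float32(0.0)
for v in values:
    sequential += v
pairwise = values.sum()

print("sequential:", sequential)
print("pairwise:  ", pairwise)
print("difference:", abs(float(sequential) - float(pairwise)))
```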

Whether or not these things merit "fixing" is beyond the scope of the issue, but clearly at least some of them could be mitigated with additional compute time if desired.