LineaLabs / lineapy

Move fast from data science prototype to pipeline. Capture, analyze, and transform messy notebooks into data pipelines with just two lines of code.
https://lineapy.org
Apache License 2.0

Add opencv package annotation #838

Closed lazargugleta closed 1 year ago

lazargugleta commented 1 year ago

Description

The train method from the machine learning module of the OpenCV library (cv2.ml) is now captured in the code of saved Linea artifacts. The Algorithm class has many subclasses that use this method, and this annotation file covers them all.

How Has This Been Tested?

There are two separate tests in the 'tests/end_to_end/test_opencv.py' file. One tests the K-Nearest Neighbors model, and the other tests the Logistic Regression model. Because the seed is set, the results should always match. In addition, the code of the Linea artifact now reproduces the call to the train method.

Before adding the annotation:

import cv2 as cv
import numpy as np

newcomer = np.random.randint(0, 100, (1, 2)).astype(np.float32)
lr = cv.ml.LogisticRegression_create()
ret, results = lr.predict(newcomer)

After:

import cv2 as cv
import numpy as np

trainData = np.random.randint(0, 100, (25, 2)).astype(np.float32)
responses = np.random.randint(0, 2, (25, 1)).astype(np.float32)
newcomer = np.random.randint(0, 100, (1, 2)).astype(np.float32)
lr = cv.ml.LogisticRegression_create()
lr.train(trainData, cv.ml.ROW_SAMPLE, responses)
ret, results = lr.predict(newcomer)

andycui97 commented 1 year ago

Noticed the tests are failing with

    import lineapy
>   import cv2 as cv
E   ModuleNotFoundError: No module named 'cv2'

This is because we don't have opencv as a dependency. I would look into pytest.importorskip.

We use it for our mlflow test in this branch and PR. https://github.com/LineaLabs/lineapy/pull/829/files
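As a rough illustration of that suggestion, here is a minimal sketch of a test that is skipped rather than failed when cv2 is not installed. The test name and the final assertion are hypothetical and not taken from the PR; the lineapy artifact steps are left out so that only the importorskip pattern is shown.

```python
import pytest


def test_knn_train_and_predict():
    # Hypothetical test: skip (rather than fail) when optional dependencies are missing.
    # importorskip returns the imported module when it is available.
    cv = pytest.importorskip("cv2")
    np = pytest.importorskip("numpy")

    np.random.seed(0)  # fixed seed so the random data is reproducible
    train_data = np.random.randint(0, 100, (25, 2)).astype(np.float32)
    responses = np.random.randint(0, 2, (25, 1)).astype(np.float32)
    newcomer = np.random.randint(0, 100, (1, 2)).astype(np.float32)

    knn = cv.ml.KNearest_create()
    knn.train(train_data, cv.ml.ROW_SAMPLE, responses)
    ret, results, neighbours, dist = knn.findNearest(newcomer, 3)

    # The predicted label should be one of the two training classes.
    assert results[0][0] in (0.0, 1.0)
```

When run under pytest in an environment without opencv, this test should be reported as skipped instead of raising ModuleNotFoundError.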

andycui97 commented 1 year ago

For the other formatting "tests" like black, flake8, isort, and mypy, it's easiest to check these using pre-commit.

I like to run pre-commit run --all locally first to ensure the PR doesn't have formatting issues.

andycui97 commented 1 year ago

And for the PR itself, I noticed that the annotation uses class instance Algorithm, whose API doesn't actually have train: https://docs.opencv.org/3.4/d3/d46/classcv_1_1Algorithm.html#details

It seems like the two models in your test case inherit from the subclass StatModel, which does have train in its API:

https://docs.opencv.org/3.4/db/d7d/classcv_1_1ml_1_1StatModel.html

I may be wrong here, but did you actually intend to use class instance Algorithm?
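One way to check which class in a hierarchy actually defines a method is to walk the method resolution order. The sketch below uses stand-in classes that only mirror the Algorithm / StatModel relationship described in the docs linked above; they are not the real cv2 types, and the helper name classes_defining is hypothetical. Whether the Python bindings expose the same hierarchy would need to be checked against an installed cv2.

```python
import inspect


def classes_defining(obj, method_name):
    """Return names of the classes in obj's MRO that define method_name themselves."""
    return [
        cls.__name__
        for cls in inspect.getmro(type(obj))
        if method_name in vars(cls)
    ]


# Stand-in classes mirroring the documented hierarchy; NOT the real cv2 types.
class Algorithm:
    def save(self, filename):
        pass


class StatModel(Algorithm):
    def train(self, samples, layout, responses):
        return True


class KNearest(StatModel):
    def findNearest(self, samples, k):
        pass


print(classes_defining(KNearest(), "train"))  # ['StatModel']
```

Passing a real model instance such as the result of cv.ml.KNearest_create() to the same helper should show where train is defined in the installed bindings.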

lazargugleta commented 1 year ago

You are correct. However, although StatModel is a parent class of all the listed models and has the train method, the annotation does not work with it, only with the Algorithm class. I tested it again, and here is the output when the class instance in the annotation is StatModel:

import cv2 as cv
import numpy as np

newcomer = np.random.randint(0, 100, (1, 2)).astype(np.float32)
knn = cv.ml.KNearest_create()
ret, results, neighbours, dist = knn.findNearest(newcomer, 3)

And this is with the Algorithm:

import cv2 as cv
import numpy as np

trainData = np.random.randint(0, 100, (25, 2)).astype(np.float32)
responses = np.random.randint(0, 2, (25, 1)).astype(np.float32)
newcomer = np.random.randint(0, 100, (1, 2)).astype(np.float32)
knn = cv.ml.KNearest_create()
knn.train(trainData, cv.ml.ROW_SAMPLE, responses)
ret, results, neighbours, dist = knn.findNearest(newcomer, 3)

As you said, it makes sense that StatModel should be the class instance. Do you have any idea why that is not the case?

lazargugleta commented 1 year ago

Thanks for the hint! I updated the tests with importorskip, and they work flawlessly if the module is not present.

lazargugleta commented 1 year ago

Hey @andycui97, I just reran the pre-commit command locally and it still gives errors in the mypy section. As far as I can tell, I did not touch any of the files mentioned. Do you have any idea why that is happening?

andycui97 commented 1 year ago

@lazargugleta I think the library needs to be cv2.ml and not cv2 for StatModel

andycui97 commented 1 year ago

Not sure about the mypy errors, but we can discuss this in Slack if you're still getting them, since they probably aren't relevant to this PR.

lazargugleta commented 1 year ago

After discussing with @andycui97, we found a difference between opencv versions 4.6.0.66 and 4.5.5.64: the older version does not contain the class StatModel in the module cv2.ml. Although StatModel contains the train method, we will keep Algorithm as the class instance and cv2 as the module, instead of cv2.ml, because that works with more versions of the library.
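The version difference can be checked with a small diagnostic like the sketch below. The function name annotation_target is hypothetical and not part of lineapy or OpenCV; it only reports whether a given module object exposes ml.StatModel, and falls back to the (cv2, Algorithm) pair chosen in this PR. The stand-in objects exist only so the example runs without opencv installed.

```python
from types import SimpleNamespace


def annotation_target(cv_module):
    """Return (module name, class name) a train annotation could match on.

    Hypothetical helper: prefers cv2.ml.StatModel when the installed
    bindings expose it, otherwise falls back to cv2.Algorithm.
    """
    ml = getattr(cv_module, "ml", None)
    if ml is not None and hasattr(ml, "StatModel"):
        return ("cv2.ml", "StatModel")
    return ("cv2", "Algorithm")


# Stand-ins for two installs, so the sketch runs without opencv present.
newer = SimpleNamespace(ml=SimpleNamespace(StatModel=object), Algorithm=object)
older = SimpleNamespace(ml=SimpleNamespace(), Algorithm=object)

print(annotation_target(newer))  # ('cv2.ml', 'StatModel')
print(annotation_target(older))  # ('cv2', 'Algorithm')

# With opencv installed, the real module can be passed instead:
#   import cv2
#   print(cv2.__version__, annotation_target(cv2))
```

Keeping Algorithm in the annotation, as decided above, avoids needing any such version check.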