aimacode / aima-python

Python implementation of algorithms from Russell and Norvig's "Artificial Intelligence - A Modern Approach"
MIT License

Add examples and tests for Cross-Validation algorithm #735

Open adityaXXX opened 6 years ago

adityaXXX commented 6 years ago

Hey @norvig, @MrDupin: currently, the learning.ipynb notebook has no examples of cross-validation. We could apply cross-validation to, for example, the perceptron classifier to show how its accuracy can be further improved, and do the same for some of the other machine learning classifiers. We could also include code for GridSearchCV, and then for RandomSearchCV to show how it is computationally cheaper than GridSearchCV.

Should I go for it?

antmarakis commented 6 years ago

About cross-validation, I am not sure at this stage we can do much with the code, unless we rewrite the function. See this.

I have not heard of GridSearchCV before; it's a learning algorithm, right? If it is not in the book, I am not sure whether it should be added to the repository or not. I would only add it if it doesn't need much code (that more-or-less means if it doesn't need many side functions/classes).

adityaXXX commented 6 years ago

So can I add tests for Cross-Validation and include a notebook section for it, i.e. provide examples of its use in the notebook?

ad71 commented 6 years ago

@MrDupin GridSearchCV and RandomSearchCV are methods that find optimal hyperparameters to use with learning algorithms by iterating through different sets of hyperparameters and measuring the performance of the learning algorithm (on the cross-validation dataset) at each step. @adityaXXX as these are functions from the Scikit-Learn library and not from the book, it is advisable not to replicate them in this repository. The cross-validation function needs to be re-written without diverging from the pseudocode for it to be useful, but at this point I am not sure what should be done.
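Conceptually, both boil down to something like the following (a minimal standalone sketch, not scikit-learn's actual API; cv_score is a hypothetical callback that returns the cross-validation score of a learner trained with the given hyperparameters):

import itertools
import random

def grid_search(cv_score, param_grid):
    """Exhaustive search: try every combination of hyperparameter values
    and keep the one with the best cross-validation score."""
    names = list(param_grid)
    best_params, best_score = None, float('-inf')
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_score(params)  # e.g. mean accuracy on the held-out folds
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def random_search(cv_score, param_grid, n_iter=20):
    """Randomized search: sample n_iter combinations instead of trying them
    all, which is usually much cheaper when the grid is large."""
    names = list(param_grid)
    best_params, best_score = None, float('-inf')
    for _ in range(n_iter):
        params = {name: random.choice(param_grid[name]) for name in names}
        score = cv_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

With a 4 x 5 grid, grid_search makes 20 calls to cv_score, while random_search with n_iter=10 makes only 10, which is where the computational saving comes from.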

adityaXXX commented 6 years ago

So should I work on this and add tests for it, @MrDupin? @ad71?

antmarakis commented 6 years ago

On the Cross-Validation, you can go right ahead. I wouldn't do the GridSearchCV stuff just yet though; @norvig needs to weigh in on this too.

On Cross-Validation, just use the code that is already in place. Don't change the learners or the function, since they are all in accordance with the pseudocode. You will find that it is a bit difficult to use cross-validation in its current state, because of its limitations. I think the only algorithm that can be used with it is the K-Nearest Neighbors learner, so start with that first.
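For reference, the k-fold procedure itself is short (a self-contained sketch, independent of the signature of cross_validation in learning.py; train and error_rate are hypothetical callbacks supplied by the caller):

import random

def k_fold_error(train, error_rate, examples, k=10):
    """k-fold cross-validation: split the examples into k folds, train on
    k-1 of them, measure error on the held-out fold, and return the mean
    held-out error.  train(examples) -> predict and
    error_rate(predict, examples) -> float are supplied by the caller."""
    examples = list(examples)
    random.shuffle(examples)
    fold_size = len(examples) // k
    errors = []
    for i in range(k):
        held_out = examples[i * fold_size:(i + 1) * fold_size]
        training = examples[:i * fold_size] + examples[(i + 1) * fold_size:]
        errors.append(error_rate(train(training), held_out))
    return sum(errors) / k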

adityaXXX commented 6 years ago

Hey @MrDupin, after adding a sample test for cross_validation and making some changes to the test_k_nearest_neighbors function and the NearestNeighborLearner classifier (changes in the Jupyter notebook are yet to be made), I got the following result:

aditya@aditya-Inspiron-3558:~/Downloads/aima-python-master/tests$py.test
============================= test session starts ==============================
platform linux -- Python 3.5.2, pytest-3.4.0, py-1.5.2, pluggy-0.6.0
rootdir: /home/aditya/Downloads/aima-python-master/tests, inifile: pytest.ini
collected 226 items                                                            

test_agents.py .......                                                   [  3%]
test_csp.py ...........................                                  [ 15%]
test_games.py ...                                                        [ 16%]
test_knowledge.py ........                                               [ 19%]
test_learning.py ...................F                                    [ 28%]
test_logic.py ....................................                       [ 44%]
test_mdp.py ....                                                         [ 46%]
test_nlp.py ....................                                         [ 55%]
test_planning.py .........                                               [ 59%]
test_probability.py ...................                                  [ 67%]
test_rl.py ...                                                           [ 69%]
test_search.py .................                                         [ 76%]
test_text.py ...............                                             [ 83%]
test_utils.py ......................................                     [100%]

=================================== FAILURES ===================================
____________________________ test_cross_validation _____________________________

    def test_cross_validation():
        iris = DataSet(name = "iris")
        kNN = NearestNeighborLearner()
        errT, errV = cross_validation(kNN, 3, iris, 15, 5)
>       assert errV < 0.2
E       assert 0.24133333333333332 < 0.2

test_learning.py:227: AssertionError
==================== 1 failed, 225 passed in 25.05 seconds =====================

The change I made in NearestNeighborLearner was simply to curry it, so that the learner now returns a fit function instead of taking the dataset directly. Is the result satisfactory? Should I proceed to add a notebook section?

antmarakis commented 6 years ago

What kind of changes did you make to Nearest Neighbors? I wouldn't really like much to be changed there. As I said previously, cross-validation is currently a bit antiquated in the pseudocode and is not very useful as is, so we shouldn't be working around it.

adityaXXX commented 6 years ago

The changes I made to NearestNeighborLearner are as follows:

def NearestNeighborLearner():
    """k-NearestNeighbor: the k nearest neighbors vote."""
    def fit(dataset, k=1):
        def predict(example):
            """Find the k closest items, and have them vote for the best."""
            best = heapq.nsmallest(k, ((dataset.distance(e, example), e) for e in dataset.examples))
            return mode(e[dataset.target] for (d, e) in best)
        return predict
    return fit

Therefore corresponding changes in test_k_nearest_neighbors are:

def test_k_nearest_neighbors():
    iris = DataSet(name="iris")
    kNN = NearestNeighborLearner()
    kNN = kNN(iris, k=3)
    assert kNN([5, 3, 1, 0.1]) == "setosa"
    assert kNN([5, 3, 1, 0.1]) == "setosa"
    assert kNN([6, 5, 3, 1.5]) == "versicolor"
    assert kNN([7.5, 4, 6, 2]) == "virginica"

adityaXXX commented 6 years ago

As far as I have checked, the changes I have made do not interfere with any other code block. Is that okay?

antmarakis commented 6 years ago

That is not really consistent with the rest of the codebase, though. For learners we do learner = Learner(dataset); here you changed it to learner = Learner()(dataset).

I am not saying I dislike this; as a matter of fact, this is how I was originally thinking of structuring the codebase myself. This way, instead of passing the single size argument to cross-validation, we could pass a list and build the learners from it.
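Roughly, something like this (a hypothetical sketch, not the current learning.py code; cv_error is an assumed helper that returns the mean validation error of a fit function for a given size):

def cross_validation_wrapper(learner_factory, sizes, dataset, cv_error):
    """With a curried learner (factory() -> fit(dataset, size) -> predict),
    score every candidate size by cross-validation and train the final
    model with the best one."""
    fit = learner_factory()  # e.g. NearestNeighborLearner()
    best_size = min(sizes, key=lambda size: cv_error(fit, size, dataset))
    return fit(dataset, best_size), best_size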

Ultimately, it is up to @norvig, but I feel we shouldn't start changing the learners unless we know the details of cross_validation. I am again linking to this past issue.

adityaXXX commented 6 years ago

Could you please tell me what I should do next? Should I proceed further, @MrDupin and @norvig?

antmarakis commented 6 years ago

Sorry, this is something only Dr. Norvig and the GSoC mentors can help you with. You might want to contact them directly, since this is an issue that has been hovering over learning.py for some time.

LucianaMarques commented 5 years ago

Hi,

Does this still need help? I would certainly like to try.