adityaXXX opened this issue 6 years ago
About cross-validation, I am not sure at this stage we can do much with the code, unless we rewrite the function. See this.
I have not heard of GridSearchCV before; it's a learning algorithm, right? If it is not in the book, I am not sure whether it should be added to the repository or not. I would only add it if it doesn't need much code (which more or less means it doesn't need many side functions/classes).
So can I add tests for Cross-Validation and include a notebook section for it, i.e., provide examples of its use in the notebook?
@MrDupin GridSearchCV and RandomSearchCV are methods that find optimal hyperparameters for a learning algorithm by iterating through different sets of hyperparameters and measuring the performance of the learning algorithm (on the cross-validation dataset) at each step.
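For context, here is a minimal sketch of how GridSearchCV is typically used with scikit-learn (the dataset and parameter values below are only illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try every combination of these candidate hyperparameter values.
param_grid = {'n_neighbors': [1, 3, 5, 7], 'weights': ['uniform', 'distance']}

# Each combination is fitted and scored with 5-fold cross-validation.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # the combination with the best mean CV score
print(search.best_score_)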
@adityaXXX as these are functions from the Scikit-Learn library and not from the book, it is advisable not to replicate them in this repository. The cross-validation function would need to be rewritten, without diverging from the pseudocode, for it to be useful, but at this point I am not sure what should be done.
So should I work on this and add tests for it, @MrDupin? @ad71?
On the Cross-Validation side, you can go right ahead. I wouldn't do the GridSearchCV stuff just yet though; @norvig needs to weigh in on this too.
On Cross-Validation, just use the code that is already in place. Don't change the learners or the function, since they are all in accordance with the pseudocode. You will find that it is a bit difficult to use cross-validation in its current state, because of its limitations. I think the only algorithm that can be used with it is the K-Nearest Neighbors learner, so start with that first.
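A rough usage sketch (not verbatim repository code), assuming the cross_validation(learner, size, dataset, k, trials) signature that appears in the failing test further down this thread, and that it returns the averaged training and validation errors:

from learning import DataSet, NearestNeighborLearner, cross_validation

iris = DataSet(name="iris")

# size=3 is the hyperparameter forwarded to the learner (the number of
# neighbors), evaluated over k=10 folds and averaged across trials=5 runs.
errT, errV = cross_validation(NearestNeighborLearner, 3, iris, 10, 5)
print(errT, errV)  # average training and validation error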
Hey @MrDupin, after adding a sample test for cross_validation and making some changes to the test_k_nearest_neighbors function and the NearestNeighborLearner classifier (changes in the Jupyter notebook are yet to be made), I got the following result:
aditya@aditya-Inspiron-3558:~/Downloads/aima-python-master/tests$ py.test
============================= test session starts ==============================
platform linux -- Python 3.5.2, pytest-3.4.0, py-1.5.2, pluggy-0.6.0
rootdir: /home/aditya/Downloads/aima-python-master/tests, inifile: pytest.ini
collected 226 items
test_agents.py ....... [ 3%]
test_csp.py ........................... [ 15%]
test_games.py ... [ 16%]
test_knowledge.py ........ [ 19%]
test_learning.py ...................F [ 28%]
test_logic.py .................................... [ 44%]
test_mdp.py .... [ 46%]
test_nlp.py .................... [ 55%]
test_planning.py ......... [ 59%]
test_probability.py ................... [ 67%]
test_rl.py ... [ 69%]
test_search.py ................. [ 76%]
test_text.py ............... [ 83%]
test_utils.py ...................................... [100%]
=================================== FAILURES ===================================
____________________________ test_cross_validation _____________________________
    def test_cross_validation():
        iris = DataSet(name="iris")
        kNN = NearestNeighborLearner()
        errT, errV = cross_validation(kNN, 3, iris, 15, 5)
>       assert errV < 0.2
E       assert 0.24133333333333332 < 0.2

test_learning.py:227: AssertionError
==================== 1 failed, 225 passed in 25.05 seconds =====================
The changes I made in NearestNeighborLearner were simply to decorate the decorator.
Is the result satisfactory?
Should I proceed to add a notebook section?
What kind of changes did you make to Nearest Neighbors? I wouldn't like much to be changed there. As I said previously, cross-validation is currently a bit antiquated in the pseudocode and is not very useful as is, so we shouldn't be working around it.
The changes I made to NearestNeighborLearner are as follows:

def NearestNeighborLearner():
    """k-NearestNeighbor: the k nearest neighbors vote."""
    def fit(dataset, k=1):
        def predict(example):
            """Find the k closest items, and have them vote for the best."""
            best = heapq.nsmallest(k, ((dataset.distance(e, example), e) for e in dataset.examples))
            return mode(e[dataset.target] for (d, e) in best)
        return predict
    return fit
The corresponding changes in test_k_nearest_neighbors are:

def test_k_nearest_neighbors():
    iris = DataSet(name="iris")
    kNN = NearestNeighborLearner()
    kNN = kNN(iris, k=3)
    assert kNN([5, 3, 1, 0.1]) == "setosa"
    assert kNN([5, 3, 1, 0.1]) == "setosa"
    assert kNN([6, 5, 3, 1.5]) == "versicolor"
    assert kNN([7.5, 4, 6, 2]) == "virginica"
As far as I have checked, the changes I have made do not interfere with any other code block. Is this okay?
That is not really consistent with the rest of the codebase, though. For learners we do learner = Learner(dataset); here you changed it to learner = Learner()(dataset).
I am not saying I dislike this; as a matter of fact, this is how I was originally thinking of structuring the codebase myself. This way, instead of passing the single argument size to cross-validation, we could pass a list and then build the learners from it, as in the sketch below.
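A hypothetical sketch of that idea (cross_validation_over_sizes is an invented name, and the cross_validation signature is the one assumed earlier in this thread, not confirmed repository code):

def cross_validation_over_sizes(learner, sizes, dataset, k=10, trials=1):
    """Evaluate one learner per candidate `size` and keep the best.

    `learner` would be a curried factory such as NearestNeighborLearner(),
    and `cross_validation(learner, size, dataset, k, trials)` is assumed
    to return (training_error, validation_error)."""
    best_size, best_err = None, float('inf')
    for size in sizes:
        _, errV = cross_validation(learner, size, dataset, k, trials)
        if errV < best_err:  # lowest validation error wins
            best_size, best_err = size, errV
    return best_size, best_err

# e.g. pick the best number of neighbors for k-NN on iris:
# best_k, err = cross_validation_over_sizes(NearestNeighborLearner(), [1, 3, 5, 7], iris)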
Ultimately, it is up to @norvig, but I feel we shouldn't start changing the learners unless we know the details of cross_validation. I am again linking to this past issue.
Could you please tell me what I should do next? Should I proceed further, @MrDupin and @norvig?
Sorry, this is something only Dr. Norvig and the GSoC mentors can help you with. You might want to contact them directly, since this is an issue that has been hovering over learning.py for some time.
Hi,
Does this still need help? I would certainly like to try.
Hey @norvig, @MrDupin: currently the learning.ipynb notebook does not have examples of cross-validation. We could apply cross-validation to, for example, the perceptron classifier to show how its accuracy can be further improved, and similarly for some of the other machine learning classifiers. We could also include code for GridSearchCV and then for RandomSearchCV, to show how the latter is computationally cheaper than GridSearchCV.
Should I go for it?
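For what it's worth, a minimal sketch of the kind of comparison such a notebook section could make, using scikit-learn (whose class is actually named RandomizedSearchCV; the parameter values below are only illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
params = {'n_neighbors': list(range(1, 31)), 'weights': ['uniform', 'distance']}

# Grid search fits and cross-validates all 60 combinations ...
grid = GridSearchCV(KNeighborsClassifier(), params, cv=5).fit(X, y)

# ... while randomized search samples only n_iter of them, which is why it
# is computationally cheaper on large hyperparameter spaces.
rand = RandomizedSearchCV(KNeighborsClassifier(), params, n_iter=10,
                          cv=5, random_state=0).fit(X, y)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)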