Cross-validation and pruning

aldro61 / mmit

Regression trees for interval censored output data

https://aldro61.github.io/mmit/

GNU General Public License v3.0

7 stars 7 forks source link

Cross-validation and pruning #25

Closed parismita closed 6 years ago

aldro61 commented 6 years ago

@parismita I did a quick code review. Did not dive into the details. Will continue tomorrow. Please add unit tests and compare the results of your pruning functions to those of the python implementation.

parismita commented 6 years ago

ok, i'll add the tests and examples asap.....

aldro61 commented 6 years ago

Do you think it would be better to use namespacing when calling functions inside mmit? For instance, in mmit.cv, you call fit_and_score. What if another user has defined fit_and_score? It might be more robust to use mmit::fit_and_score. However, I'm not familiar with the best practices. This is just my intuition.

parismita commented 6 years ago

Nope, its automatically taken care of by NAMESPACE file, I just checked defining some global function with same name, but the functions used by mmit already use mmit as namespace.

aldro61 commented 6 years ago

Actions items:

Create a benchmark of a few datasets (at least 10) using git.io/mmit-data and compare the MSE of the python and R implementations on a testing set (50% train - 50% test)
Fix the error message in fit_and_score (fold alphas)
Reevaluate objectives once the benchmark results are done.
Check if alpha=inf is often the best for other datasets (would indicate that something is wrong)

aldro61 commented 6 years ago

Ok, I reviewed the benchmark code. It looks fine (see my comments). I noticed that you don't shuffle the data, which could be a problem if the examples are ordered in a particular way, but I think it is ok here, since we are just trying to compare both implementations on the same data.

Please do the comparison with the largest datasets only. The models and accuracies obtained for small datasets are likely to be very unstable.

parismita commented 6 years ago

I didnt shuffle them, so that same data goes into training for both implementations so that trees have higher probability to match