biolab / orange3

🍊 :bar_chart: :bulb: Orange: Interactive data analysis
https://orangedatamining.com
Other
4.87k stars 1.02k forks source link

Help with implementation of new Test & Scores function (Independent Validation) #3471

Closed flave95 closed 5 years ago

flave95 commented 5 years ago

Hallo Ladys and Gentleman, as already dispalyed in the issue name, you can see that I need a bit of help for the new function IV. For those who are not familiar with the topic or former issues about Independent Validation I will explain the "What about`s"/ "Why´s" and "How´s" down below. If you know it already please just ignore it.

In my bachelor thesis i'm working on an implementation of the new validation method "Independent Validation on Classifier results" from Prof. Dr. von Oertzen and Kim Bommae in Orange Canvas. In short, they both proofed that CV is baised and statistically not independent. https://link.springer.com/article/10.3758%2Fs13428-017-0880-z Definition IV: "The IV procedure takes an initial training set of data, which is never used for testing, and a test size, with which a classifier is tested at each validation. Once a classifier is trained using the initial training set, the classifier is tested using a test set of the test size, randomly chosen from the rest of data. The used test set is combined to the training set for the next validation, and the classifier is trained using the increased training set and tested using a new test set. Therefore, each validation is completely independent in the IV procedure."

My idea is to implement a function which should set the self.indices up in some manner as LOO but a in order of the IV definition. See the example in Expected behavior.

Now I have two issues:

  1. The IV does not calculate properly. Needs very big amount of data. At least 100 Datapoints. I think its because I still could not manage to set up the Arrays propperly. The Error says thats its to much values or not enough ( got 1 - need 2). If i enumerate the self.indices it shows at the begining only 1 value. With only 1 Value its obviously that the calculation fails. But I cant dubug it.

My aim is to achieve an structure as following: [ ( array ( [ trainset ] ), array ( [ test ] ) ) ] thought that we would have 2 vaules.

  1. Im not a Coder. Im a psychologist with a hobby in coding. Because of that i couldnt code the whole function in a good, efficient and "pretty" way. Please help me with your coding exp. to make it faster and better understandable for other Coder.
Orange version 3.15.0
Expected behavior

DEBUG: Printing self.indices for IV. [(array([0,1, 2, 3, 4]), array([5])), (array([0, 1, 2, 3, 4, 5]), array([6])), (array([0, 1, 2, 3, 4, 5, 6 ]), array([7])), (array([0, 1, 2, 3, 4, 5, 6, 7]), array([8])), (array([0, 1, 2, 3, 4, 5, 6, 7, 8]), array([9]))]

Actual behavior

self.indices [([0, 1], [2]), ([0, 1, 2], [3]), ([0, 1, 2, 3], [4]), ([0, 1, 2, 3, 4], [5]), ([0, 1, 2, 3, 4, 5], [6]), ([0, 1, 2, 3, 4, 5, 6], [7]), ([0, 1, 2, 3, 4, 5, 6, 7], [8]), ([0, 1, 2, 3, 4, 5, 6, 7, 8], [9]), [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [10])]

testing error (in __task_complete):
c:\program files (x86)\orange3\orange3-master\Orange\evaluation\testing.py", line 323, in <genexpr>
    for fold_i, (train_i, test_i) in enumerate(self.indices))
ValueError: too many values to unpack (expected 2)
Additional info (worksheets, data, screenshots, ...)

MY code `class IndependentValidation(Results): """ Independent Validation Testing split data into test_data, train_data. split test_data in n_splits = number of observations ---Loop--- for each n_split_test_data do: train on train_data predict and test on 1 observation = (1_split_test_data) note whrite or wrong prediction in results (self.results?) add used 1_split_test_data to train_data (re)move used 1_split_test_data from test_data do until test_data=0 / while test_data!=0

"""
score_by_folds = False

def __init__(self, data, learners, store_data=False, store_models=False, preprocessor=None,
            callback=None, n_jobs=1, train_size=None, test_size=0.9, random_state=42):
    print(data)

    self.train_size = train_size
    self.test_size = test_size
    self.random_state = random_state

    super().__init__(data, learners=learners, store_data=store_data, store_models=store_models,
                     preprocessor=preprocessor, callback=callback, n_jobs=n_jobs)

def setup_indices(self, train_data, test_data):
    print(self)
    print("__________test1___________")
    train_data, test_data = skl.train_test_split(self.data, test_size=0.8)
    #shuffle = True, stratify = None
    tt_lst_n = list(range(_num_samples(train_data)))
    test_data1 = list(range(len(tt_lst_n), _num_samples(test_data)))
    print("1", (train_data))
    print("2", test_data)
    print("3", tt_lst_n)
    print("4", test_data1)
    StratArr = []
    """StratArr = (array 1[ tt-liste =( [train_data +( 0+ sliceX)]+[test sliceX]), tt-liste2 = ([train_data+x]+[x+1])]"""
    lngth = len(test_data1)
    while lngth != 1:
        x, test_data1 = test_data1[+0] , test_data1[::-1]
        y, test_data1 = test_data1[+0], test_data1[:-1]
        z, test_data1 = test_data1[+0], test_data1[::-1]
        StratArr.insert(len(StratArr), ((tt_lst_n), ([x])))
        x = list([x])
        tt_lst_n = tt_lst_n + x
        lngth = lngth - 1
    StratArr.insert(len(StratArr), ((tt_lst_n), (test_data1)))
    self.indices = StratArr
    print("self.indices", self.indices)`
markotoplak commented 5 years ago

Well, you can certainly implement IV for yourself, but currently I do not see any reason for IV to go into the official Orange distribution.

The paper you link does not describe that method well and, given current information, I do not see any reason why IV would better than just splitting data into a training and testing set once. Unless there is a strong case (which you have not presented yet) why IV is better than the current methods, we will not merge it into Orange.

Feel free to comment, but as this issue does not describe a bug in Orange, I am closing the issue for now.