amorag closed this issue 9 years ago.
You should specify which datasets "every dataset" refers to, so I can see exactly how many experiments I need to run and plan accordingly.
For the sake of a good comparison, I think the datasets (and methods) should be the same as in ECTA, but with the selected features.
Thus, you should apply the classification methods to the data referred to in Tables 3 and 4 of the ECTA paper. Namely, they are:
METHODS:
DATASETS:
If I'm not wrong, these are the latest datasets (before the newly generated features).
In my opinion, you should start with the unbalanced data, since its results are a bit better than those for the balanced ones.
If there is enough time, then go for the balanced ones.
Ok... we should (really) be realistic here.
What you're asking now is to do almost all the experiments I carried out over one year, but in a week and a half.
That's impossible for one person, and even for two, given all the other paper tasks that remain to be completed.
I don't know what @zeinebchelly and @JJ think. But it's an important matter.
Ok, but your work also included the generation of the datasets, which was a difficult task. Now all that work is done. I know these are several experiments, but not the same amount that you did in one year.
Let's go then for the unbalanced data. If I'm not wrong, the experiments would be:
5 methods x (3 random distributions in 80-20 + 1 sequential distribution in 90-10) + 5 methods x (3 random distributions in 90-10 + 1 sequential distribution in 90-10)
Total: 40 experiments or runs
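For reference, the 40 figure can be read as 5 methods × 2 split ratios × (3 random + 1 sequential) distributions; a minimal sketch of that count (the method names are placeholders, not the actual five classifiers):

```java
// Hypothetical enumeration of the proposed runs: 5 methods, two split ratios
// (80-20 and 90-10), and for each ratio 3 random splits plus 1 sequential one.
public class CountRuns {
    public static void main(String[] args) {
        String[] methods = {"m1", "m2", "m3", "m4", "m5"}; // placeholders for the 5 chosen methods
        String[] ratios  = {"80-20", "90-10"};
        int runs = 0;
        for (String method : methods) {
            for (String ratio : ratios) {
                runs += 3;   // three random train/test distributions
                runs += 1;   // one sequential (time-ordered) distribution
            }
        }
        System.out.println("Total runs: " + runs); // prints 40
    }
}
```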
Could it be done? We need at least these results to improve the paper with new content.
Thank you!
Dear ALL,
I hope you are doing well :)
Thanks again for this collaboration; I am glad to be part of it.
I have just discussed the experiments with Paloma and I need more details, please. Antonio, I am wondering if you can explain to me the reason for this specific choice of experiments (the 40 runs). That will help me get a clearer vision of the methodology adopted in the experimental setup section.
I will be waiting for your email.
Thank you :)
40 experiments (which, by the way, are not only experiments, but also generating the training/test sets and so on) is ok, but it is still not clear whether I should take care of keeping the core domain clearly separated or not.
I don't think all methods should be tested. What's the objective? Why do you want to test them all?
I think they're the same top 5 we obtained for the ECTA paper, that's why... (me thinks)
Dear all, some clarifications from my point of view.
The aim was to extend the ECTA paper (which classified URL data using rule-based and tree-based methods). This would be done using a feature selection process (by Zeineb).
JJ, we are not testing all the methods, but the chosen 5 (the best), as Paloma said. Moreover, we have reduced the experiments to a third. The objective is to test how a feature selection process could improve the obtained results, and which features could be removed without affecting the classification accuracies.
Zeineb, Paloma did a set of experiments considering unbalanced data, and also balanced data obtained by applying both over- and undersampling methods. She created several training/test splits, namely 80-20% and 90-10% in this paper. Then, within every subset, we considered a sequential distribution of the patterns (they are sorted in time) and also a random distribution. For the random distribution, she generated three different pairs of files in order to obtain the average accuracy.
In this paper I'm proposing to take the best set of results (those obtained for the unbalanced data) and apply the feature selection. The selection can only be applied to the global unbalanced dataset in order to choose the best features.
After this, the classification process should be repeated with the same methods and the same distributions. However, as we tested yesterday, the datasets do not have to be generated again. We can use the same ones we have, choose in Weka which features to keep or remove in the preprocessing menu once a dataset is loaded, and then apply the 5 methods to that dataset.
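Just to make that step concrete, the same operation can also be done through the Weka Java API instead of the GUI; a minimal sketch (the file name and attribute indices are placeholders, and J48 stands in for any of the 5 methods):

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class FilterAndClassify {
    public static void main(String[] args) throws Exception {
        // Load one of the existing datasets (placeholder file name)
        Instances data = DataSource.read("unbalanced_90_10_train.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Drop the attributes discarded by the feature selection step
        Remove remove = new Remove();
        remove.setAttributeIndices("2,4,7"); // placeholder indices of removed features
        remove.setInputFormat(data);
        Instances filtered = Filter.useFilter(data, remove);

        // Apply one of the 5 classification methods to the filtered data
        J48 classifier = new J48();
        classifier.buildClassifier(filtered);
        System.out.println(classifier);
    }
}
```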
I think this proposal is the minimum needed to reach some well-founded conclusions, since we can check which techniques benefit most from the feature reduction, whether it improves the classification accuracies, and so on.
Let me know your opinion.
Cheers, Antonio
OK
JJ
After this, the classification process should be repeated with the same methods and the same distributions. However, as we tested yesterday, the datasets do not have to be generated again. We can use the same ones we have, choose in Weka which features to keep or remove in the preprocessing menu once a dataset is loaded, and then apply the 5 methods to that dataset.
We can use this for cross-validation, but have you tested it with separate training and test files? When you use those, you load the training file and, ok, you can choose the features. But does it automatically take the correct ones from the test file you give to Weka?
Edit: Answering myself, yes, it's possible.
I have tried it and no: if you select the features, then you cannot use your own test file that still contains the whole feature set.
Thus, there are some options:
1) Generate the datasets again with the selected features.
2) Generate just the test files and filter the training ones in Weka (see the sketch below).
3) (If there is not enough time) use the whole dataset and do the splitting in Weka (you can split the files automatically by setting a percentage in the classification menu). The problem with this, JJ and Zeineb, is that we could not control the sequential distribution and the split could include duplicated URLs in the training and test files, which was one of the reasons for doing this with a script or external program.
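Regarding option 2 and JJ's question above, one way to keep the training and test files consistent is Weka's batch filtering: initialise a single Remove filter on the training data and apply the same filter object to the test data. A minimal sketch, with placeholder file names and attribute indices, and J48 as a stand-in for one of the 5 methods:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class BatchFilterSplit {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train_80_20.arff"); // placeholder file names
        Instances test  = DataSource.read("test_80_20.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // One filter instance, initialised on the training data and then applied
        // to both files, keeps the attribute sets of train and test consistent.
        Remove remove = new Remove();
        remove.setAttributeIndices("2,4,7"); // placeholder indices of removed features
        remove.setInputFormat(train);
        Instances filteredTrain = Filter.useFilter(train, remove);
        Instances filteredTest  = Filter.useFilter(test, remove);

        // Train on the filtered training file and evaluate on the filtered test file
        J48 classifier = new J48();
        classifier.buildClassifier(filteredTrain);
        Evaluation eval = new Evaluation(filteredTrain);
        eval.evaluateModel(classifier, filteredTest);
        System.out.println(eval.toSummaryString());
    }
}
```

If I remember correctly, the same can be done from the command line with the filter's batch mode (-b option).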
Paloma, regarding your question about whether to use the data with the core domain separated or not, I think we should use the same data as in the ECTA paper, even if we now have 'better' data.
Thank you.
Dear ALL,
Paloma and I have just finished all the experiments successfully!
We have obtained interesting results with rough sets.
I am wondering whether you think it will be OK to focus only on rough sets for the results, or whether we should add other feature selection techniques and compare them to rough sets?
What do you think?
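In case we do add other techniques, one lightweight baseline that Weka already ships is correlation-based subset selection (CFS); the sketch below is only illustrative, it is not the rough set method we used, and the file name is a placeholder:

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AlternativeSelection {
    public static void main(String[] args) throws Exception {
        // Global unbalanced dataset (placeholder file name)
        Instances data = DataSource.read("global_unbalanced.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // CFS subset evaluation with a greedy forward search
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());
        selector.setSearch(new GreedyStepwise());
        selector.SelectAttributes(data);

        // Print the selected attribute indices and evaluation details
        System.out.println(selector.toResultsString());
    }
}
```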
For me it's OK to focus on rough sets.
Antonio, you lost me. Why can't you use the same file as before? It's just a matter of deleting the features you don't use.
Great! Please add the results to the paper when you can. ;)
JJ, I was trying to answer a question from Paloma. I did a test and it didn't work (for me), so I offered some other options to avoid regenerating the files, because it's not as easy as it seems. However, Paloma managed to do it, so in the end there is no problem and we can just remove the non-selected features from the files.
Cheers.
Antonio, the results are already there and the issue is closed...
Sorry, I didn't check the paper.
Thank you.
Could you please write a couple of paragraphs describing/comparing the results?
Now there is just the text: "And now we see in Table 5 and Table 6 that our results are awesome and your argument is invalid."
Who are you referring to with 'our' and 'your'?
The results are very good, excellent, but you should also include the list of selected features, so that we can draw some additional conclusions from them.
Ok...
There's already issue #3 for comparing results, isn't there? Also, that sentence was obviously a joke. I put it in just to force the tables to appear in the PDF by referencing them.
The idea in that issue was to compare the classifiers themselves, not the classification results.
As written in the first comment, try to compare the time needed to build the model, the number of rules or the size of the trees generated, and any other useful parameter that could improve after the feature selection.
One of the main advantages of using feature selection is the improvement in these factors (mainly the computational time, of course), so with this comparison we can show that this has actually happened.
EDIT: I also wrote about accuracy in that issue, which could be read as meaning that it should be compared there. However, that was only intended to introduce the aim of the proposed comparisons.
And create two tables with the results: 1) the selected features, and 2) the classification accuracies per method with the new features over every dataset (a table similar to the existing one).
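For the build-time and model-size comparison, a minimal sketch of how those figures could be collected (J48 is only a stand-in for one of the 5 methods, and the file names are placeholders):

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ModelCostComparison {
    public static void main(String[] args) throws Exception {
        // Compare the full-feature and selected-feature training files (placeholder names)
        for (String file : new String[]{"train_all_features.arff", "train_selected_features.arff"}) {
            Instances train = DataSource.read(file);
            train.setClassIndex(train.numAttributes() - 1);

            J48 tree = new J48(); // stand-in for one of the 5 methods
            long start = System.currentTimeMillis();
            tree.buildClassifier(train);
            long elapsed = System.currentTimeMillis() - start;

            // Report build time and model size measures exposed by J48
            System.out.println(file + ": build time " + elapsed + " ms, "
                    + (int) tree.measureNumRules() + " rules, tree size "
                    + (int) tree.measureTreeSize());
        }
    }
}
```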