Taylor and I have attempted to map out an analysis pipeline, which we are depicting in the attached figure. There are still some holes we need help filling related to the cross-validation step (hopefully @mdtdev can provide some assistance here). However, overall, the text processing, corpora, and general outline are all here.
BTW what software did you use to make this figure?
OmniGraffle
So CogAt is only incorporated on the non-stemmed dataset?
Yes, that is correct. We are going to test the feature-spaces (NBOW and CogAt) independently.
The above figure will need to be modified, as we will be performing NBOW vectorization within each fold of CV.
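For reference, here's a minimal sketch of what the within-fold vectorization could look like, assuming we wrap the vectorizer and classifier in an sklearn Pipeline. TfidfVectorizer, LogisticRegression, and the toy corpus are just stand-ins for whatever NBOW vectorizer, base learner, and data we actually use:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# toy corpus and binary labels, standing in for the real abstracts/labels
texts = [
    "fmri study of working memory",
    "eeg analysis of sleep",
    "fmri study of attention",
    "eeg study of memory",
]
labels = [1, 0, 1, 0]

nbow_clf = Pipeline([
    ("nbow", TfidfVectorizer()),   # vocabulary/IDF learned from training folds only
    ("clf", LogisticRegression()),
])

# cross_val_score refits the whole pipeline inside each fold, so no
# test-fold text leaks into the vectorization step
scores = cross_val_score(nbow_clf, texts, labels, cv=2, scoring="f1")
print(scores)
```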
Based on this paper (http://kt.ijs.si/DragiKocev/wikipage/lib/exe/fetch.php?media=2012pr_ml_comparison.pdf), we can use the corrected Friedman test for omnibus tests (e.g., are any of the feature sources significantly better?) in combination with the post-hoc Nemenyi test. I say we do the Friedman test, then the Nemenyi test (if the Friedman test is significant) for each of the comparisons that have more than two levels (i.e., classifiers and feature sources), and we just do the Nemenyi test for comparisons with two levels (i.e., feature spaces and boosting, if we end up doing the latter).
For any nonsignificant comparisons, we would recommend the simplest (e.g., abstracts for feature source) and most theoretically defensible (e.g., CogAt for feature space) of the options.
Granted, we know that the factors are not independent, so we would also show plots of the F1-scores by each factor and raise concerns about any potential interactions we might see.
Does anyone have any thoughts on this?
Comparing sources:
Factors-Combo | Abstract | Methods | Combined | Full |
---|---|---|---|---|
Space1-Classifier1-Fold1-Label1 | .1 | .2 | .4 | .3 |
Space1-Classifier1-Fold2-Label1 | .2 | .2 | .3 | .3 |
Space1-Classifier1-Fold1-Label2 | .3 | .2 | .2 | .3 |
Space1-Classifier1-Fold2-Label2 | .4 | .2 | .1 | .3 |
Space1-Classifier2-Fold1-Label1 | .1 | .2 | .4 | .3 |
Space1-Classifier2-Fold2-Label1 | .2 | .2 | .3 | .3 |
Space1-Classifier2-Fold1-Label2 | .3 | .2 | .2 | .3 |
Space1-Classifier2-Fold2-Label2 | .4 | .2 | .1 | .3 |
Space2-Classifier1-Fold1-Label1 | .1 | .2 | .4 | .3 |
Space2-Classifier1-Fold2-Label1 | .2 | .2 | .3 | .3 |
Space2-Classifier1-Fold1-Label2 | .3 | .2 | .2 | .3 |
Space2-Classifier1-Fold2-Label2 | .4 | .2 | .1 | .3 |
Space2-Classifier2-Fold1-Label1 | .1 | .2 | .4 | .3 |
Space2-Classifier2-Fold2-Label1 | .2 | .2 | .3 | .3 |
Space2-Classifier2-Fold1-Label2 | .3 | .2 | .2 | .3 |
Space2-Classifier2-Fold2-Label2 | .4 | .2 | .1 | .3 |
We would then compare across columns to determine if any of the sources (Abstract, Methods, Combined, and Full) perform significantly better than the others.
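A rough sketch of how that test sequence could run, assuming the table above lives in a 16x4 array and that we use scipy's Friedman test plus the Nemenyi implementation from the scikit-posthocs package (an assumed dependency). Note this is the plain chi-square Friedman test; the corrected (Iman-Davenport) version from the paper would need the F-statistic adjustment on top of this:

```python
import numpy as np
import scipy.stats as ss
import scikit_posthocs as sp  # assumed dependency for the Nemenyi test

# one row per space/classifier/fold/label combination, one column per source
# (Abstract, Methods, Combined, Full); random values stand in for real F1s
rng = np.random.default_rng(0)
f1_table = rng.uniform(0.1, 0.5, size=(16, 4))

stat, p = ss.friedmanchisquare(*f1_table.T)  # omnibus test across the four sources
if p < 0.05:
    # pairwise post-hoc comparisons between sources
    pairwise_p = sp.posthoc_nemenyi_friedman(f1_table)
    print(pairwise_p)
```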
That looks like a great paper. Granted, I’ve only skimmed the abstract at this point, but I agree that it will be helpful in determining which tests to use. I’ll try to read more carefully over the next few days. But broadly speaking, your plan seems sound…
Yes. Those are the tests from our 2013 paper. They are fairly general.
I believe you did a 4x7 Friedman test across corpora and dimensions. Are you okay with us doing a series of 1xN (or 9xN) tests instead of an overall 4x4x2x2x9 (source-by-classifier-by-space-by-boosting-by-dimension) test?
We must account for the thresholds applied in the Naive Bayes and Logistic Regression classifiers, either in the text (providing a defense of sklearn's default thresholds) or in the model (including the threshold in the grid search).
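If we go with the second option, one way to do it (purely a sketch; the wrapper class and its names are hypothetical, not existing code) is to expose the decision threshold as a tunable parameter so GridSearchCV can search over it alongside the base classifier's own parameters:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


class ThresholdedClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical wrapper exposing the decision threshold as a parameter."""

    def __init__(self, base=None, threshold=0.5):
        self.base = base
        self.threshold = threshold

    def fit(self, X, y):
        self.base_ = clone(self.base).fit(X, y)
        return self

    def predict(self, X):
        # label as positive whenever P(y=1) clears the tuned threshold
        return (self.base_.predict_proba(X)[:, 1] >= self.threshold).astype(int)


X, y = make_classification(n_samples=200, random_state=0)  # stand-in data
param_grid = {
    "threshold": [0.3, 0.4, 0.5, 0.6],
    "base__C": [0.1, 1.0, 10.0],  # base classifier parameters remain searchable
}
search = GridSearchCV(ThresholdedClassifier(base=LogisticRegression()),
                      param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_)
```

In the BR setting, a wrapper like this would sit around each per-label binary classifier.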
Here are some sample results from the CV (for one classifier/feature space/feature source combo, only 6 labels, and only 5 iterations). Output files include the predictions for each iteration (i.e., we take the predictions from each of the test folds and put them back together into one array), the selected classifier params for all of the folds for each iteration, and an F1 array with the score for each iteration, fold, and label. I've uploaded an example of each of these here. Please let me know if any of you think we need additional information.
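For anyone who wants to poke at the outputs programmatically, here's a toy sketch of the shapes I have in mind (file names, fold assignments, and the random "predictions" are all placeholders):

```python
import numpy as np
from sklearn.metrics import f1_score

n_iters, n_folds, n_labels, n_samples = 5, 10, 6, 200
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(n_samples, n_labels))

f1_array = np.zeros((n_iters, n_folds, n_labels))
for it in range(n_iters):
    y_pred = np.zeros_like(y_true)
    # stand-in for a shuffled 10-fold split of the sample indices
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for fold, test_idx in enumerate(folds):
        # placeholder predictions; the real ones come from the tuned classifiers
        y_pred[test_idx] = rng.integers(0, 2, size=(len(test_idx), n_labels))
        f1_array[it, fold] = f1_score(y_true[test_idx], y_pred[test_idx],
                                      average=None, zero_division=0)
    # predictions from all test folds stitched back into one array per iteration
    np.save(f"iter{it:02d}_predictions.npy", y_pred)
np.save("f1_by_iter_fold_label.npy", f1_array)  # shape: (iteration, fold, label)
```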
Here we can discuss any post-classification analyses @mriedel56 will perform for the upcoming paper. Feel free to edit this comment to add new analyses or respond to it with your thoughts.
We will run a separate nested CV for each combination of data source, feature space, dimension-wise feature boosting (i.e., yes or no), and classifier. The inner loop of the CV will handle hyperparameter tuning: the best parameters found in the inner loop will be used to train a model on the outer loop's training data. Repeating this procedure with shuffling will yield ~300 (30 trials * 10 outer folds) estimates of test error per combination, and the distributions of test errors will be compared in the analyses.
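A bare-bones sketch of one cell of that design (one source/space/boosting/classifier combination), assuming GridSearchCV for the inner loop, a BR-style OneVsRestClassifier as the model, and synthetic data in place of the real features:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.multiclass import OneVsRestClassifier

# synthetic stand-in for one combination's feature matrix and labels
X, y = make_multilabel_classification(n_samples=200, n_classes=6, random_state=0)

scores = []
for trial in range(30):  # 30 shuffled trials (reduce for a quick smoke test)
    outer = KFold(n_splits=10, shuffle=True, random_state=trial)
    for train_idx, test_idx in outer.split(X):  # 10 outer folds
        inner = GridSearchCV(  # inner loop: hyperparameter tuning only
            OneVsRestClassifier(LogisticRegression(max_iter=1000)),
            param_grid={"estimator__C": [0.1, 1.0, 10.0]},
            scoring="f1_macro", cv=5)
        inner.fit(X[train_idx], y[train_idx])  # refits on outer-train with best params
        y_pred = inner.predict(X[test_idx])
        scores.append(f1_score(y[test_idx], y_pred, average="macro"))

# ~300 test-performance estimates for this combination
print(np.mean(scores), np.std(scores))
```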
Comparison of Data Sources
Compare performance of classifiers using abstract, full text, methods section, and methods/abstract combination.
Comparison of Classifiers
Compare performance of different classifiers. All classification will be done using BR (binary relevance) within sklearn. Base learners will include:
Comparison of Feature Spaces
Compare performance of classifiers using different feature spaces and combinations of feature spaces. We'll only be including three feature spaces:
Comparison of Dimension-wise Feature Boosting
Compare classifiers that boost dimension-wise classifier performance by predictions from other dimensions against those that don't.
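As a concrete (if approximate) stand-in for the boosted condition, sklearn's ClassifierChain gives each label's classifier the earlier labels as extra features (true labels at training time, predicted labels at test time), which could serve as a baseline comparison against plain BR until we pin down the exact boosting scheme:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain

# synthetic multilabel data standing in for the real features/labels
X, y = make_multilabel_classification(n_samples=200, n_classes=6, random_state=0)

plain_br = OneVsRestClassifier(LogisticRegression(max_iter=1000))
# chain: each label's classifier also sees the earlier labels as features
chained = ClassifierChain(LogisticRegression(max_iter=1000),
                          order="random", random_state=0)

for name, model in [("plain BR", plain_br), ("chained", chained)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    print(name, scores.mean())
```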
Possible Extensions