Taylor and I have attempted to map out an analysis pipeline, which we are depicting in the attached figure. There are still some holes we need help filling related to the cross-validation step (hopefully @mdtdev can provide some assistance here). However, overall, the text processing, corpora, and general outline are all here.
BTW what software did you use to make this figure?
OmniGraffle
So CogAt is only incorporated on the non-stemmed dataset?
Yes, that is correct. We are going to test the feature-spaces (NBOW and CogAt) independently.
The above figure will need to be modified, as we will be performing NBOW vectorization within each fold of CV.
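For reference, here's a minimal sketch of what the within-fold vectorization could look like, assuming we wrap the vectorizer and classifier in an sklearn Pipeline. TfidfVectorizer, LogisticRegression, and the toy corpus are just stand-ins for whatever NBOW vectorizer, base learner, and data we actually use:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# toy corpus and binary labels, standing in for the real abstracts/labels
texts = [
    "fmri study of working memory",
    "eeg analysis of sleep",
    "fmri study of attention",
    "eeg study of memory",
]
labels = [1, 0, 1, 0]

nbow_clf = Pipeline([
    ("nbow", TfidfVectorizer()),   # vocabulary/IDF learned from training folds only
    ("clf", LogisticRegression()),
])

# cross_val_score refits the whole pipeline inside each fold, so no
# test-fold text leaks into the vectorization step
scores = cross_val_score(nbow_clf, texts, labels, cv=2, scoring="f1")
print(scores)
```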
Based on this paper (http://kt.ijs.si/DragiKocev/wikipage/lib/exe/fetch.php?media=2012pr_ml_comparison.pdf), we can use the corrected Friedman test for omnibus tests (e.g., are any of the feature sources significantly better?) in combination with the post-hoc Nemenyi test. I say we do the Friedman test, then the Nemenyi test (if the Friedman test is significant) for each of the comparisons that have more than two levels (i.e., classifiers and feature sources), and we just do the Nemenyi test for comparisons with two levels (i.e., feature spaces and boosting, if we end up doing the latter).
For any nonsignificant comparisons, we would recommend the simplest (e.g., abstracts for feature source) and most theoretically defensible (e.g., CogAt for feature space) of the options.
Granted, we know that the factors are not independent, so we would also show plots of the F1-scores by each factor and raise concerns about any potential interactions we might see.
Does anyone have any thoughts on this?
Comparing sources:
Factors-Combo | Abstract | Methods | Combined | Full |
---|---|---|---|---|
Space1-Classifier1-Fold1-Label1 | .1 | .2 | .4 | .3 |
Space1-Classifier1-Fold2-Label1 | .2 | .2 | .3 | .3 |
Space1-Classifier1-Fold1-Label2 | .3 | .2 | .2 | .3 |
Space1-Classifier1-Fold2-Label2 | .4 | .2 | .1 | .3 |
Space1-Classifier2-Fold1-Label1 | .1 | .2 | .4 | .3 |
Space1-Classifier2-Fold2-Label1 | .2 | .2 | .3 | .3 |
Space1-Classifier2-Fold1-Label2 | .3 | .2 | .2 | .3 |
Space1-Classifier2-Fold2-Label2 | .4 | .2 | .1 | .3 |
Space2-Classifier1-Fold1-Label1 | .1 | .2 | .4 | .3 |
Space2-Classifier1-Fold2-Label1 | .2 | .2 | .3 | .3 |
Space2-Classifier1-Fold1-Label2 | .3 | .2 | .2 | .3 |
Space2-Classifier1-Fold2-Label2 | .4 | .2 | .1 | .3 |
Space2-Classifier2-Fold1-Label1 | .1 | .2 | .4 | .3 |
Space2-Classifier2-Fold2-Label1 | .2 | .2 | .3 | .3 |
Space2-Classifier2-Fold1-Label2 | .3 | .2 | .2 | .3 |
Space2-Classifier2-Fold2-Label2 | .4 | .2 | .1 | .3 |
We would then compare across columns to determine if any of the sources (Abstract, Methods, Combined, and Full) perform significantly better than the others.
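A rough sketch of how that test sequence could run, assuming the table above lives in a 16x4 array and that we use scipy's Friedman test plus the Nemenyi implementation from the scikit-posthocs package (an assumed dependency). Note this is the plain chi-square Friedman test; the corrected (Iman-Davenport) version from the paper would need the F-statistic adjustment on top of this:

```python
import numpy as np
import scipy.stats as ss
import scikit_posthocs as sp  # assumed dependency for the Nemenyi test

# one row per space/classifier/fold/label combination, one column per source
# (Abstract, Methods, Combined, Full); random values stand in for real F1s
rng = np.random.default_rng(0)
f1_table = rng.uniform(0.1, 0.5, size=(16, 4))

stat, p = ss.friedmanchisquare(*f1_table.T)  # omnibus test across the four sources
if p < 0.05:
    # pairwise post-hoc comparisons between sources
    pairwise_p = sp.posthoc_nemenyi_friedman(f1_table)
    print(pairwise_p)
```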
That looks like a great paper. Granted, I’ve only skimmed the abstract at this point, but I agree that it will be helpful in determining which tests to use. I’ll try to read more carefully over the next few days. But broadly speaking, your plan seems sound…
Yes. Those are the tests from our 2013 paper. They are fairly general.
I believe you did a 4x7 Friedman test across corpora and dimensions. Are you okay with us doing a series of 1xN (or 9xN) tests instead of an overall 4x4x2x2x9 (source-by-classifier-by-space-by-boosting-by-dimension) test?
We must account for the thresholds applied in the Naive Bayes and Logistic Regression classifiers, either in the text (providing a defense of sklearn's default thresholds) or in the model (including the threshold in the grid search).
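If we go with the second option, one way to do it (purely a sketch; the wrapper class and its names are hypothetical, not existing code) is to expose the decision threshold as a tunable parameter so GridSearchCV can search over it alongside the base classifier's own parameters:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


class ThresholdedClassifier(BaseEstimator, ClassifierMixin):
    """Hypothetical wrapper exposing the decision threshold as a parameter."""

    def __init__(self, base=None, threshold=0.5):
        self.base = base
        self.threshold = threshold

    def fit(self, X, y):
        self.base_ = clone(self.base).fit(X, y)
        return self

    def predict(self, X):
        # label as positive whenever P(y=1) clears the tuned threshold
        return (self.base_.predict_proba(X)[:, 1] >= self.threshold).astype(int)


X, y = make_classification(n_samples=200, random_state=0)  # stand-in data
param_grid = {
    "threshold": [0.3, 0.4, 0.5, 0.6],
    "base__C": [0.1, 1.0, 10.0],  # base classifier parameters remain searchable
}
search = GridSearchCV(ThresholdedClassifier(base=LogisticRegression()),
                      param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_)
```

In the BR setting, a wrapper like this would sit around each per-label binary classifier.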
Here are some sample results from the CV (for one classifier/feature space/feature source combo, only 6 labels, and only 5 iterations). Output files include the predictions for each iteration (i.e., we take the predictions from each of the test folds and put them back together into one array), the selected classifier params for all of the folds for each iteration, and an F1 array with the score for each iteration, fold, and label. I've uploaded an example of each of these here. Please let me know if any of you think we need additional information.
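For anyone who wants to poke at the outputs programmatically, here's a toy sketch of the shapes I have in mind (file names, fold assignments, and the random "predictions" are all placeholders):

```python
import numpy as np
from sklearn.metrics import f1_score

n_iters, n_folds, n_labels, n_samples = 5, 10, 6, 200
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(n_samples, n_labels))

f1_array = np.zeros((n_iters, n_folds, n_labels))
for it in range(n_iters):
    y_pred = np.zeros_like(y_true)
    # stand-in for a shuffled 10-fold split of the sample indices
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for fold, test_idx in enumerate(folds):
        # placeholder predictions; the real ones come from the tuned classifiers
        y_pred[test_idx] = rng.integers(0, 2, size=(len(test_idx), n_labels))
        f1_array[it, fold] = f1_score(y_true[test_idx], y_pred[test_idx],
                                      average=None, zero_division=0)
    # predictions from all test folds stitched back into one array per iteration
    np.save(f"iter{it:02d}_predictions.npy", y_pred)
np.save("f1_by_iter_fold_label.npy", f1_array)  # shape: (iteration, fold, label)
```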
Here we can discuss any post-classification analyses @mriedel56 will perform for the upcoming paper. Feel free to edit this comment to add new analyses or respond to it with your thoughts.
We will run a separate nested CV for each combination of data source, feature space, dimension-wise feature boosting (i.e., yes or no), and classifier. The inner loop of the CV will handle hyperparameter tuning: the best parameters found in the inner loop will be used to train a model on the outer loop's training data. Repeating this procedure with shuffling will yield ~300 (30 trials * 10 outer folds) estimates of test error per combination, and the distributions of test errors will be compared in the analyses.
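A bare-bones sketch of one cell of that design (one source/space/boosting/classifier combination), assuming GridSearchCV for the inner loop, a BR-style OneVsRestClassifier as the model, and synthetic data in place of the real features:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.multiclass import OneVsRestClassifier

# synthetic stand-in for one combination's feature matrix and labels
X, y = make_multilabel_classification(n_samples=200, n_classes=6, random_state=0)

scores = []
for trial in range(30):  # 30 shuffled trials (reduce for a quick smoke test)
    outer = KFold(n_splits=10, shuffle=True, random_state=trial)
    for train_idx, test_idx in outer.split(X):  # 10 outer folds
        inner = GridSearchCV(  # inner loop: hyperparameter tuning only
            OneVsRestClassifier(LogisticRegression(max_iter=1000)),
            param_grid={"estimator__C": [0.1, 1.0, 10.0]},
            scoring="f1_macro", cv=5)
        inner.fit(X[train_idx], y[train_idx])  # refits on outer-train with best params
        y_pred = inner.predict(X[test_idx])
        scores.append(f1_score(y[test_idx], y_pred, average="macro"))

# ~300 test-performance estimates for this combination
print(np.mean(scores), np.std(scores))
```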
Comparison of Data Sources
Compare performance of classifiers using abstract, full text, methods section, and methods/abstract combination.
Comparison of Classifiers
Compare performance of different classifiers. All classification will be done using BR (binary relevance) within sklearn. Base learners will include:
Comparison of Feature Spaces
Compare performance of classifiers using different feature spaces and combinations of feature spaces. We'll only be including three feature spaces:
Comparison of Dimension-wise Feature Boosting
Compare classifiers that boost dimension-wise classifier performance by predictions from other dimensions against those that don't.
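As a concrete (if approximate) stand-in for the boosted condition, sklearn's ClassifierChain gives each label's classifier the earlier labels as extra features (true labels at training time, predicted labels at test time), which could serve as a baseline comparison against plain BR until we pin down the exact boosting scheme:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain

# synthetic multilabel data standing in for the real features/labels
X, y = make_multilabel_classification(n_samples=200, n_classes=6, random_state=0)

plain_br = OneVsRestClassifier(LogisticRegression(max_iter=1000))
# chain: each label's classifier also sees the earlier labels as features
chained = ClassifierChain(LogisticRegression(max_iter=1000),
                          order="random", random_state=0)

for name, model in [("plain BR", plain_br), ("chained", chained)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    print(name, scores.mean())
```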
Possible Extensions