1) What do you mean by similar terms? Do you have any examples I can look at?
2) We can change the threshold, though I think your analyses for the paper will be helpful in figuring out the best threshold. We won't have control over the outside literature, so I don't think it would be a good idea to manipulate the train/test split to maximize accuracy. It will reduce the test dataset's generalizability.
3) We played with the parameters a bit. They might actually currently be set to the same values Dane used. However, overall there was no real citable rationale. We just used what we thought made sense.
It is strict and only counts instances of the exact term. I played around with fuzzy string matching using fuzzywuzzy, but I didn't get very far with it. Some tools, like DeepDive I believe, are smarter about this, but they're very complicated to use. The examples you have ("extend", "extens", "extent") are the result of stemming, and they remain distinct simply because of the limitations of the stemming algorithm (there's a rough sketch of what I mean at the end of this comment). We can certainly look into more advanced tools.
4) Normalizing is pretty standard, but the reason we did it is that our NLP professor mentioned doing it in class. We didn't have more of a rationale than that, and I don't know if it's problematic when applied on top of tf-idf (but I don't think it would be).
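To illustrate the string-matching point in 3): here's a rough sketch of how fuzzy matching and stemming behave on those examples. I'm assuming fuzzywuzzy and NLTK's PorterStemmer here, which may not exactly match our preprocessing.

```python
# Sketch only: fuzzy matching vs. stemming on the "extend"/"extens"/"extent" examples.
# Assumes fuzzywuzzy and NLTK's PorterStemmer, which may differ from our actual preprocessing.
from fuzzywuzzy import fuzz
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stemming truncates words, but related words can still land on different stems.
for word in ["extend", "extended", "extensive", "extension", "extent"]:
    print(word, "->", stemmer.stem(word))

# Fuzzy matching returns a similarity score (0-100) instead of requiring an exact
# match, so near-duplicate stems could in principle be merged above some cutoff.
print(fuzz.ratio("extend", "extent"))
print(fuzz.ratio("extens", "extent"))
```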
1) Look into optimizing split_data(). @mdtdev do you think splitting the data to specify the number of metadata instances in the test and training datasets is ok?
2) Is it ok to do a leave-one-out cross-validation to estimate generalizability instead of a standard train/test split?
Additional points will be addressed in #33
@mriedel56 It is not great to have to control the splits, but given the limits of our metadata it is justifiable, as long as we are clear about it. No justification has been given for the use of "at least 10" -- why was that chosen?
LOOCV is biased, so it is usually best to avoid it. I don't have the reference on the bias handy, but the logic is clear: in LOOCV you are making N models, and each model is highly correlated with every other model (compare 2 models, each of which leaves a different point out: those models are based on N-2 identical points). I forget the specifics of the bias, but it clearly comes from that correlation. Make sense?
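To make it concrete, a toy sketch (made-up data and classifier, not your pipeline) contrasting LOOCV with plain 5-fold CV in sklearn; the point is just that the N LOOCV training sets are nearly identical to one another.

```python
# Sketch only: LOOCV vs. 5-fold CV on toy data (not the actual pipeline).
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=60, n_features=20, random_state=0)
clf = LinearSVC()

# LOOCV: N models, each trained on N-1 points, so any two training sets share N-2 points.
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())

# 5-fold CV: 5 models trained on far less overlapping data.
kf_scores = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

print(loo_scores.mean(), kf_scores.mean())
```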
When using a minimum label-wise count of 30, we chose to require at least 10 papers in each dataset to allow for a 10-paper buffer. In a perfect splitting situation, we would be able to put exactly 20 papers in the training dataset and exactly 10 in the test dataset (because we did a 2/1 split), but I figured that there could be dependencies between labels that would make a perfect split impossible.
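In rough terms, the constraint we enforce is basically this (a simplified sketch of the idea behind split_data(), not the actual function; the `labels` array and the retry logic are just illustrative):

```python
# Sketch only: a naive random split that retries until every label has at least
# `min_test` positives in the test set and `min_train` in the training set.
import numpy as np

def constrained_split(labels, test_frac=1 / 3, min_train=10, min_test=10, seed=0):
    """labels: (n_papers, n_labels) binary array. Returns train/test indices."""
    rng = np.random.RandomState(seed)
    n_samples = labels.shape[0]
    n_test = int(round(n_samples * test_frac))
    for _ in range(10000):  # keep re-drawing random splits until the constraints hold
        idx = rng.permutation(n_samples)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        if (labels[test_idx].sum(axis=0) >= min_test).all() and \
           (labels[train_idx].sum(axis=0) >= min_train).all():
            return train_idx, test_idx
    raise RuntimeError("No split satisfied the per-label minimums.")
```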
Just out of curiosity, how are you handling issues of cross-validation and splitting now? I vaguely recall how Dane was doing it, and he had altered some of his predecessor's choices. But I have lost track of how all of this is being handled.
Specifically, (1) how are you balancing training of the classifiers? (2) how, if relevant, are you handling the splitting/CV on parameter setting for classifiers that require it? and (3) how are you doing CV/splitting/whatever on the final evaluations?
Also, are you doing any bootstrapping or other procedures for the final estimates? A la chapter 7 of the Bible?
We perform an initial train/test split that forces both datasets to have at least 10 instances of each label. We only use cross-validation in the feature selection step, where we train a BR/SVM classifier on the training dataset using different combinations of features and compare the CV-averaged F-scores. From that we select the most useful features for the final model.
We then train classifiers on the training dataset with those features and evaluate on the test dataset.
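Concretely, the feature selection step boils down to something like this (a sketch using sklearn's OneVsRestClassifier as the binary-relevance wrapper; the feature-set dictionary and arrays are hypothetical, not our actual variables):

```python
# Sketch only: compare candidate feature sets by CV-averaged F1 on the training data.
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def compare_feature_sets(feature_sets, y_train, cv=5):
    """feature_sets: dict mapping a name to an (n_train_papers, n_features) array."""
    scores = {}
    for name, X_train in feature_sets.items():
        clf = OneVsRestClassifier(LinearSVC())  # stand-in for the BR/SVM classifier
        f1 = cross_val_score(clf, X_train, y_train, cv=cv, scoring="f1_macro")
        scores[name] = f1.mean()  # CV-averaged F-score for this feature combination
    return scores  # keep the feature set(s) with the highest scores
```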
We really didn't put much thought into how we implemented CV, since we just did it for feature selection, so I'm sure there are flaws in our approach. Unfortunately, all of your questions from the second paragraph go over my head.
We haven't done any bootstrapping or anything else on the final estimates. This is actually the first I'm hearing of that stuff, but I think we should definitely do something if that's what's standard.
Well, I guess my main question is this: the final reported F1s in the paper should be numbers with some statistical bounds on them. If all you did was grab a single test set (10 positive instances), then the best you can achieve is a single very rough estimate of the final F1. (Rough in the sense that the value will be based on a computation that has at most 10 values in one of the slots.) Is that where your pipeline ends, or am I missing something?
The split enforces at least 10 positive instances, so most labels will have more than 10 positive instances in the test dataset, but you're right that that's where the pipeline ends.
We could use CV for the final model, but ensuring proper splits (so we aren't training/testing models on CV folds with empty labels) seemed too difficult. Is that how you would determine bounds for the final F1s?
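If bootstrapping is the way to go for bounds, I'm imagining something like this (a sketch only; the prediction arrays here are hypothetical):

```python
# Sketch only: bootstrap the test set to put rough bounds on the final F1.
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1(y_true, y_pred, n_boot=1000, seed=0):
    """y_true, y_pred: (n_test_papers, n_labels) binary arrays."""
    rng = np.random.RandomState(seed)
    n_test = y_true.shape[0]
    f1s = []
    for _ in range(n_boot):
        idx = rng.randint(0, n_test, size=n_test)  # resample test papers with replacement
        f1s.append(f1_score(y_true[idx], y_pred[idx], average="macro"))
    f1s = np.array(f1s)
    return f1s.mean(), np.percentile(f1s, [2.5, 97.5])  # point estimate and 95% interval
```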
In the 2013 work we used CV at that stage, ignoring the balancing of term presence. We were interested in an extreme lower bound. Our performance in that paper would be substantially better if we had balanced data.
I am not sure what the best approach is, to be honest. Traditionally there would be a train-validate-test (3 way split) with train for training, validate for setting local parameters for the model, and test for the final testing and estimating out-of-sample error. Possibly all of this within a larger CV wrapper (repeating the 3 way split across the entire data or something).
But that is not really usable here for us as we have such limited data. Can you give me some idea of the range of label counts? What is that distribution? Maybe we can do 3 way splits with an outer wrapper?
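Roughly what I mean by the outer wrapper, as a sketch (toy classifier and hypothetical data/grid; whether it is feasible depends entirely on your label counts):

```python
# Sketch only: a three-way split (train/validate/test) inside an outer CV loop.
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def nested_cv_f1(X, y, outer_folds=3, inner_folds=3):
    # Inner loop: the train/validate part, used only to pick local parameters (here C).
    param_grid = {"estimator__C": [0.1, 1.0, 10.0]}
    inner = GridSearchCV(OneVsRestClassifier(LinearSVC()), param_grid,
                         cv=inner_folds, scoring="f1_macro")
    # Outer loop: repeated held-out "test" estimates of out-of-sample F1.
    outer = KFold(n_splits=outer_folds, shuffle=True, random_state=0)
    return cross_val_score(inner, X, y, cv=outer, scoring="f1_macro")
```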
Are you doing unbalanced training where you guarantee a minimum of positive labels but allow all of the negative cases?
Tomorrow, when I can get on our HPC, I will re-run our labeling script without thresholding and create a histogram and a count file, which I'll put up here.
We don't perform any thresholding of negative cases, because there weren't any labels that I saw where that would be a problem. If I missed any overpopulated labels, I'm sure the histogram and count file will help us identify them.
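The counting itself will be something along these lines (a sketch; the actual labeling script and file names differ):

```python
# Sketch only: tally per-label counts and plot a histogram of them.
# File names are hypothetical; papers are rows, labels are columns, values are 0/1.
import matplotlib.pyplot as plt
import pandas as pd

labels = pd.read_csv("labels.csv", index_col=0)
counts = labels.sum(axis=0).sort_values(ascending=False)

counts.to_csv("label_counts.csv")   # the count file
counts.plot(kind="hist", bins=30)   # distribution of positive-paper counts per label
plt.xlabel("Positive papers per label")
plt.savefig("label_count_histogram.png")
```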
Sounds good.
The histograms are uploaded here. This is the count file.
@tsalo
A few questions concerning functions in data_preparation.py
1) process_corpus: I'm sure a lot of the term manipulation is built in, but are we sure it's doing what we want it to? Taking a look at the resulting processed data, it seems like a lot of characters are truncated, but many terms that are similar still exist. This part actually leads into my 3rd question...
2) split_data: Overall I'm fine with this part, just a couple of thoughts: a) Can we increase the minimum of 10 instances? Is there a threshold there to report? b) I imagine random assignment to test/training datasets is probably the best approach, but is there a way to look at minimizing the dissimilarity between test and training in an effort to get the highest accuracies? Or is that cheating, and would it potentially result in the inability to characterize minimally investigated studies?
3) gazetteers: This relates just to the NBOW because that's all I've really taken a look at. The tfidf vectorizer function... did y'all investigate those parameters (min_df, max_df, max_features), or were they chosen from some reference? Also, how does this function consider strings that are similar? Is it strict, meaning the term, exactly as it appears, must be present in the pre-processed text, or does a partial match also count? For example, in the NBOW there are "extend", "extens", and "extent"; are all of those treated completely independently from one another? Maybe we could do a little more hedging here. It might be exhaustive, but could we develop term similarity stats such that all three of those terms above could just be replaced by "extent"? Since context within the text doesn't matter, I wouldn't find that problematic. Thoughts?
4) extract_features.extract_nbow: You call tfidf.transform here, which I assume first tallies the raw counts, then normalizes according to the frequency of overall terms in the document, right? Then you apply another normalization for the sum of the frequency of a term across documents, right? Can you do that? Like normalize using the sum of frequencies?
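Just to make sure I'm reading it right, this is roughly the two-step process I'm picturing (a sketch with sklearn's TfidfVectorizer and made-up documents; not our actual code, and the extra step is just my guess at what extract_nbow might be doing):

```python
# Sketch only: tf-idf followed by an extra normalization, to illustrate the question.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

docs = ["functional mri study of memory",
        "structural mri study of the extent of a lesion"]

# Step 1: raw term counts reweighted by tf-idf. By default each document (row)
# is also L2-normalized.
tfidf = TfidfVectorizer(min_df=1, max_df=1.0, max_features=5000)
X = tfidf.fit_transform(docs)

# Step 2 (the part I'm asking about): an additional normalization over the sum
# of a term's frequency across documents, i.e. down each column.
X_extra = normalize(X, norm="l1", axis=0)
```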