NBCLab / athena

Tool for mining and synthesis of cognitive neuroimaging tasks.
https://doi.org/10.3389/fnins.2019.00494

Describe v2.0. #21

Closed tsalo closed 4 years ago

tsalo commented 7 years ago

@mdtdev @mriedel56 This seems like a good place to give a brief overview of the pipeline Jason and I used for our version of ATHENA. Based on what you two think, we can figure out the ultimate structure for the project. I want to differentiate between the paper and the tool, though: the paper should include more features (to be reduced via feature selection), as well as more visualization and evaluation methods than the final tool.

The feature spaces Jason and I tested out:

tsalo commented 7 years ago

Here are the general steps to the pipeline we used:

  1. Process corpus:
    1. Extract text from PDFs using Java code if text-based and OCR if image-based.
    2. Expand abbreviations in the text using regular expressions written by Jason (see the text-cleaning sketch after this list).
    3. Convert British spellings to their American forms.
    4. Create a stemmed corpus in the text/stemmed_full/ folder.
    5. Separate out the references sections and save them to the text/references/ folder.
  2. Process labels:
    1. Convert labels to hierarchical form. This involved changing some labels that weren't hierarchical but that we thought should be (e.g., Word Generation (Covert) became WordGeneration.Covert).
    2. Convert csv files to matrix form (papers x terms).
      • This includes counting specific labels toward their parents (e.g., BehavioralDomain.Perception.Somesthesis.Pain counts toward BehavioralDomain.Perception.Somesthesis and BehavioralDomain.Perception). See the label-matrix sketch after this list, which also computes the statistics in step 2.4.
    3. Remove labels with fewer than X positive instances (we used 5 in our project but recommend 30).
    4. Calculate dataset statistics:
      • Number of instances
      • Number of features
      • Number of labels
      • Label cardinality
      • Label density
      • Number of unique labelsets
  3. Split corpus:
    • Randomly split into training and test datasets (67/33).
    • We made sure that each label occurred at least once in the test dataset and at least twice in the training dataset (ideally, each minimum would be 10x).
  4. Generate a gazetteer for each feature space (n=7) using the full dataset. (In retrospect, we probably should have used just the training dataset, to avoid leaking information from the test set.)
  5. Extract features for training and test datasets.
  6. Run feature space selection with the training dataset.
    1. Concatenate all combinations of feature count matrices.
    2. Perform classification on each feature combination with a "simple" multilabel classifier (binary relevance on top of sklearn's LinearSVC), using 10-fold cross-validation (see the classification sketch after this list).
    3. Evaluate each combination's performance using F1, precision, and recall.
    4. Determine (qualitatively) the best-performing combination with the fewest features. We chose Cognitive Atlas/NBOW/Title words.
  7. Perform "real" classification:
    1. Convert labels and features to MEKA format.
    2. Run 5 classifiers using MEKA. Jason wrote the code to do this.
    3. Take a majority vote across the classifiers' predictions (see the majority-vote sketch after this list).
    4. Evaluate performance on test dataset with F1, precision, and recall.
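
For concreteness, here's a minimal sketch of the text cleaning in steps 1.2 and 1.3. The abbreviation patterns and spelling map below are hypothetical stand-ins, not Jason's actual regular expressions:

```python
import re

# Hypothetical examples; the real pipeline used a much larger set of
# regular expressions written by Jason.
ABBREVIATIONS = {
    r"\bfMRI\b": "functional magnetic resonance imaging",
    r"\bEEG\b": "electroencephalography",
}
# Hypothetical British -> American spelling map.
BRITISH_TO_AMERICAN = {
    "behaviour": "behavior",
    "colour": "color",
    "analyse": "analyze",
}

def clean_text(text):
    """Expand abbreviations, then convert British spellings to American."""
    for pattern, expansion in ABBREVIATIONS.items():
        text = re.sub(pattern, expansion, text)
    for british, american in BRITISH_TO_AMERICAN.items():
        text = re.sub(r"\b{0}\b".format(british), american, text)
    return text

print(clean_text("EEG studies of colour perception"))
# electroencephalography studies of color perception
```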
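Here's a sketch of the label-matrix conversion (step 2.2) and the statistics in step 2.4, assuming dot-delimited hierarchical labels; the example labels and helper names are made up:

```python
import pandas as pd

# Hypothetical example labels in dot-delimited hierarchical form.
paper_labels = {
    "paper1": ["BehavioralDomain.Perception.Somesthesis.Pain"],
    "paper2": ["BehavioralDomain.Perception.Somesthesis"],
}

def expand_with_parents(labels):
    """Count each label toward all of its ancestors in the hierarchy."""
    expanded = set()
    for label in labels:
        parts = label.split(".")
        for i in range(1, len(parts) + 1):
            expanded.add(".".join(parts[:i]))
    return expanded

all_labels = sorted(set.union(*[expand_with_parents(v) for v in paper_labels.values()]))
df = pd.DataFrame(0, index=sorted(paper_labels), columns=all_labels)
for paper, labels in paper_labels.items():
    for label in expand_with_parents(labels):
        df.loc[paper, label] = 1

# Dataset statistics: cardinality is the mean number of labels per paper;
# density is cardinality divided by the total number of labels.
cardinality = df.sum(axis=1).mean()
density = cardinality / df.shape[1]
n_unique_labelsets = df.drop_duplicates().shape[0]
print(cardinality, density, n_unique_labelsets)
```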
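The "simple" classifier in step 6.2 was binary relevance over a linear SVC; a minimal sklearn sketch with placeholder data (the real features would be the concatenated count matrices):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(200, 50)            # placeholder papers x features matrix
Y = rng.randint(0, 2, (200, 5))  # placeholder papers x labels matrix

# Binary relevance: fit one independent LinearSVC per label.
clf = OneVsRestClassifier(LinearSVC())
scores = cross_val_score(clf, X, Y, cv=10, scoring="f1_macro")
print(scores.mean())
```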
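And the majority vote in step 7.3 is just thresholding summed binary predictions; a sketch assuming each of the 5 classifiers outputs a papers x labels indicator matrix:

```python
import numpy as np

# Placeholder: stacked binary predictions from 5 classifiers,
# each of shape (n_papers, n_labels).
rng = np.random.RandomState(0)
predictions = rng.randint(0, 2, (5, 10, 3))

# A label is assigned when more than half of the classifiers predict it.
votes = predictions.sum(axis=0)
majority = (votes > predictions.shape[0] / 2).astype(int)
print(majority)
```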
tsalo commented 7 years ago

We are dropping several of the feature spaces and switching from hold-out-based evaluation to cross-validation with StratifiedKFold (to preserve class proportions across folds) and nesting (to perform hyperparameter tuning and, potentially, feature selection). See the sketch below.
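
For reference, a minimal sketch of the nested setup with placeholder data, simplified to a single binary label (sklearn's StratifiedKFold doesn't handle multilabel targets directly, so the real pipeline will need to stratify differently, e.g., on labelsets):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.rand(100, 20)         # placeholder feature matrix
y = rng.randint(0, 2, 100)    # placeholder binary label

outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    # Inner loop: tune hyperparameters on the training fold only.
    search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]},
                          cv=inner_cv, scoring="f1")
    search.fit(X[train_idx], y[train_idx])
    # Outer loop: evaluate the tuned model on the held-out fold.
    scores.append(search.score(X[test_idx], y[test_idx]))

print(np.mean(scores))
```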