NBCLab / athena

Tool for mining and synthesis of cognitive neuroimaging tasks.
https://doi.org/10.3389/fnins.2019.00494

Describe v2.0. #21

Closed tsalo closed 4 years ago

tsalo commented 7 years ago

@mdtdev @mriedel56 This seems like a good place to give a brief overview of the pipeline Jason and I used for our version of ATHENA. Based on what you two think, we can figure out the ultimate structure for the project. I want to differentiate between the paper and the tool, though: the paper should include more features (to be reduced via feature selection), as well as more visualization and evaluation methods than the final tool.

The feature spaces Jason and I tested out:

tsalo commented 7 years ago

Here are the general steps to the pipeline we used:

  1. Process corpus:
    1. Extract text from PDFs using Java code if text-based and OCR if image-based.
    2. Expand abbreviations in the text using regular expressions written by Jason (see the text-cleaning sketch after this list).
    3. Convert British spellings to their American forms.
    4. Create a stemmed corpus in the text/stemmed_full/ folder.
    5. Separate out the references sections and save them to the text/references/ folder.
  2. Process labels:
    1. Convert labels to hierarchical form. This involved changing some labels that weren't hierarchical but that we thought should be (e.g., Word Generation (Covert) became WordGeneration.Covert).
    2. Convert csv files to matrix form (papers x terms).
      • This includes counting specific labels toward their parents (e.g., BehavioralDomain.Perception.Somesthesis.Pain counts toward BehavioralDomain.Perception.Somesthesis and BehavioralDomain.Perception). See the label-matrix sketch after this list, which also computes the statistics in step 2.4.
    3. Remove labels with fewer than X positive instances (we used 5 in our project but recommend 30).
    4. Calculate dataset statistics:
      • Number of instances
      • Number of features
      • Number of labels
      • Label cardinality
      • Label density
      • Number of unique labelsets
  3. Split corpus:
    • Randomly split into training and test datasets (67/33).
    • We made sure that each label occurred at least once in the test dataset and at least twice in the training dataset (ideally, each minimum would be 10x).
  4. Generate a gazetteer for each feature space (n=7) using the full dataset. (In retrospect, we probably should have used just the training dataset, to avoid leaking information from the test set.)
  5. Extract features for training and test datasets.
  6. Run feature space selection with the training dataset.
    1. Concatenate all combinations of feature count matrices.
    2. Perform classification on each feature combination with a "simple" multilabel classifier (binary relevance on top of sklearn's LinearSVC), using 10-fold cross-validation (see the classification sketch after this list).
    3. Evaluate each combination's performance using F1, precision, and recall.
    4. Determine (qualitatively) the best-performing combination with the fewest features. We chose Cognitive Atlas/NBOW/Title words.
  7. Perform "real" classification:
    1. Convert labels and features to MEKA format.
    2. Run 5 classifiers using MEKA. Jason wrote the code to do this.
    3. Take a majority vote across the classifiers' predictions (see the majority-vote sketch after this list).
    4. Evaluate performance on test dataset with F1, precision, and recall.
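
For concreteness, here's a minimal sketch of the text cleaning in steps 1.2 and 1.3. The abbreviation patterns and spelling map below are hypothetical stand-ins, not Jason's actual regular expressions:

```python
import re

# Hypothetical examples; the real pipeline used a much larger set of
# regular expressions written by Jason.
ABBREVIATIONS = {
    r"\bfMRI\b": "functional magnetic resonance imaging",
    r"\bEEG\b": "electroencephalography",
}
# Hypothetical British -> American spelling map.
BRITISH_TO_AMERICAN = {
    "behaviour": "behavior",
    "colour": "color",
    "analyse": "analyze",
}

def clean_text(text):
    """Expand abbreviations, then convert British spellings to American."""
    for pattern, expansion in ABBREVIATIONS.items():
        text = re.sub(pattern, expansion, text)
    for british, american in BRITISH_TO_AMERICAN.items():
        text = re.sub(r"\b{0}\b".format(british), american, text)
    return text

print(clean_text("EEG studies of colour perception"))
# electroencephalography studies of color perception
```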
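Here's a sketch of the label-matrix conversion (step 2.2) and the statistics in step 2.4, assuming dot-delimited hierarchical labels; the example labels and helper names are made up:

```python
import pandas as pd

# Hypothetical example labels in dot-delimited hierarchical form.
paper_labels = {
    "paper1": ["BehavioralDomain.Perception.Somesthesis.Pain"],
    "paper2": ["BehavioralDomain.Perception.Somesthesis"],
}

def expand_with_parents(labels):
    """Count each label toward all of its ancestors in the hierarchy."""
    expanded = set()
    for label in labels:
        parts = label.split(".")
        for i in range(1, len(parts) + 1):
            expanded.add(".".join(parts[:i]))
    return expanded

all_labels = sorted(set.union(*[expand_with_parents(v) for v in paper_labels.values()]))
df = pd.DataFrame(0, index=sorted(paper_labels), columns=all_labels)
for paper, labels in paper_labels.items():
    for label in expand_with_parents(labels):
        df.loc[paper, label] = 1

# Dataset statistics: cardinality is the mean number of labels per paper;
# density is cardinality divided by the total number of labels.
cardinality = df.sum(axis=1).mean()
density = cardinality / df.shape[1]
n_unique_labelsets = df.drop_duplicates().shape[0]
print(cardinality, density, n_unique_labelsets)
```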
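The "simple" classifier in step 6.2 was binary relevance over a linear SVC; a minimal sklearn sketch with placeholder data (the real features would be the concatenated count matrices):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(200, 50)            # placeholder papers x features matrix
Y = rng.randint(0, 2, (200, 5))  # placeholder papers x labels matrix

# Binary relevance: fit one independent LinearSVC per label.
clf = OneVsRestClassifier(LinearSVC())
scores = cross_val_score(clf, X, Y, cv=10, scoring="f1_macro")
print(scores.mean())
```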
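And the majority vote in step 7.3 is just thresholding summed binary predictions; a sketch assuming each of the 5 classifiers outputs a papers x labels indicator matrix:

```python
import numpy as np

# Placeholder: stacked binary predictions from 5 classifiers,
# each of shape (n_papers, n_labels).
rng = np.random.RandomState(0)
predictions = rng.randint(0, 2, (5, 10, 3))

# A label is assigned when more than half of the classifiers predict it.
votes = predictions.sum(axis=0)
majority = (votes > predictions.shape[0] / 2).astype(int)
print(majority)
```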
tsalo commented 7 years ago

We are dropping several of the feature spaces and switching from hold-out-based evaluation to cross-validation with StratifiedKFold (to preserve class proportions across folds) and nesting (to perform hyperparameter tuning and, potentially, feature selection). See the sketch below.
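
For reference, a minimal sketch of the nested setup with placeholder data, simplified to a single binary label (sklearn's StratifiedKFold doesn't handle multilabel targets directly, so the real pipeline will need to stratify differently, e.g., on labelsets):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.rand(100, 20)         # placeholder feature matrix
y = rng.randint(0, 2, 100)    # placeholder binary label

outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    # Inner loop: tune hyperparameters on the training fold only.
    search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]},
                          cv=inner_cv, scoring="f1")
    search.fit(X[train_idx], y[train_idx])
    # Outer loop: evaluate the tuned model on the held-out fold.
    scores.append(search.score(X[test_idx], y[test_idx]))

print(np.mean(scores))
```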