Customize Stopwords - Githubissues

nkmeyers commented 4 years ago

We need a way for users to customize the stopwords list and/or swap in their own for use by the various NLP processes that check a stopwords list. @ericleasemorgan I think this enhancement relies on us writing some stopwords documentation (showing which processes read a stopwords list, where the default is, and what is in it) and explaining how to customize or swap out the default list(s). Then maybe also include some UI on the Airavata job input that lets a user indicate whether they want to use customized stopwords list(s) and submit them along with their input target(s)?
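To make the "customize or swap out" idea concrete, here is a minimal sketch in Python. It is not the Reader's actual code: `DEFAULT_STOPWORDS`, `parse_stopwords`, `build_stopwords`, and `remove_stopwords` are hypothetical names, and the real default list is not shown in this thread. It assumes a user submits a plain-text list, one word per line, and chooses either to extend or to replace the default.

```python
from typing import Iterable, List, Optional, Set

# Hypothetical stand-in for the default list; the real one is not shown here.
DEFAULT_STOPWORDS: Set[str] = {"the", "a", "an", "and", "of", "to", "in"}


def parse_stopwords(lines: Iterable[str]) -> Set[str]:
    """Read one stopword per line, skipping blank lines and '#' comments."""
    words: Set[str] = set()
    for line in lines:
        word = line.strip().lower()
        if word and not word.startswith("#"):
            words.add(word)
    return words


def build_stopwords(custom: Optional[Iterable[str]] = None,
                    replace: bool = False) -> Set[str]:
    """Return the list to use: default, default plus custom, or custom only."""
    if custom is None:
        return set(DEFAULT_STOPWORDS)
    user_words = parse_stopwords(custom)
    return user_words if replace else DEFAULT_STOPWORDS | user_words


def remove_stopwords(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]
```

Because `parse_stopwords` accepts any iterable of lines, the same function would work on an open file handle or on a list submitted through a job form.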

FYI, this ticket was spurred by @ralphlevan's test Pratchett Carrel.

ralphlevan commented 4 years ago

I think we need more than just stopwords. The huge number of "he said" and "she said" suggests that we may want some n-gram exclusion patterns. That would be a nice thing for users to add as part of an iterative refinement of their product.
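One way n-gram exclusion could work, as a rough sketch only: treat each excluded pattern as a tuple of lowercase tokens and skip any run of tokens that matches one. The function name `drop_ngrams` and the tuple representation are assumptions made for illustration, not anything the Reader currently does.

```python
from typing import List, Sequence, Set, Tuple


def drop_ngrams(tokens: Sequence[str],
                excluded: Set[Tuple[str, ...]]) -> List[str]:
    """Remove every occurrence of an excluded n-gram from a token sequence."""
    # Try longer patterns first so a 3-gram wins over a 2-gram it contains.
    # Empty patterns are ignored so the scan always moves forward.
    sizes = sorted({len(p) for p in excluded if p}, reverse=True)
    lowered = [t.lower() for t in tokens]
    kept: List[str] = []
    i = 0
    while i < len(tokens):
        for n in sizes:
            if tuple(lowered[i:i + n]) in excluded:
                i += n  # skip the whole matched n-gram
                break
        else:
            # No excluded pattern starts here, so keep the token.
            kept.append(tokens[i])
            i += 1
    return kept


patterns = {("he", "said"), ("she", "said")}
tokens = ["Death", "he", "said", "grinned", "She", "said", "hello"]
cleaned = drop_ngrams(tokens, patterns)  # ["Death", "grinned", "hello"]
```

Matching is case-insensitive here, but the surviving tokens keep their original case.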

molikd commented 4 years ago

Could we automate stopwords? Open question. The 1000 most common for sure. But proper nouns are very difficult.

The problem is that you can develop so many stopwords based on frequency of occurrence, but at some point you would have to spill over to entity detection.
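The frequency-based part is easy to sketch. The following is an illustration only, assuming a simple regex tokenizer and a hypothetical `candidate_stopwords` helper; it counts words across a corpus and proposes the `top_n` most common as stopword candidates.

```python
import re
from collections import Counter
from typing import Iterable, List


def candidate_stopwords(documents: Iterable[str], top_n: int = 1000) -> List[str]:
    """Propose the top_n most frequent words in a corpus as stopword candidates."""
    counts: Counter = Counter()
    for text in documents:
        # Crude tokenizer (an assumption): lowercase runs of letters,
        # digits, apostrophes, and hyphens.
        counts.update(re.findall(r"[a-z0-9'-]+", text.lower()))
    return [word for word, _ in counts.most_common(top_n)]


docs = ["the virus and the host", "the host cell"]
candidates = candidate_stopwords(docs, top_n=2)  # ["the", "host"]
```

The toy result shows the limitation: "host" is frequent in this corpus but is a content word, so raw frequency alone cannot tell it from a true stopword.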

Imagine some kind of machine learning algorithm that is trained to find stopwords in a corpus of documents. The algorithm starts by using the most common words, then maybe a dictionary, but you still get "Elsevier" in your results. So you add "Elsevier" to the dictionary, but then you get "PLoS", so you add that... and so on. The problem is that you, the user, are deciding to add these words to the list. You know that Elsevier and PLoS are publishers, but how does the machine learning algorithm know to get rid of publisher proper nouns and not others like "Covid-19" or "Manhattan Plot"?

Could we look at the context in which the word sits? That's how we, as humans, know to get rid of a word, but you'd probably have to rely on some neural net black box. Better to rely on some kind of explainable statistical model. So, is there something about stopwords that tells them apart from other words? Maybe. Maybe the words associated with stopwords carry some signal. Let's say we only care about a stopword if it is causing some problem in clustering or factorization. So we build a cluster of associated words for each of the words that are separating clusters/topics, and we test these associated words for relevance. Maybe we assume that all words that are separating clusters are going to have similar associated words, and if they don't (past some threshold of similarity), they're no longer talking about the same things, so we remove them. It'd basically just be training the LDA topic modeler to not be perfect, which I think is what we want.

ericleasemorgan commented 4 years ago

I am not ignoring y'all. Interesting discussions, and I encourage them to continue.

Concurrently, we need to: 1) get the whole thing running, and 2) then do enhancements. When it comes to Item #1, we have to:

1. harvest/cache the data set (done)
2. stuff the result into a database (done)
3. enhance the database with additional content (all but done)
4. index the database (all but done)
5. make it easy for Team CORD to create study carrels (half done)
6. make many carrels (barely started)
7. create a Web presence (almost done)

Once we get that far, which I anticipate will be by next Friday, we can go for enhancements, and there are many possibilities:

add long titles to list of carrels
allow people other than Team CORD to create study carrels
create a study carrel out of the whole of CORD, which requires scalability
create better stop word list
enable the whole "library" to be re-created
enhance author names with corresponding ORCIDs
enhance Web presence with additional logos and attributions
extract additional grammars
figure out a way to dynamically create stop word list
generate additional measures of the documents
hyperlink bibliographic items to full text and other things
illustrate relationships using a network diagram
improve topic modeling
index study carrels
make everything FAIR
plot results on a map
plot results on a time line
refine entity output

As we enhance, we will repeatedly go back to Step #6 and re-build study carrels over and over, thus the carrels will be in a state of "continuous improvement".†

The whole thing is like playing guitar. First you need to learn how to hold it. Then you need to learn how to tune it. Then you need to learn a few chords. After that you need to learn how to "keep time". Once you get that far, then you can concentrate on bending notes, advance to finger picking, playing syncopation, experimenting with alternative tunings, moving the chords up and down the fret board, improvising, playing in various styles, performing, recording, etc.

We are getting there. I assure you. Please continue to discuss all of these things, and once we get the Reader running, we will prioritize enhancements, divvy up the work, and make the whole something we can be proud of.

† I can't believe I actually used that phrase.

-- Eric M.

ralphlevan commented 4 years ago

I absolutely understand and think I understand the priorities. Get it working first! Lipstick later.

ericleasemorgan commented 4 years ago

Who are "the users"?

ericleasemorgan / reader

Customize Stopwords #58