WheatonCS / Lexos

Python/Flask-based website for text analysis workflow. Previous (stable) release is live at:
http://lexos.wheatoncollege.edu
MIT License

Culling by least frequent words #1045

Open scottkleinman opened 4 years ago

scottkleinman commented 4 years ago

It would be really nice to create a feature that does the opposite of most frequent word culling so that people can study low-frequency words in their collections. Adding this to the back end should be pretty easy; the challenge will be the UI. I'm tempted to say that we just make the tooltip advise the user to enter a negative number (right now that generates a validation error).
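For reference, a minimal sketch of how a negative cull value could be interpreted on the back end. The function name `cull_terms` and the list-of-Counters corpus shape are illustrative only, not Lexos's actual code:

```python
from collections import Counter

def cull_terms(doc_term_counts, n):
    """Return the terms to keep: the n most frequent terms across the
    corpus, or, if n is negative, the |n| least frequent terms
    (the proposed extension)."""
    # doc_term_counts: one Counter per document (hypothetical shape)
    totals = Counter()
    for counts in doc_term_counts:
        totals.update(counts)
    ranked = totals.most_common()  # sorted most -> least frequent
    if n >= 0:
        selected = ranked[:n]      # "use the top" (existing behaviour)
    else:
        selected = ranked[n:]      # "use the bottom" (negative number)
    return {term for term, _ in selected}

docs = [Counter({"the": 9, "sea": 3, "whale": 1}),
        Counter({"the": 7, "ship": 2, "whale": 4})]
print(cull_terms(docs, -2))  # the two rarest terms in the corpus
```

A single slicing change handles both directions, which is why the back-end side of this is "pretty easy" compared with the UI.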

WeaselPeasel commented 3 years ago

Hi Scott! A couple of peers and I are currently working on this enhancement, and the bare minimum is already done (the user enters a negative number and the corpus is culled down to its least frequent words). There was a bug where the data would not be returned to the front end unless the user also chose to cull by the number of documents a word appears in. I am curious how exactly this feature should work: should the user only be able to cull by either least frequent or most frequent words, or can the two be combined somehow? Second, how do we ensure that the least frequent words are actually present in at least some of the documents? Thanks for any feedback and guidance! (The current implementation can be found in branch "SS2021/cull_by_least_frequent")

scottkleinman commented 3 years ago

I may be answering your question too late at night, and I haven't looked at the cull_by_least_frequent branch yet, but here are some initial thoughts.

The suggestion to use a negative number was a bit desperate, prompted by the lack of screen real estate. Now that I think about it, perhaps the easiest UI solution to implement would be to replace "Use the top" with a select menu that also includes "Use the bottom". (Side issue: would "most frequent/least frequent" fit instead of "top/bottom"?) Possibly a switch or toggle button would do. In other words, you can combine the UI controls, but I think culling based on most frequent and least frequent words together would be odd. Conceivably, you might choose this combined set as the "most distinctive" words in the corpus, but Top Words is really the "go to" tool for that. (If you really needed to, you could download the DTM from tokenise, compile a Keep Words list, and then scrub everything else out of your corpus to achieve the effect of culling everything but the most and least frequent words.)
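The Keep Words workaround described above could be sketched like this. All names here are hypothetical, and the term totals are assumed to be available as a plain dict rather than the actual downloaded DTM:

```python
from collections import Counter

def keep_top_and_bottom(term_totals, k):
    """Combine the k most frequent and k least frequent terms into one
    keep list, mimicking the DTM-download + Keep Words workaround."""
    ranked = [term for term, _ in Counter(term_totals).most_common()]
    return set(ranked[:k]) | set(ranked[-k:])

def scrub(tokens, keep):
    """Drop every token not on the keep list (a stand-in for scrubbing
    everything else out of the corpus)."""
    return [tok for tok in tokens if tok in keep]

totals = {"a": 10, "b": 8, "c": 5, "d": 2, "e": 1}
keep = keep_top_and_bottom(totals, 2)
print(scrub(["a", "c", "e", "b", "c"], keep))
```

This is clearly a roundabout path, which supports the point that Top Words is the better tool when "most distinctive" words are what you're after.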

I'm a little concerned by your second question because it seems to me that the same problem could potentially occur in the most frequent words scenario (although it is less likely there). Perhaps this concern is just because I have not looked at how the most frequent words function is implemented, but it may need some more rigorous testing. What exactly happens if you select the 25 most frequent words for each document and no term is shared by any document? Do you keep dipping further into the documents until you compile 25 terms that are shared by however many documents you have specified? (This sounds like a nightmare of an algorithm to me, but maybe somebody has done it.) Or do you just return an error message stating that none of the top 25 terms in any of the documents was shared by N documents? I think the latter is what users will expect. I'd imagine a similar procedure for least frequent words.
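The error-message behaviour suggested above might look something like this sketch (function and parameter names are illustrative, not the actual Lexos most-frequent-words implementation):

```python
from collections import Counter

def cull_most_frequent(docs, top_k, min_docs):
    """Take each document's top_k most frequent terms, then keep only the
    candidates that appear in at least min_docs documents. If none
    survive, report an error rather than dipping deeper into the
    documents (the behaviour users are expected to prefer)."""
    # docs: one Counter of term frequencies per document (assumed shape)
    candidates = set()
    for counts in docs:
        candidates.update(term for term, _ in counts.most_common(top_k))
    # document frequency of each candidate term
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(term for term in counts if term in candidates)
    kept = {term for term in candidates if doc_freq[term] >= min_docs}
    if not kept:
        raise ValueError(
            f"None of the top {top_k} terms in any document "
            f"appeared in at least {min_docs} documents."
        )
    return kept
```

The same shape would work for least frequent words by swapping `most_common(top_k)` for a bottom-k selection, so the two paths could share the document-frequency check.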

I hope I've correctly understood your questions and provided at least some guidance. I'll try to look at the code sometime tomorrow.