archivesunleashed / auk-notebooks

Jupyter notebooks to assist in creating additional analysis and visualizations of Archives Unleashed Cloud derivatives.
https://cloud.archivesunleashed.org
Other
11 stars 5 forks source link

Add user stop words Madlib. #45 #48

Closed greebie closed 5 years ago

greebie commented 5 years ago

This PR gives the user the options to use NLTK stopwords and/or add their own. This was a common demand from the datathon.

I also removed the International () function call because it is not doing what I originally thought it was doing.

greebie commented 5 years ago

Okay. Will fix.

I knew you were going to say that about international(), but I just couldn't bear seeing it just be wrong and out there in the open for such a small change. Also, I knew that #42 was still in the issues, so the tracking is still available.

greebie commented 5 years ago

Latest commit should resolve all issues stated above.

greebie commented 5 years ago

I think that's a different issue which might be fixed more generally in another branch, especially if we look at making a module to cover the functions.

Essentially we have general madlibs and more specific madlibs and it's not always clear which is which. I don't think that it's that helpful to tell people to scroll up to change something. Let's look at the overall explanations / docs / comments together as a separate issue. (Maybe after receiving feedback from the datathon as well.)

ianmilligan1 commented 5 years ago

OK. I still don't think

vals = [x[1] for x in tokens if x[0] not in STOP_WORDS] # Filter based on STOP_WORDS (above).

is clear? Filter what? Filter in? Filter out? Can you try to take a kick at the can to make it a bit more usable?

ruebot commented 5 years ago

I think that's a different issue which might be fixed more generally in another branch, especially if we look at making a module to cover the functions.

No. That's in scope for this PR. Ian's suggestion makes sense. Can you update your PR to include that, and we can continue to iterate from there.

greebie commented 5 years ago

Now that I've been able to look at this more closely, I see your point. I have some follow-up work this morning, but will get this fixed in the afternoon.