Natural Language Processing based Package - ideas

kanishkamisra commented 5 years ago

I have two ideas that I believe would be a great addition to the NLP universe of packages in R:

1. Measuring quality of generated Natural Language texts in a reference-rich setting

The prime example here is Machine Translation and the BLEU collection of metrics. These metrics evaluate the quality of text generated by a machine learning algorithm against human-annotated reference samples. For an example, see Arthur Chan's presentation on Machine Translation.
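To make the idea concrete, here is a minimal sketch of sentence-level BLEU: clipped (modified) n-gram precision combined with a brevity penalty. It is written in Python purely for illustration and is not the footrulr API; the function names `ngrams`, `modified_precision`, and `bleu` are hypothetical, and the sketch assumes pre-tokenized input, uniform n-gram weights, and no smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Candidate counts are clipped so repeated words cannot inflate the score.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights (a sketch)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    c = len(candidate)
    # Effective reference length: the reference closest in length to the candidate.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

The clipping step is what separates BLEU from plain precision: a candidate made of seven copies of "the" scores only 2/7 on unigrams against "the cat is on the mat", because "the" appears just twice in the reference.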

Now, BLEU isn't the best metric in the world, and there are several alternatives that come closer to measuring the true quality of candidate texts. For a quick reference, look at Rachael Tatman's amazing blog post.

I had plans to start working on something like this, and have an incomplete implementation ready here: https://github.com/kanishkamisra/footrulr

Some things I would love to work on in relation to this package are:

Finish implementing the BLEU metrics along with useful examples.
Implement alternatives to BLEU (which are more appropriate but seldom used).
Work on showing proof of concept by taking sample outputs from successful MT systems.

Some other cases where metrics like BLEU might be useful include Question Answering and Image Captioning!

2. Package for word and sentence representations

Word representations have become a fundamental building block in modern, learning-based NLP papers. In brief, these are dense vectors, one per word, learned from open text so that some semantic and morphosyntactic properties of the words are captured in the vector representations.
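As a rough illustration of what "capturing semantic properties" means in practice, related words end up with vectors that point in similar directions, which is usually measured with cosine similarity. The sketch below is in Python for illustration only; the three-dimensional `toy_vectors` are made-up values, not output from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings; real ones typically have 50 to 300+ dimensions.
toy_vectors = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
fruit = cosine_similarity(toy_vectors["king"], toy_vectors["banana"])
print(royal > fruit)  # the related pair is closer than the unrelated pair
```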

There is already a package that does this (wordVectors, thanks to Ben Schmidt), which provides functionality for training word2vec and loading pretrained vectors, but I was hoping to:

maybe make read times faster
make it easier to represent sentences as collections of word vectors, so that learning over document-level features becomes easier
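
One possible reading of the second point, sketched in Python for illustration: a sentence becomes a matrix with one row per known word, and a simple document-level feature is the column-wise mean of that matrix. The helper names `sentence_matrix` and `mean_vector` are hypothetical and are not part of wordVectors; silently skipping out-of-vocabulary words is an assumption of this sketch.

```python
def sentence_matrix(tokens, vectors):
    """Stack the vectors of the known tokens, one row per token."""
    return [vectors[t] for t in tokens if t in vectors]


def mean_vector(tokens, vectors, dim):
    """Average the word vectors of a sentence into one fixed-length vector."""
    rows = sentence_matrix(tokens, vectors)
    if not rows:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(column) / len(rows) for column in zip(*rows)]


# Hypothetical two-dimensional vectors, just to show the shapes involved.
vectors = {"good": [1.0, 3.0], "movie": [3.0, 5.0]}
features = mean_vector(["good", "movie", "unknownword"], vectors, dim=2)
```

Averaging discards word order, so it is only a baseline; keeping the full matrix around leaves room for order-aware models later.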

If these modifications involve heavy code reuse, we probably don't need to make a new package; we could just make a PR to wordVectors itself. But if we do write a lot of new code, we can make a new package!

If anyone has any opinions on these ideas or maybe a different but interesting NLP based package idea, please let me know and we can all collaborate!

emilyriederer commented 5 years ago

These sound really cool! I don't know a ton about NLP but was just listening to something about BLEU over the weekend and found it very interesting. Since you already have a package partly up and running, do you have any specific thoughts on how/where you would want collaborators to jump in?

No pressure, but if you want to think about some discrete tasks, put them up as issues, and tag them with Chicago R Unconference, this might make it easier for others to join in the fun! Otherwise, there will be plenty of brainstorming / project planning time on Saturday. 😄

kanishkamisra commented 5 years ago

That makes sense! I’ll work on it and post an update on Slack soon. Thanks for letting me know about this :)

kanishkamisra commented 5 years ago

UPDATE: Added some issues for the BLEU package, check them out here: https://github.com/kanishkamisra/footrulr/issues

katherinesimeon commented 5 years ago

More broadly language-related, I would love to develop something that converts text to International Phonetic Alphabet (IPA) transcriptions, based on the CMU pronunciation dictionary. (I don't know if this exists in R already.)
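
A minimal sketch of how such a conversion could work, in Python for illustration: the CMU dictionary lists each word as ARPABET phones with stress digits on the vowels, so the core step is a phone-by-phone lookup into IPA. The table below covers only a handful of phones, and treating unstressed `AH0` as schwa is a common convention that this sketch assumes; `ARPABET_TO_IPA` and `to_ipa` are hypothetical names.

```python
# Partial ARPABET-to-IPA table; a full version needs all 39 CMU phones.
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AH": "ʌ", "EH": "ɛ", "IY": "i", "OW": "oʊ",
    "HH": "h", "K": "k", "L": "l", "T": "t",
}


def to_ipa(phones):
    """Convert a list of CMU-style phones (e.g. ["K", "AE1", "T"]) to IPA."""
    out = []
    for phone in phones:
        if phone == "AH0":
            out.append("ə")  # assumption: unstressed AH is rendered as schwa
            continue
        base = phone.rstrip("012")  # drop the stress digit on vowels
        out.append(ARPABET_TO_IPA[base])  # raises KeyError for unmapped phones
    return "".join(out)


print(to_ipa("K AE1 T".split()))        # cat
print(to_ipa("HH AH0 L OW1".split()))   # hello
```

This sketch drops stress marks entirely; a fuller version could emit IPA stress symbols from the digits it currently strips.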

kanishkamisra commented 5 years ago

@katherinesimeon Wow! I actually wanted to build a package on that (I kind of did, privately) but then I found out this exists :(

Maybe we could work on augmenting it?

katherinesimeon commented 5 years ago

Ahh!! THANK YOU. This is what I'm looking for!! But YES I would love to work on augmenting this for sure 😃

kanishkamisra commented 5 years ago

@katherinesimeon we could look into this: https://phoible.org/ but I'm not sure what we could do apart from just adding data from the inventories

maurolepore commented 5 years ago

Somewhat related to https://github.com/chirunconf/chirunconf19/issues/18 and https://github.com/chirunconf/chirunconf19/issues/10 (particularly where @angela-li says "a Shiny app that helps you read and judge applications"): I was wondering if there is a tool to measure how supportive (or unsupportive) a person tends to be on Twitter towards the R community.

You know, some people are just a pleasure to "listen to" on Twitter, and others are negative most of the time. I used to focus on problems (negative) myself, and now I'm forcing myself to focus on solutions (positive). But we can only change what we can measure (paraphrasing something I read somewhere). It would help me, and maybe others (unconf organizers?), to know what image I'm projecting to the world when I send a tweet with the #rstats hashtag or to people who are heavy users of the #rstats hashtag.

chirunconf / chirunconf19

Natural Language Processing based Package - ideas #7