CornellNLP / ConvoKit

ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.
https://convokit.cornell.edu/documentation/
MIT License

Decoupling language model computes from within Surprise #187

Open TushaarGVS opened 1 year ago

TushaarGVS commented 1 year ago

Description

This change decouples the language model computations from the Surprise transformer, enabling more flexibility in the choice of language model (including KenLM) and faster computation. It introduces a LanguageModel base class and implements two LanguageModel subclasses: ConvoKitLanguageModel, which wraps the existing cross-entropy function, and Kenlm, which uses the KenLM model. In addition, both the computation of surprise and the language model evaluation functions have been thread-parallelized for runtime efficiency (although the gain is modest, since Python's GIL limits thread parallelism; any observed speed-up can be attributed to file I/O and NumPy operations, and is to be explored further).
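As a rough illustration of the thread-parallelization described above (this is not the PR's actual implementation; the function and variable names here are invented for the sketch), per-utterance evaluations can be fanned out over a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def _toy_cross_entropy(target_counts, context_counts):
    # Toy stand-in for a per-utterance language model evaluation; the
    # real computation lives inside the LanguageModel subclasses.
    probs = context_counts / context_counts.sum()
    return float(-np.mean(np.log(probs[target_counts > 0])))

def parallel_scores(pairs, n_jobs=2):
    # Fan per-utterance evaluations out across threads; under CPython's
    # GIL, gains mostly come from NumPy releasing the lock and from I/O.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(lambda pair: _toy_cross_entropy(*pair), pairs))
```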

The demos/surprise_demo.ipynb and demos/tennis_demo.ipynb demo notebooks have been updated to reflect these changes. To preserve backward compatibility as far as possible, only the smooth argument has been removed from the Surprise class signature; the fit and transform methods still run with the same arguments (use kwargs to pass arguments for the added functionality). Use LanguageModel.config to view the configuration of an individual LanguageModel instance, and LanguageModel.type to view the type of the language model.

Other changes include code modularization, the addition of utility modules, and code formatting. The TQDM imports have been modified to render correctly in notebooks, and additional (disappearing) TQDM progress bars have been added to track the progress of surprise computations.


Motivation and context

This change was made to enable flexibility in language modeling. Currently, we use cross-entropy as a measure of how surprising the conversations are; however, cross-entropy need not be the only choice: one may instead use perplexity, or KenLM log-probability scores. Decoupling the surprise (language model) computations is therefore important, and this change attempts to achieve exactly that.


How has this been tested?

The code has been tested by comparing this implementation against the earlier one: with the same preprocessing, the same results were observed on the tennis dataset (working at the utterance level). Furthermore, the cross-entropy refactoring has been tested using the old ConvoKit surprise demo, and (allowing for some randomness in choosing the context and target) the outputs are about the same, i.e., nearly the same ordering of the most and least surprising involvements. In the tennis demo, using cross-entropy, the values remained identical.


Usage

Create a new LanguageModel class object as follows:

convokit_lm = convokit.ConvoKitLanguageModel(smooth=True, n_jobs=1)  # default language model
kenlm = convokit.Kenlm(
    kenlm_path='/Users/tushaar/kenlm',
    model_filename='kenlm_surprise',
    trained_model_filepath='/Users/tushaar/Desktop/kenlm_models/surprise.arpa',
    is_persistent=True,
)

Use the created LanguageModel when transforming as follows:

surprise.transform(corpus, language_model=kenlm, eval_type='cross_entropy')

Pending

Documentation and final testing (estimated: December 15, 2022).

jpwchang commented 1 year ago

Thanks for the contribution! There are a few changes needed to get everything working with our github infrastructure:

  • The KenLM code introduces a new dependency on (presumably) the kenlm python package. This dependency needs to be added to setup.py so that it gets automatically installed alongside convokit; right now the automated tests cannot run since this dependency is missing.
  • Please automatically format the new code with the Black formatter to maintain consistency with the style guidelines. Instructions can be found in the contribution guidelines https://github.com/CornellNLP/ConvoKit/blob/master/CONTRIBUTING.md#contributing-code

In addition to the above, I have a few further minor suggestions related to style and consistency:

  • In the abstract class specification for LanguageModel, the "main" function is called cross_entropy. But this is misleading since, as you note, the whole point of abstracting the language model is to allow the use of metrics other than cross entropy. This leads to the extremely confusing situation where the KenLM subclass implements cross_entropy, even though the implementation is (presumably) not actually computing cross entropy. We should therefore use a more generic name such as "score".
  • I'm not a big fan of the name ConvoKitLanguageModel since it's a bit too nonspecific. In general, we should try to name the classes such that they are self-descriptive; i.e., a user could make an educated guess at what the class does by looking at the name. Since the purpose of the class is to implement the basic cross entropy score that used to be built into Surprise, I might suggest renaming the class as CrossEntropyLanguageModel (after having resolved the other issue relating to the use of the term "cross entropy" as mentioned above).
  • This pull request adds two new files file_utils.py and utils.py; I wonder if we should just merge their contents into the top-level util.py that already exists in ConvoKit. This would reduce redundancy, and having taken a look at the new files I can easily imagine their functions being useful in places outside of Surprise.
  • I've noticed that the naming of private methods in the new code seems inconsistent: some use double underscores and others use a single underscore. We should standardize on a single naming convention; single underscore is used elsewhere in the code, so we should stick with that.

Thanks again for the contribution, hopefully these asks aren't too much. Let me know if anything is unclear!

cristiandnm commented 1 year ago

Thanks Jonathan, very good points. One note: I would like to not force KenLM as a requirement for convokit. I think of LM modules like forecaster modules; so the same way we don't force tensorflow for convokit, we should not force KenLM. This should be documented properly (as I believe we do for the CRAFT forecaster module).


TushaarGVS commented 1 year ago

Thanks for the extensive review, Jonathan. I will address the comments and post my questions before I push the final version of my changes to the PR.

TushaarGVS commented 1 year ago

In the abstract class specification for LanguageModel, the "main" function is called cross_entropy. But this is misleading since, as you note, the whole point of abstracting the language model is to allow the use of metrics other than cross entropy. This leads to the extremely confusing situation where the KenLM subclass implements cross_entropy, even though the implementation is (presumably) not actually computing cross entropy. We should therefore use a more generic name such as "score"

Jonathan, the "main" function is evaluate(..., eval_type=...), which takes eval_type as an argument: the user specifies the name of their evaluation function, and that function is then looked up via eval_fn = getattr(self, eval_type) inside evaluate(...). The cross_entropy function shown in the LanguageModel class is a placeholder and need not be implemented by the user if it is not used (hence it only raises a RuntimeError, which is triggered only upon usage).
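A minimal sketch of this dispatch pattern (with illustrative names; ToyModel and unigram_overlap are invented for the example and are not part of the PR):

```python
class LanguageModel:
    # Sketch of the dispatch described above: evaluate() looks up the
    # evaluation function by name, so subclasses only implement the
    # metrics they actually support.
    def evaluate(self, target, context, eval_type="cross_entropy", **kwargs):
        eval_fn = getattr(self, eval_type)  # e.g. self.cross_entropy
        return eval_fn(target, context, **kwargs)

    def cross_entropy(self, target, context):
        # Placeholder, mirroring the base class: errors only upon usage.
        raise RuntimeError("cross_entropy is not implemented in this model")

class ToyModel(LanguageModel):
    def unigram_overlap(self, target, context):
        # A made-up metric, just to demonstrate eval_type dispatch.
        return len(set(target) & set(context)) / max(len(set(target)), 1)
```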

Also, KenLM returns a log-likelihood score, which is essentially the cross-entropy that we compute; hence, both our model and KenLM compute cross-entropy.
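For concreteness, a KenLM-style total log10 score can be converted to a per-token cross-entropy along these lines (a sketch only; the exact convention used in the PR may differ):

```python
import math

def kenlm_score_to_cross_entropy(log10_score, n_tokens):
    # KenLM reports a total log10 probability for the sequence; dividing
    # by the token count and flipping the sign yields per-token
    # cross-entropy (converted here to base e, matching -mean(log p)).
    return -(log10_score * math.log(10)) / n_tokens
```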

This pull request adds two new files file_utils.py and utils.py; I wonder if we should just merge their contents into the top-level util.py that already exists in ConvoKit. This would reduce redundancy, and having taken a look at the new files I can easily imagine their functions being useful in places outside of Surprise

This is a great idea; I will merge the new utility functions into the main utils file.

I'm not a big fan of the name ConvoKitLanguageModel since it's a bit too nonspecific. In general, we should try to name the classes such that they are self-descriptive; i.e., a user could make an educated guess at what the class does by looking at the name. Since the purpose of the class is to implement the basic cross entropy score that used to be built in to Surprise, I might suggest renaming the class as CrossEntropyLanguageModel (after having resolved the other issue relating to the use of the term "cross entropy" as mentioned above

It was initially called CrossEntropy, but based on my answer above, CrossEntropyLanguageModel seems too restrictive a name. I think we should come up with a better one.

I've noticed that the naming of private methods in the new code seems inconsistent, some use double underscores and others use a single underscore. We should standardize on a single naming convention; single underscore is used elsewhere in the code so we should stick with that

I have used _ to indicate semi-private variables (those that aren't supposed to be accessed outside the class) and __ for truly private variables (those that can't be accessed outside the class even if you wanted to, at least not without _ClassName__variable_name). I would like to leave that as is, since I think it is good programming practice, but if the inconsistency is a big concern, let me know and I will change it accordingly.
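The distinction can be seen in a small self-contained example (Example is an illustrative class, not part of the PR):

```python
class Example:
    def __init__(self):
        self._semi_private = 1  # convention only: "internal use"
        self.__private = 2      # name-mangled by Python to _Example__private
```

A single underscore is purely a naming convention, while a double underscore triggers Python's name mangling, so the attribute is only reachable from outside as _Example__private.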

TushaarGVS commented 1 year ago

Thanks Jonathan, very good points. One note: I would like to not force KenLM as a requirement for convokit. I think of LM modules like forecaster modules; so the same way we don't force tensorflow for convokit, we should not force KenLM. This should be documented properly (as I believe we do for the CRAFT forecaster module).

I have addressed this, Cristian. The code installs kenlm only when convokit is installed with pip install convokit[kenlm].
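For illustration, such an optional dependency can be declared in setup.py via extras_require (a sketch only; the fields shown here are illustrative and not ConvoKit's actual setup.py):

```python
from setuptools import setup

setup(
    name="convokit",
    # Core dependencies remain unconditional.
    install_requires=["numpy"],
    # `pip install convokit[kenlm]` additionally pulls in kenlm;
    # a plain `pip install convokit` does not.
    extras_require={"kenlm": ["kenlm"]},
)
```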

TushaarGVS commented 1 year ago

I have finished documenting all the changes; all the classes and functions now have extensive documentation. Also, following Jonathan's comments, I have moved the utility functions into the utils file. Following your comment, I have ensured that kenlm is not a direct dependency of convokit; instead, it can be installed with pip install convokit[kenlm].

Additionally, I have implemented a perplexity() function for both language model classes (ConvoKitLanguageModel and Kenlm), in addition to cross_entropy(). I have also run both demos with the latest changes to ensure that no breaking changes were introduced.
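For reference, perplexity and cross-entropy are related by exponentiation; a sketch, assuming cross-entropy measured in nats (the base actually used in the PR is not specified here):

```python
import math

def perplexity_from_cross_entropy(cross_entropy):
    # With cross-entropy in nats (base e), perplexity is its exponential;
    # a base-2 cross-entropy would instead use 2 ** cross_entropy.
    return math.exp(cross_entropy)
```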