CornellNLP / ConvoKit

ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.
MIT License

Sample notebook reports errors #145

Raychanan closed this issue 2 years ago

Raychanan commented 2 years ago

Hi, many thanks for developing this great package!

I'm trying to run the sample notebook "Predicting Conversations Gone Awry With Convokit" on Google Colab.

I made no modifications except for adding the following as the first cell:

!pip -q install convokit
!pip uninstall spacy -y
!pip install -U spacy==3.1.4
!python -m spacy download en_core_web_sm

However, an error occurred in the second cell from the bottom: TypeError: __init__() takes from 1 to 2 positional arguments but 3 were given. Would it be possible for you to point out how to correct the error? Many thanks!

Running prediction task for feature set politeness_strategies
Generating labels...
Computing paired features...
Using 38 features
Running leave-one-page-out prediction...
RemoteTraceback                           Traceback (most recent call last)
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.7/multiprocessing/", line 44, in mapstar
    return list(map(*args))
  File "<ipython-input-37-de914fca85cc>", line 11, in run_pred_single
    base_clf = Pipeline([("scaler", StandardScaler()), ("featselect", SelectPercentile(f_classif, 10)), ("logreg", LogisticRegression(solver='liblinear'))])
TypeError: __init__() takes from 1 to 2 positional arguments but 3 were given

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-38-9704095ec82e> in <module>()
      4 for combo in feature_combos:
      5     combo_names.append("+".join(combo).replace("_", " "))
----> 6     accuracy = run_pipeline(combo)
      7     accs.append(accuracy)
      8 results_df = pd.DataFrame({"Accuracy": accs}, index=combo_names)

<ipython-input-37-de914fca85cc> in run_pipeline(feature_set)
     97     y = labeled_pairs_df.first_convo_toxic.values
     98     print("Running leave-one-page-out prediction...")
---> 99     accuracy, coefs, scores, hyperparams, pvalue = run_pred(X, y, feature_names, labeled_pairs_df.page_id)
    100     print("Accuracy:", accuracy)
    101     print("p-value: %.4e" % pvalue)

<ipython-input-37-de914fca85cc> in run_pred(X, y, fnames, groups)
     34     with Pool(os.cpu_count()) as p:
---> 35         prediction_results =, X=X, y=y), splits)
     37     fselect_pvals_all = []

/usr/lib/python3.7/multiprocessing/ in map(self, func, iterable, chunksize)
    266         in a list that is returned.
    267         '''
--> 268         return self._map_async(func, iterable, mapstar, chunksize).get()
    270     def starmap(self, func, iterable, chunksize=None):

/usr/lib/python3.7/multiprocessing/ in get(self, timeout)
    655             return self._value
    656         else:
--> 657             raise self._value
    659     def _set(self, i, obj):

/usr/lib/python3.7/multiprocessing/ in worker()
    119         job, i, func, args, kwds = task
    120         try:
--> 121             result = (True, func(*args, **kwds))
    122         except Exception as e:
    123             if wrap_exception and func is not _helper_reraises_exception:

/usr/lib/python3.7/multiprocessing/ in mapstar()
     43 def mapstar(args):
---> 44     return list(map(*args))
     46 def starmapstar(args):

<ipython-input-37-de914fca85cc> in run_pred_single()
      9     y_train, y_test = y[train_idx], y[test_idx]
---> 11     base_clf = Pipeline([("scaler", StandardScaler()), ("featselect", SelectPercentile(f_classif, 10)), ("logreg", LogisticRegression(solver='liblinear'))])
     12     clf = GridSearchCV(base_clf, {"logreg__C": [10**i for i in range(-4,4)], "featselect__percentile": list(range(10, 110, 10))}, cv=3)

TypeError: __init__() takes from 1 to 2 positional arguments but 3 were given

jpwchang commented 2 years ago

Hi @Raychanan,

It appears that this is caused by a change to scikit-learn's SelectPercentile class in the 1.x scikit-learn release: the percentile argument can no longer be passed positionally and must be given as a keyword. I've committed an updated version of the notebook to deal with this change.

The change is small, so if you don't want to re-upload the notebook to Colab from scratch, you can simply change one line in your existing Colab notebook. Find the following line:

base_clf = Pipeline([("scaler", StandardScaler()), ("featselect", SelectPercentile(f_classif, 10)), ("logreg", LogisticRegression(solver='liblinear'))])

And change it to:

base_clf = Pipeline([("scaler", StandardScaler()), ("featselect", SelectPercentile(score_func=f_classif, percentile=10)), ("logreg", LogisticRegression(solver='liblinear'))])

That should resolve the error!
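
For anyone curious why the one-line change works, here is a minimal sketch of the mechanism using only the standard library. `StandInSelector` and `dummy_score` are hypothetical names made up for illustration; they are not part of scikit-learn or ConvoKit. The stand-in only assumes that the real constructor now accepts `percentile` as a keyword-only argument, which is what the error message and the fix above suggest.

```python
# Hypothetical stand-in (not scikit-learn code): a constructor whose second
# parameter is keyword-only, which is the assumed shape of SelectPercentile
# in scikit-learn 1.x.
class StandInSelector:
    def __init__(self, score_func=None, *, percentile=10):
        self.score_func = score_func
        self.percentile = percentile


def dummy_score(X, y):
    """Placeholder scoring function; only its identity matters here."""
    return None


def try_positional():
    """Return the TypeError message raised by the old positional call style."""
    try:
        # Old style: the second positional argument is rejected.
        StandInSelector(dummy_score, 10)
    except TypeError as err:
        return str(err)
    return None


positional_error = try_positional()

# New style: pass both arguments by keyword, as in the corrected notebook line.
fixed = StandInSelector(score_func=dummy_score, percentile=10)

print(positional_error)
print(fixed.percentile)
```

The positional call fails with a message of the same form as the one in the traceback (the exact wording varies between Python versions), while the keyword call succeeds. Passing arguments by keyword also works on older scikit-learn releases, so the corrected line should not need a version check.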

Raychanan commented 2 years ago

This helps a lot! Thanks so much!