NirantK / Hinglish

Hinglish Text Classification
MIT License

Make submissions to Codalab with NB-SVM and LR-TF-IDF #4

Closed NirantK closed 4 years ago

NirantK commented 4 years ago

Hmm, could you please raise a separate PR for this issue, split out from #10?

It'd have been easier to discuss #10 if it were only about the URL-related things, no?

Anyway, on the code in #10 - the cleanlab piece throws this error:

NB-SVM Regular Train Labels confident learning (noise matrix given), 

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Users\nirantk\Miniconda3\envs\toys\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\nirantk\Miniconda3\envs\toys\lib\multiprocessing\pool.py", line 44, in mapstar
    return list(map(*args))
  File "C:\Users\nirantk\Miniconda3\envs\toys\lib\site-packages\cleanlab\pruning.py", line 109, in _prune_by_count
    noise_mask = np.zeros(len(psx), dtype=bool)
NameError: name 'psx' is not defined
"""

The above exception was the direct cause of the following exception:

NameError                                 Traceback (most recent call last)
<ipython-input-17-892a0b328c2f> in <module>
      2 NBSVM_Confident = NbSvmClassifier(C=4, dual=False, n_jobs=-1)
      3 rp = LearningWithNoisyLabels(clf = NBSVM_Confident)
----> 4 _ = rp.fit(tfidf_train, y_train, noise_matrix=noise_matrix)
      5 pred = rp.predict(tfidf_test)
      6 print("test accuracy:", round(accuracy_score(pred, y_test),5))

~\Miniconda3\envs\toys\lib\site-packages\cleanlab\classification.py in fit(self, X, s, psx, thresholds, noise_matrix, inverse_noise_matrix)
    295             inverse_noise_matrix = self.inverse_noise_matrix,
    296             confident_joint = self.confident_joint,
--> 297             prune_method = self.prune_method,
    298         ) 
    299 

~\Miniconda3\envs\toys\lib\site-packages\cleanlab\pruning.py in get_noise_indices(s, psx, inverse_noise_matrix, confident_joint, frac_noise, num_to_remove_per_class, prune_method, sorted_index_method, multi_label, n_jobs, verbose)
    334                     )
    335                 else:
--> 336                     noise_masks_per_class = p.map(_prune_by_count, range(K))
    337         else:  # n_jobs = 1, so no parallelization
    338             noise_masks_per_class = [_prune_by_count(k) for k in range(K)]

~\Miniconda3\envs\toys\lib\multiprocessing\pool.py in map(self, func, iterable, chunksize)
    266         in a list that is returned.
    267         '''
--> 268         return self._map_async(func, iterable, mapstar, chunksize).get()
    269 
    270     def starmap(self, func, iterable, chunksize=None):

~\Miniconda3\envs\toys\lib\multiprocessing\pool.py in get(self, timeout)
    655             return self._value
    656         else:
--> 657             raise self._value
    658 
    659     def _set(self, i, obj):

NameError: name 'psx' is not defined
meghanabhange commented 4 years ago

I'm not able to reproduce the error :(

NirantK commented 4 years ago

This seems to be a Windows-specific issue. Let me raise it with cleanlab itself.

NirantK commented 4 years ago

Make Experiment Reproducible

On your machine: test accuracy: 0.60329
On my machine: test accuracy: 0.60071

This is not good 😢

How can we make this number reliable?

  1. Refer #6

  2. Please stop using NLTK. Why are we using so many tokenizers? They slow down the pipeline. There is a regex in the CleanTwitter class -- why? Shouldn't that regex be in the Cleaning step?

  3. When I said retain mentions and hashtags: you could also convert #yahoo to hashtag yahoo for language-modelling purposes.

Same for mentions: if the tweet starts with RT @narendramodi @nirant, remove both mentions. If a mention is in the middle of the tweet, retain it, e.g. convert Hello @meghanabhange to Hello mention meghanabhange.
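The hashtag and mention rules above could be sketched like this (the regexes and the `clean_tweet` name are my own illustration, not the project's actual code):

```python
import re

def clean_tweet(text: str) -> str:
    # Drop a retweet prefix and the mentions attached to it.
    text = re.sub(r"^RT(\s+@\w+)+\s*", "", text)
    # Keep hashtags as plain words: "#yahoo" -> "hashtag yahoo".
    text = re.sub(r"#(\w+)", r"hashtag \1", text)
    # Keep in-text mentions, marked explicitly: "@user" -> "mention user".
    text = re.sub(r"@(\w+)", r"mention \1", text)
    return text.strip()

print(clean_tweet("RT @narendramodi @nirant good morning"))
# good morning
print(clean_tweet("Hello @meghanabhange"))
# Hello mention meghanabhange
```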

meghanabhange commented 4 years ago

Yes, this makes sense. I'll get rid of NLTK and make the pipeline faster.