NorskRegnesentral / skweak

skweak: A software toolkit for weak supervision applied to NLP tasks
MIT License

Error on documents without any spans #3

Closed chssch closed 3 years ago

chssch commented 3 years ago

Version: 0.2.9 Platform: Linux-4.4.0-176-generic-x86_64-with-debian-9.6 Python version: 3.6.7
Pipelines: en_core_web_md (3.0.0), en_core_web_sm (3.0.0)

If my document set contains any document for which no source detected a span, I get an error during the HMM model creation:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-50-8af531598451> in <module>
      7 #docs = [d for d in docs if any([v for (k,v) in d.spans.items()])]
      8 # And run the estimation
----> 9 docs = model.fit_and_aggregate(docs)

/backend/notebook/skweak/skweak/aggregation.py in fit_and_aggregate(self, docs, n_iter)
    300         labelling functions."""
    301 
--> 302         self.fit(list(docs))
    303         return list(self.pipe(docs))
    304 

/backend/notebook/skweak/skweak/aggregation.py in fit(self, docs, cutoff, n_iter, tol)
    353 
    354         # And add the counts from majority voter
--> 355         self._add_mv_counts(docs)
    356 
    357         # Finally, we postprocess the counts and get probabilities

/backend/notebook/skweak/skweak/aggregation.py in _add_mv_counts(self, docs)
    530 
    531             # And aggregate the results
--> 532             agg_array = mv._aggregate(obs).values
    533 
    534             # Update the start probabilities

/backend/notebook/skweak/skweak/aggregation.py in _aggregate(self, obs, coefficient)
    229             return np.bincount(ar, weights=weights, minlength=nb_obs_to_count)
    230 
--> 231         label_votes = np.apply_along_axis(count_function, 1, obs.values)
    232 
    233         # For token-level segmentation (with a special O label), the number of "O" predictions

<__array_function__ internals> in apply_along_axis(*args, **kwargs)

/usr/local/lib/python3.6/site-packages/numpy/lib/shape_base.py in apply_along_axis(func1d, axis, arr, *args, **kwargs)
    377     except StopIteration:
    378         raise ValueError('Cannot apply_along_axis when any iteration dimensions are 0')
--> 379     res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
    380 
    381     # build a buffer for storing evaluations of func1d.

/backend/notebook/skweak/skweak/aggregation.py in count_function(x)
    227             ar = x[x>=min_val]-min_val
    228 
--> 229             return np.bincount(ar, weights=weights, minlength=nb_obs_to_count)
    230 
    231         label_votes = np.apply_along_axis(count_function, 1, obs.values)

<__array_function__ internals> in bincount(*args, **kwargs)

TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
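The final TypeError is raised by NumPy itself: np.bincount only accepts integer arrays, and for a document with no spans the vote array apparently ends up with a float dtype, which NumPy refuses to cast "safely" to int64. A minimal sketch of that NumPy behaviour (independent of skweak):

```python
import numpy as np

# bincount counts occurrences of non-negative integers
print(np.bincount(np.array([0, 1, 1, 2])))  # [1 2 1]

# A float-typed array triggers the same TypeError seen in the traceback,
# regardless of whether the values happen to be whole numbers
try:
    np.bincount(np.array([0.0, 1.0]))
except TypeError as e:
    print(e)
```

This suggests the fix is to make sure the observation array handed to the majority voter keeps an integer dtype even when a document contributes no votes.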

My current workaround is to filter out these documents like this:

docs = [d for d in docs if any([v for (k,v) in d.spans.items()])]
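Since d.spans maps each source name to a list of spans, the filter just keeps documents where at least one source produced a non-empty list. A self-contained sketch of that logic (FakeDoc is a stand-in for spaCy's Doc, since only the .spans attribute is involved):

```python
class FakeDoc:
    """Stand-in for a spaCy Doc; only the .spans dict matters here."""
    def __init__(self, spans):
        self.spans = spans  # source name -> list of detected spans

docs = [
    FakeDoc({"lf1": ["span_a"], "lf2": []}),  # kept: lf1 found a span
    FakeDoc({"lf1": [], "lf2": []}),          # dropped: no source found anything
]

# Equivalent to the workaround above, written with dict.values()
docs = [d for d in docs if any(d.spans.values())]
print(len(docs))  # 1
```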

Thanks for this promising library. I worked on getting Snorkel ready for spaCy NER data labeling, but this one looks like a really good fit out of the box.

plison commented 3 years ago

Thanks, I hadn't thought of testing this. It should now be fixed.