If any document in my set has no spans detected by any labelling source, I get an error when fitting the HMM model:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-50-8af531598451> in <module>
7 #docs = [d for d in docs if any([v for (k,v) in d.spans.items()])]
8 # And run the estimation
----> 9 docs = model.fit_and_aggregate(docs)
/backend/notebook/skweak/skweak/aggregation.py in fit_and_aggregate(self, docs, n_iter)
300 labelling functions."""
301
--> 302 self.fit(list(docs))
303 return list(self.pipe(docs))
304
/backend/notebook/skweak/skweak/aggregation.py in fit(self, docs, cutoff, n_iter, tol)
353
354 # And add the counts from majority voter
--> 355 self._add_mv_counts(docs)
356
357 # Finally, we postprocess the counts and get probabilities
/backend/notebook/skweak/skweak/aggregation.py in _add_mv_counts(self, docs)
530
531 # And aggregate the results
--> 532 agg_array = mv._aggregate(obs).values
533
534 # Update the start probabilities
/backend/notebook/skweak/skweak/aggregation.py in _aggregate(self, obs, coefficient)
229 return np.bincount(ar, weights=weights, minlength=nb_obs_to_count)
230
--> 231 label_votes = np.apply_along_axis(count_function, 1, obs.values)
232
233 # For token-level segmentation (with a special O label), the number of "O" predictions
<__array_function__ internals> in apply_along_axis(*args, **kwargs)
/usr/local/lib/python3.6/site-packages/numpy/lib/shape_base.py in apply_along_axis(func1d, axis, arr, *args, **kwargs)
377 except StopIteration:
378 raise ValueError('Cannot apply_along_axis when any iteration dimensions are 0')
--> 379 res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
380
381 # build a buffer for storing evaluations of func1d.
/backend/notebook/skweak/skweak/aggregation.py in count_function(x)
227 ar = x[x>=min_val]-min_val
228
--> 229 return np.bincount(ar, weights=weights, minlength=nb_obs_to_count)
230
231 label_votes = np.apply_along_axis(count_function, 1, obs.values)
<__array_function__ internals> in bincount(*args, **kwargs)
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
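For reference, the last frame can be reproduced with NumPy alone. This is only a minimal sketch of the failing call, not skweak's code: the array values and the helper name `try_bincount` are made up for illustration. It shows that `np.bincount` refuses a float64 input with the same message and accepts the same values once they are cast to integers. Why the observation array ends up as float64 for a document without spans is not something I have confirmed.

```python
import numpy as np


def try_bincount(values):
    """Return the bincount result, or the error message if NumPy rejects the input."""
    try:
        return np.bincount(values, minlength=3)
    except TypeError as err:
        return str(err)


# Hypothetical label indices stored as floats, mirroring the dtype in the traceback.
float_votes = np.array([0.0, 1.0, 1.0])

# The same values cast to an integer dtype.
int_votes = float_votes.astype(np.int64)

float_result = try_bincount(float_votes)  # error message string
int_result = try_bincount(int_votes)      # array of per-label counts

print(float_result)
print(int_result)
```

If the float dtype does turn out to be the cause, casting the array before the `np.bincount` call might be enough, but that is a guess on my side.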
My current workaround is to filter out these documents before fitting:
docs = [d for d in docs if any([v for (k,v) in d.spans.items()])]
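A slightly more explicit variant of the same filter keeps the skipped documents around so they can be handled separately later. This is a sketch under assumptions: the `SimpleNamespace` objects only stand in for spaCy `Doc` objects (all it relies on is a dict-like `.spans` attribute), and the source names `lf_person` and `lf_org` are invented for the example.

```python
from types import SimpleNamespace


def has_any_span(doc):
    """True if at least one labelling source produced at least one span."""
    return any(len(group) > 0 for group in doc.spans.values())


# Stand-ins for spaCy Doc objects; the span values are placeholder tuples.
docs = [
    SimpleNamespace(text="Alice works at Acme",
                    spans={"lf_person": [(0, 1)], "lf_org": [(3, 4)]}),
    SimpleNamespace(text="nothing detected here",
                    spans={"lf_person": [], "lf_org": []}),
    SimpleNamespace(text="no source ran at all", spans={}),
]

# Only these would be passed on to the aggregation step.
docs_with_spans = [d for d in docs if has_any_span(d)]

# Kept aside instead of being dropped silently.
docs_without_spans = [d for d in docs if not has_any_span(d)]

print(len(docs_with_spans), len(docs_without_spans))
```

Splitting instead of discarding makes it easy to see how many documents are affected.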
Thanks for this promising library. I worked on getting Snorkel ready for spaCy NER data labeling, but this one looks like a really good fit right away.
Version: 0.2.9
Platform: Linux-4.4.0-176-generic-x86_64-with-debian-9.6
Python version: 3.6.7
Pipelines: en_core_web_md (3.0.0), en_core_web_sm (3.0.0)