TeamHG-Memex / sklearn-crfsuite

scikit-learn inspired API for CRFsuite

Sequence labelling issue: The numbers of items and labels differ... #64

Closed chriswales95 closed 3 years ago

chriswales95 commented 3 years ago

Hi, I'm trying to use sklearn-crfsuite for sequence labelling.

When I run crf.fit(train_data, train_targets) on my data, I get the stack trace below:

Traceback (most recent call last):
  File ".../argument_segmenter.py", line 49, in train
    crf.fit(train_data, train_targets)
  File "/usr/local/lib/python3.9/site-packages/sklearn_crfsuite/estimator.py", line 314, in fit
    trainer.append(xseq, yseq)
  File "pycrfsuite/_pycrfsuite.pyx", line 312, in pycrfsuite._pycrfsuite.BaseTrainer.append
ValueError: The numbers of items and labels differ: |x| = 40, |y| = 38

I noticed in https://github.com/TeamHG-Memex/sklearn-crfsuite/issues/20 that someone suggests using a custom scorer, but I don't seem to get past the fitting stage.

Any advice would be appreciated.

My code looks like this:

import logging

import sklearn_crfsuite

# load_data, sent2features and sent2labels are defined elsewhere in my script
train_data, test_data, train_targets, test_targets = load_data()

# Convert each sentence into per-token feature dicts and label strings
train_data = [sent2features(s) for s in train_data]
train_targets = [sent2labels(s) for s in train_targets]

test_data = [sent2features(s) for s in test_data]
test_targets = [sent2labels(s) for s in test_targets]

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)

try:
    crf.fit(train_data, train_targets)
except Exception as e:
    logging.error(e)

chriswales95 commented 3 years ago

I've solved this, so I'm closing the issue. It was to do with my data processing. My own mistake :)

Leaving this here in case it helps others. Review your data!
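
For anyone reviewing their data, one way to locate the offending sequences before calling fit is to compare lengths pair by pair. This is only a minimal sketch: find_length_mismatches is a hypothetical helper, not part of sklearn-crfsuite, and the toy X and y below are made up for illustration.

```python
def find_length_mismatches(X, y):
    """Return (index, len(xseq), len(yseq)) for every pair whose lengths differ.

    Hypothetical helper for illustration; not part of sklearn-crfsuite.
    """
    if len(X) != len(y):
        raise ValueError(f"Different number of sequences: {len(X)} vs {len(y)}")
    return [
        (i, len(xseq), len(yseq))
        for i, (xseq, yseq) in enumerate(zip(X, y))
        if len(xseq) != len(yseq)
    ]


# Toy data: the second sequence has 3 feature dicts but only 2 labels.
X = [[{"w": "a"}, {"w": "b"}], [{"w": "c"}, {"w": "d"}, {"w": "e"}]]
y = [["O", "O"], ["O", "B"]]

print(find_length_mismatches(X, y))  # [(1, 3, 2)]
```

An empty result means every feature sequence has exactly one label per item, which is the condition the ValueError in the traceback complains about.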

coolsubbu commented 2 years ago

Hi @chriswales95,

I'm facing the same issue. What was the error in your data processing, and how did you solve it?

Thanks,
Yogesh

chriswales95 commented 2 years ago

Hi Yogesh,

I can't remember exactly how I fixed it, but I think the issue was the shape of the data I was passing in.

If you're still having problems, I can double-check how I was doing it beforehand and try to give some suggestions.

Let me know!

Chris

remo-help commented 1 year ago

@coolsubbu I'm commenting here for future people who run into this issue since it was not explained here.

This happens if you pass CRFsuite data as a 1-D array or a flat list of dicts.

It expects a list of lists for both the X input data and the y labels. See the docstring in the source code:

```
def fit(self, X, y, X_dev=None, y_dev=None):
    """
    Train a model.

    Parameters
    ----------
    X : list of lists of dicts
        Feature dicts for several documents (in a python-crfsuite format).

    y : list of lists of strings
        Labels for several documents.
```
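
To make the expected shape concrete, here is a rough sketch of a check that tells a flat, single-sequence input apart from the nested form described in the docstring. The looks_like_crf_input helper and the toy features are assumptions made up for this example; they are not part of sklearn-crfsuite.

```python
def looks_like_crf_input(X, y):
    """True if X is a list of lists of dicts and y is a list of lists of strings.

    Hypothetical helper for illustration; not part of sklearn-crfsuite.
    """
    x_ok = all(
        isinstance(xseq, (list, tuple)) and all(isinstance(item, dict) for item in xseq)
        for xseq in X
    )
    y_ok = all(
        isinstance(yseq, (list, tuple)) and all(isinstance(label, str) for label in yseq)
        for yseq in y
    )
    return x_ok and y_ok


# One sentence of three tokens, passed flat: a list of dicts and a list of strings.
flat_X = [{"word": "the"}, {"word": "cat"}, {"word": "sat"}]
flat_y = ["O", "B", "I"]
print(looks_like_crf_input(flat_X, flat_y))      # False

# Wrapping the single sentence in an outer list gives the nested shape.
print(looks_like_crf_input([flat_X], [flat_y]))  # True
```

If you only have one sequence, wrapping it as [features] and [labels] should give fit the nesting it documents.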