TutteInstitute / vectorizers

Vectorizers for a range of different data types
BSD 3-Clause "New" or "Revised" License

NgramVectorizer(ngram_size=2) broken #111

Closed jc-healy closed 1 year ago

jc-healy commented 1 year ago

It looks like setting ngram_size to anything greater than 1 computes the column dictionary correctly but fails to fill any values into the _train_matrix. The current unit tests don't cover this case, so we'll need to update them once the error is fixed.

NgramVectorizer(ngram_size=1).fit_transform(data1)

<50x3210 sparse matrix of type '<class 'numpy.float32'>'
    with 7685 stored elements in Compressed Sparse Row format>

NgramVectorizer(ngram_size=2).fit_transform(data1)

<50x11659 sparse matrix of type '<class 'numpy.float32'>'
    with 0 stored elements in Compressed Sparse Row format>
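
The outputs above suggest the invariant an updated unit test could check: if the column dictionary is non-empty, the matrix must contain stored values. Below is a minimal pure-Python sketch of that check. It is not the library's implementation, and `ngram_counts` and `stored_elements` are hypothetical helper names introduced only for illustration.

```python
from collections import Counter


def ngram_counts(sequences, ngram_size=1):
    """Reference n-gram counter (hypothetical helper, not the vectorizers API).

    Returns (column_index, rows): column_index maps each n-gram tuple to a
    column number; rows holds one Counter of {column: count} per sequence.
    """
    column_index = {}
    rows = []
    for seq in sequences:
        row = Counter()
        # Slide a window of length ngram_size over the sequence.
        for i in range(len(seq) - ngram_size + 1):
            gram = tuple(seq[i:i + ngram_size])
            col = column_index.setdefault(gram, len(column_index))
            row[col] += 1
        rows.append(row)
    return column_index, rows


def stored_elements(rows):
    """Number of nonzero entries, analogous to a sparse matrix's nnz."""
    return sum(len(row) for row in rows)


data = [["a", "b", "a", "b"], ["b", "c"]]
for n in (1, 2):
    columns, rows = ngram_counts(data, ngram_size=n)
    # The check the bug violates: columns exist, so values must too.
    assert columns and stored_elements(rows) > 0
```

Against the real vectorizer, the analogous assertion would presumably look at the `nnz` attribute of the sparse matrix returned by `fit_transform`, since the output shown is a SciPy CSR matrix.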

jc-healy commented 1 year ago

This is addressed in PR #110 (https://github.com/TutteInstitute/vectorizers/pull/110). I'll close the issue once the PR is merged.