Closed mentoc3000 closed 10 months ago
Yes, indeed that was overlooked, thanks a lot for finding this out! The error occurs when performing the sparse cosine similarity calculation. The data is stored in a sparse matrix without any indexes, so the original indexes are lost at this step and a new index starting at 0 is assigned. I think it would be good to alter the code so that it returns the actual indexes rather than a new row number. I will have a look at where the best place would be to substitute the index back in.
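To illustrate the idea of substituting the index back in, here is a minimal sketch. It is not the package's actual code: `best_match_labels`, `master_vec` and `query_vec` are hypothetical names, and the rows are assumed to be L2-normalised so that the dot product equals the cosine similarity.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix


def best_match_labels(master: pd.DataFrame,
                      master_vec: csr_matrix,
                      query_vec: csr_matrix) -> np.ndarray:
    """Return the original index labels of the best match for each query row.

    Hypothetical helper: assumes both matrices hold L2-normalised rows.
    """
    # Sparse dot product; each entry is a cosine similarity score.
    scores = (query_vec @ master_vec.T).toarray()
    # argmax yields positional row numbers (0, 1, 2, ...), not index labels.
    positions = scores.argmax(axis=1)
    # Map the positions back onto the DataFrame's own index.
    return master.index.to_numpy()[positions]


# Toy data with a non-sequential index.
master = pd.DataFrame({"name": ["acme", "globex", "initech"]}, index=[10, 42, 7])
master_vec = csr_matrix(np.eye(3))
query_vec = csr_matrix(np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]))

labels = best_match_labels(master, master_vec, query_vec)
print(labels)
```

The key step is the last line of the helper: the sparse matrix only knows row positions, so the lookup through `master.index` is what restores the labels.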
First, thanks for a great package! I've found it very useful.
I've been using it with some data that has non-sequential indices, which causes the name matching to fail. See the example below. It looks like there's an implicit assumption that the indices of `df_companies_a` are sequential integers starting from 0. I haven't looked into this issue in detail, but might it be caused by flattening the data into `_vec` to speed up the ngram matching? If that's the case and there's no workaround for the indices, a heads up in the documentation would be helpful.
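In the meantime, one possible user-side workaround is to reset the index before matching and map the results back afterwards. This is only a sketch under assumptions: the column name `Company name` and the `matched_positions` values are made up for illustration and stand in for whatever positional row numbers the matcher returns.

```python
import pandas as pd

# Toy frame with a non-sequential index (column name is an assumption).
df_companies_a = pd.DataFrame(
    {"Company name": ["Acme Ltd", "Globex Corp", "Initech Inc"]},
    index=[101, 205, 309],
)

# Remember the original labels, then work on a 0-based copy.
original_index = df_companies_a.index
df_sequential = df_companies_a.reset_index(drop=True)

# Hypothetical matcher output: positional row numbers into df_sequential.
matched_positions = pd.Series([2, 0, 1])

# Translate the positions back to the original index labels.
matched_labels = original_index[matched_positions.to_numpy()]
print(list(matched_labels))
```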