GlobalMaksimum / sadedegel

A General Purpose NLP library for Turkish
http://sadedegel.ai
MIT License
93 stars 13 forks

Sklearn compatible transformer and parallelization #182

Closed dafajon closed 3 years ago

dafajon commented 3 years ago

When working with documents in a dataframe, feature extraction requires a sklearn transformer so that the process can be part of a pipeline and serialized along with it. The issues so far:

husnusensoy commented 3 years ago

I see two different questions here. Please split them into two separate issues so we can proceed.

husnusensoy commented 3 years ago

Check the develop branch for a drop-in replacement:

from sadedegel.dataset import load_raw_corpus
from sadedegel.extension.sklearn import TfidfVectorizer

tra = TfidfVectorizer()

X = tra.transform(load_raw_corpus())

dafajon commented 3 years ago

I checked it on a List and a pd.DataFrame, and it works fine. I will also use it in a pipeline and provide feedback. I had a FunctionTransformer implementation in the sentiment work; I will update that accordingly once this is in the new release.
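
For anyone following along, here is a rough sketch of the transformer contract this thread is about: `fit` returns the object itself, `transform` accepts any iterable of documents, and the object survives a pickle round trip so it can be saved with the rest of a pipeline. `WordCountTransformer` is a hypothetical name used only for illustration; it is not part of sadedegel, and it is kept stdlib-only here. A real pipeline step would normally subclass sklearn's `BaseEstimator` and `TransformerMixin` instead.

```python
import pickle
from collections import Counter


class WordCountTransformer:
    """Hypothetical stateless transformer (illustration only, not sadedegel API)."""

    def __init__(self, lowercase=True):
        self.lowercase = lowercase

    def fit(self, X, y=None):
        # Nothing is learned from the data; return self so calls can be chained.
        return self

    def transform(self, X):
        # Accept any iterable of strings, e.g. a list or a dataframe column.
        rows = []
        for doc in X:
            text = doc.lower() if self.lowercase else doc
            rows.append(dict(Counter(text.split())))
        return rows

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


docs = ["Merhaba dünya", "merhaba sadedegel merhaba"]
tra = WordCountTransformer()
features = tra.fit_transform(docs)

# Serialize and restore the transformer, then apply the restored copy.
restored = pickle.loads(pickle.dumps(tra))
restored_features = restored.transform(docs)
```

Because the sketch keeps no fitted state, calling `transform` without `fit` also works, which mirrors how the snippet earlier in the thread calls `transform` directly.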

husnusensoy commented 3 years ago

Can you create a separate issue for parallel processing? This issue is already closed by a commit currently available on the develop branch.