alexandrainst / danlp

DaNLP is a repository for Natural Language Processing resources for the Danish Language.
BSD 3-Clause "New" or "Revised" License
195 stars, 33 forks

Scikit learn pipeline? #5

Closed: hammurabi-ds closed this issue 5 years ago

hammurabi-ds commented 5 years ago

Can I apply your NLP preprocessors sequentially in a scikit-learn Pipeline? If not, I think supporting that would be an advantage for the package.

hvingelby commented 5 years ago

Hi @Hamurabbi, we have not looked at supporting NLP preprocessing in scikit-learn. Could you elaborate on how you would like to use Danish NLP models in scikit-learn?

hammurabi-ds commented 5 years ago

Hi. Here is an example of what such a pipeline might look like (I use a similar package that is not open source, but this is roughly how it looks). It should be fairly simple to build the wrappers and make them compatible with the scikit-learn Pipeline.

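        from sklearn.pipeline import Pipeline

        # Preprocessor and the step classes below come from my own
        # (non-open-source) package; they are not part of scikit-learn.
        prep = Preprocessor('english')
        pip = Pipeline([
            ('word_token', WordTokenizer(prep)),
            ('punct', PunctuationRemover(prep)),
            ('pos', POSTagger(prep)),
            ('lemma', Lemmatizer(prep)),
            ('stopword', StopwordRemover(prep)),
        ])
        # RAW_TEXT is a placeholder for the raw input documents.
        results = pip.fit_transform(RAW_TEXT)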

This pipeline object may now be saved and reused.
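
To make the idea a bit more concrete, here is a minimal sketch of how such wrappers could be written and how the resulting pipeline could be saved and reloaded. Everything in it is an assumption for illustration: the class names `StatelessTextStep`, `WordTokenizer`, `PunctuationRemover` and `StopwordRemover`, the tiny stopword tuple, and the whitespace tokenization are hypothetical stand-ins, not DaNLP's API and not the code from my package. The only real library pieces are `BaseEstimator`, `TransformerMixin` and `Pipeline` from scikit-learn, plus `pickle` from the standard library.

```python
import pickle
import string

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class StatelessTextStep(TransformerMixin, BaseEstimator):
    """Hypothetical base class for steps that learn nothing during fit."""

    def fit(self, X, y=None):
        # Nothing to learn; returning self is what Pipeline expects.
        return self

    def __sklearn_is_fitted__(self):
        # Tell scikit-learn's fitted-check that this step is always usable.
        return True


class WordTokenizer(StatelessTextStep):
    """Hypothetical tokenizer: splits each document on whitespace."""

    def transform(self, X):
        return [doc.split() for doc in X]


class PunctuationRemover(StatelessTextStep):
    """Hypothetical step: strips ASCII punctuation, drops empty tokens."""

    def transform(self, X):
        table = str.maketrans("", "", string.punctuation)
        stripped = [[tok.translate(table) for tok in doc] for doc in X]
        return [[tok for tok in doc if tok] for doc in stripped]


class StopwordRemover(StatelessTextStep):
    """Hypothetical step: removes tokens found in a small stopword list."""

    def __init__(self, stopwords=("og", "i", "er")):
        # Illustrative list only, not a real Danish stopword resource.
        self.stopwords = stopwords

    def transform(self, X):
        stop = set(self.stopwords)
        return [[tok for tok in doc if tok.lower() not in stop] for doc in X]


pipe = Pipeline([
    ("word_token", WordTokenizer()),
    ("punct", PunctuationRemover()),
    ("stopword", StopwordRemover()),
])

raw_text = ["Hej, verden!", "Jeg er glad i dag."]
results = pipe.fit_transform(raw_text)

# Round-trip through pickle to mimic saving the pipeline and loading it later.
restored = pickle.loads(pickle.dumps(pipe))
reloaded_results = restored.transform(raw_text)
```

Each step takes a list of documents and returns a list of token lists, so the steps can be chained in any order after the tokenizer. A real POS tagger or lemmatizer would follow the same `fit`/`transform` shape, loading its model in `__init__` or `fit`; whether a given model pickles cleanly would have to be checked per model.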

hvingelby commented 5 years ago

Well, thank you for the suggestion and clarification. We will look into it :)