Yomguithereal / talisman

Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
https://yomguithereal.github.io/talisman/
MIT License
704 stars, 47 forks

Parts of Speech tagging? #147

Open giorgio79 opened 6 years ago

giorgio79 commented 6 years ago

Would love to do POS tagging with this lib. Maybe integrate with others? https://github.com/FinNLP/en-pos

Yomguithereal commented 6 years ago

Hello @giorgio79. There is an experimental version of the averaged perceptron used by spaCy here. It's undocumented but it should work. On a side note, I am currently thinking of refocusing this library on fuzzy matching/clustering and dropping hard NLP tasks because I don't have much time. But I'd love to hear why you'd prefer to use this lib for POS tagging rather than the one you mention here.

giorgio79 commented 6 years ago

Thx @Yomguithereal! JS NLP libs are ripening super fast. I am currently evaluating the options myself, such as

Joining forces would be a great way forward to avoid duplicated efforts. Have you thought of combining with some of the others? Otherwise, doing spaCy in JavaScript sounds fantastic but, as you say, a massive undertaking. At the moment, Natural seems to do a lot of what I need already, and I just thought I'd give a quick go to others like Talisman.

Yomguithereal commented 6 years ago

As much as I'd love to add my stone to js's hard nlp libraries I feel that my edge is much more fuzzy matching/clustering unfortunately. Google Refine-like stuff for instance & custom search engines.
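For readers unfamiliar with the "Google Refine-like" clustering mentioned above, here is a minimal sketch of key-collision clustering with a fingerprint-style keyer. The function names `fingerprint` and `keyCollisionClusters` are hypothetical and written for illustration only; this is not talisman's API, and the normalization is a simplified, ASCII-only assumption.

```javascript
// Illustrative sketch only: helper names are hypothetical, not talisman's API.

// Build a normalized "fingerprint" key: lowercase, replace anything that is
// not an ASCII letter, digit or whitespace with a space, then deduplicate
// and sort the tokens. (ASCII-only is a simplifying assumption.)
function fingerprint(value) {
  const tokens = value
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ')
    .split(/\s+/)
    .filter(Boolean);
  return Array.from(new Set(tokens)).sort().join(' ');
}

// Group values whose fingerprints collide; keep only groups of two or more.
function keyCollisionClusters(values) {
  const buckets = new Map();
  for (const value of values) {
    const key = fingerprint(value);
    if (!buckets.has(key)) buckets.set(key, []);
    buckets.get(key).push(value);
  }
  return Array.from(buckets.values()).filter(group => group.length > 1);
}

const clusters = keyCollisionClusters([
  'Smith, John',
  'john smith',
  'John  Smith.',
  'Jane Doe'
]);
console.log(clusters);
```

The three spellings of the same name share one key and end up in a single cluster, while the unrelated entry is left out. A distance metric could then be applied within each bucket to catch near-misses that the keyer alone does not merge.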

Yomguithereal commented 6 years ago

Basically, my strategy for the future will probably be to drop POS tagging / machine learning classifier stuff and focus on fuzzy clustering, distance metrics, keyers, phonetic algorithms, stemmers, and tokenizers. But I'd be willing to help other libraries scavenge whatever NLP-related pieces they could use from me, such as the POS tagger and the sentence tokenizer (punkt notably).

giorgio79 commented 6 years ago

Yeah, avoid reinventing the wheel where possible. E.g., NaturalNode has tons of tokenizers already: https://github.com/NaturalNode/natural#tokenizers