cadmiumcr / cadmium

Natural Language Processing (NLP) library for Crystal
https://cadmiumcr.com
MIT License

Text summarizers #18

Closed: rmarronnier closed this issue 5 years ago

rmarronnier commented 5 years ago

I've looked at sumy and implemented the Luhn method in Crystal using Cadmium.

I'm planning to implement more methods.

Would you be interested in a PR adding a text summarizer module to Cadmium?

watzon commented 5 years ago

I'd definitely be interested in seeing something like sumy implemented in Cadmium. Since there are multiple approaches we may want to implement, I'd follow what has been done with the Tokenizer and various other pieces of Cadmium: include Summarization (or something similar) as its own abstract class, with Luhn as a subclass. That way we have a nice base API to use for all text summarizers.
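
A minimal sketch of that layout, for illustration only. The names `Summarizer`, `LuhnSketch`, `score`, `sentences` and `summarize` are hypothetical and not taken from Cadmium. The scoring is a simplified Luhn-style measure (significant words squared over sentence length); the full Luhn method also clusters significant words and filters stop words, which this sketch skips.

```crystal
# Hypothetical sketch: none of these names come from Cadmium's source.
abstract class Summarizer
  # Each concrete summarizer only has to say how it scores a sentence.
  abstract def score(sentence : String, text : String) : Float64

  # Naive sentence splitter, a stand-in for a real sentence tokenizer.
  def sentences(text : String) : Array(String)
    text.split(/(?<=[.!?])\s+/).map(&.strip).reject(&.empty?)
  end

  # Keep the `limit` best-scoring sentences, restored to their original order.
  def summarize(text : String, limit : Int32 = 2) : String
    indexed = sentences(text).map_with_index { |s, i| {s, i} }
    ranked = indexed.sort_by { |pair| -score(pair[0], text) }
    ranked.first(limit).sort_by { |pair| pair[1] }.map { |pair| pair[0] }.join(" ")
  end
end

# Simplified Luhn-style scoring (assumption: no stop-word list, no clustering).
class LuhnSketch < Summarizer
  def initialize(@min_count : Int32 = 2)
  end

  # Lowercased ASCII words only, to keep the example short.
  private def words(text : String) : Array(String)
    text.downcase.scan(/[a-z]+/).map { |m| m[0] }
  end

  def score(sentence : String, text : String) : Float64
    counts = words(text).tally
    tokens = words(sentence)
    return 0.0 if tokens.empty?
    # A word is "significant" when it occurs at least @min_count times in the text.
    significant = tokens.count { |w| counts.fetch(w, 0) >= @min_count }
    (significant * significant) / tokens.size.to_f
  end
end

text = "Cats sleep a lot. Cats chase mice. Dogs bark."
puts LuhnSketch.new.summarize(text, 1)
```

With this shape, another method would only need its own `score`, while splitting and ranking stay in the base class.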

rmarronnier commented 5 years ago

Great! That's what I was planning :-D

Almost OT: while trying to implement another summarization method, I went through the class in tfidf.cr.

  1. This is the only place in Cadmium where @documents : Array(Document) and its related methods (add_document, build_document, ...) are declared and used. If we want the same document-handling logic elsewhere, maybe we should abstract it out. WDYT?
  2. Maybe add a @corpus variable and a getter that returns the merged documents along with the values computed over them (i.e. a term's frequency within one document will obviously differ from its frequency across the full corpus that contains the document).
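
The two points above could be sketched roughly as follows. Everything here is an assumption made for illustration: the `Corpus` class, the `Document` alias (a plain term-to-count hash) and the method names other than `add_document` are hypothetical, and the real types in tfidf.cr may look quite different.

```crystal
# Hypothetical sketch: names and types are illustrative, not Cadmium's actual code.
alias Document = Hash(String, Int32) # term => count within one document

class Corpus
  getter documents = [] of Document

  # Tokenize naively (lowercased ASCII words) and store per-document term counts.
  def add_document(text : String) : Document
    doc = text.downcase.scan(/[a-z]+/).map { |m| m[0] }.tally
    @documents << doc
    doc
  end

  # Term counts merged across every stored document.
  def term_counts : Document
    merged = Document.new(0)
    @documents.each do |doc|
      doc.each { |term, count| merged[term] += count }
    end
    merged
  end

  # Relative frequency of `term` inside a single document.
  def frequency(term : String, doc : Document) : Float64
    total = doc.values.sum
    total.zero? ? 0.0 : doc.fetch(term, 0) / total.to_f
  end

  # Relative frequency of `term` across the merged corpus.
  def frequency(term : String) : Float64
    frequency(term, term_counts)
  end
end

corpus = Corpus.new
first = corpus.add_document("cats chase mice")
second = corpus.add_document("dogs chase cats and cats")
puts corpus.frequency("cats", first)
puts corpus.frequency("cats", second)
puts corpus.frequency("cats")
```

A standalone class like this could be held by both the TF-IDF code and a summarizer, so the per-document and corpus-wide values come from one place.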

watzon commented 5 years ago

I'm all for abstracting out something that can be used elsewhere.

rmarronnier commented 5 years ago

Ok. I'll wrap my head around it.