Implement Term Frequency Inverse Document Frequency (TF IDF)

yooper commented 11 years ago

Implement the TF IDF algorithm. How do I get this issue assigned to me?

angeloskath commented 11 years ago

I think you need to have write access to the repository to be able to be assigned issues.

The difficulty in designing the above feature is where it fits, architecturally, with the rest of the code. Currently, each class that needs to process a set of documents receives a TrainingSet and a FeatureFactory. The feature factory transforms the document's data to a sparse vector implemented with php's associative arrays. TF-IDF, stemming tokens, removing stopwords, all those practically belong to the same place in the code since they do the same type work, data transformation/preprocessing.

I propose implementing the above as feature factories.

We could implement a FeaturePipeline that receives a set of feature factories and applies them consecutively. Then StemmedData, RemovedStopWords and TfIdf are pretty straight forward.

Regarding TfIdf, which is the more complex of the aforementioned classes. I suggest receiving a TrainingSet and a FeatureFactory as constructor parameters from which an Idf dictionary can be built. After that the getFeatureArray implementation is easy.

Looking forward to your thoughts on that. Thanks.

yooper commented 11 years ago

I agree on the need to develop a processing pipeline. The processing pipeline should have the following configurable/settable behaviors:

Normalizer -> I am thinking tools that make the raw text lower case, strip quotations etc.. or none
Stemmer -> Chose an existing stemmer or none
Tokenization -> Chose an existing tokenizer or default to the whitespace tokenizer
StopWord -> Chose an existing StopWord implementation or none,
Documents -> 1 or more documents
Algorithm -> An algorithm to apply to the documents.

Lets call this object/class "Pipeline", we can find a better name for it later. The Pipeline object must accept an algorithm. Interfaces will be really helpful here to make implementation and consistency easier. The output of the pipeline will vary, but that is to be expected since the algorithms will drive that data outputs.

I am unclear by what you mean by FeatureFactory. Is this a factory design pattern that accepts texts or tokens and returns a set of tokens that are of interest? I am not overly familiar with the terms in this domain so I might need additional references or more explanation. I usually try to read an article or two when I see a term I do not recognize, but this is all still very new.

The implementation details for TF-IDF can be better defined in future comments. Designing a processing pipeline that can be re-used seems to be a good starting point for this discussion.

On Thu, Sep 5, 2013 at 5:36 AM, Angelos Katharopoulos < notifications@github.com> wrote:

I think you need to have write access to the repository to be able to be assigned issues.

The difficulty in designing the above feature is where it fits, architecturally, with the rest of the code. Currently, each class that needs to process a set of documents receives a TrainingSet and a FeatureFactory. The feature factory transforms the document's data to a sparse vector implemented with php's associative arrays. TF-IDF, stemming tokens, removing stopwords, all those practically belong to the same place in the code since they do the same type work, data transformation/preprocessing.

I propose implementing the above as feature factories.

We could implement a FeaturePipeline that receives a set of feature factories and applies them consecutively. Then StemmedData, RemovedStopWords and TfIdf are pretty straight forward.

Regarding TfIdf, which is the more complex of the aforementioned classes. I suggest receiving a TrainingSet and a FeatureFactory as constructor parameters from which an Idf dictionary can be built. After that the getFeatureArray implementation is easy.

Looking forward to your thoughts on that. Thanks.

— Reply to this email directly or view it on GitHubhttps://github.com/angeloskath/php-nlp-tools/issues/12#issuecomment-23855142 .

angeloskath commented 11 years ago

This kind of pipeline is not what I was talking about in the previous comment.

Quoting wikipedia: "In machine learning and pattern recognition, a feature is an individual measurable heuristic property of a phenomenon being observed." you can read a bit more about FeatureFactories in the documentation.

A big part of machine learning work is feature engineering, inventing functions that measure an important property of the data regarding the task at hand. You can think of a feature factory as a class that receives a document and a category (the category can be null) and returns either which features where found or if those features have a value what was their value.

Normalizer, Stemmer, StopWord, Tf Idf all those can be implemented as feature factories. I am not certain that this would be the best place for them but they can be implemented and it makes sense up to a certain point. What they do is if you spot "Word" or "word" activate the feature "word".

yooper commented 11 years ago

In my opinion, normalizing, stemming, tokenization, and removing stop words should be separate processes that are de-coupled from feature factories. This de-coupling will keep them easy to use for other folks. We want to promote ease of use for general features that others can use in their own projects. Feature Factories are abstract and typically encompass / implement computational methods that require specialized knowledge.

As for the Pipe line, I think we should investigate how other projects approach this problem.

I am in agreement the TF IDF can be implemented as FeatureFactory. And I will begin working on that next.

angeloskath / php-nlp-tools

Implement Term Frequency Inverse Document Frequency (TF IDF) #12