IBMStreams / administration

Umbrella project for the IBMStreams organization. This project will be used for the management of the individual projects within the IBMStreams organization.

Proposal for new project: streamsx.text #96

Closed markheger closed 7 years ago

markheger commented 7 years ago

Introduction

This toolkit was created in response to project requirements to provide alternative operations for text analysis, such as lemmatization and text annotation with UIMA Ruta scripts or existing project-specific UIMA PEAR files. The code is currently located in enterprise git: https://github.ibm.com/mark-oliver-heger/texttoolkit

Dakshi Agrawal suggested publishing this toolkit on public git.

Proposal

I would like to propose that a new repository and toolkit be created for the new text analysis operators.

I propose that the repository be called streamsx.text and that the toolkit be called com.ibm.streamsx.text

Initial contribution

The toolkit will initially contain a set of C++ and Java operators to support the following features:

Lemmatize

A lemma is the canonical form, dictionary form, or citation form of a set of words. In English, for example, run, runs, ran and running are forms of the same lexeme, with run as the lemma. This functionality is covered with operator Lemmatizer. (C++ operator and native functions)

Part-of-speech tagging

Part-of-speech tagging (aka POS tagging, POST, grammatical tagging) is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. This functionality is part of operator Lemmatizer. (C++ operator and native functions)

Stop Word removal

Stop words are words that carry too little meaning to be useful in search queries, such as "the" and "and". This functionality is part of the DictionaryFilter operator. (C++ operator)

Dictionary Filter

In contrast to stop word removal, it is sometimes useful to keep only the words from a text that are in a dictionary. This functionality is covered by the DictionaryFilter operator. (C++ operator)
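The two uses of a dictionary described above (dropping stop words versus keeping only dictionary words) can be sketched in a few lines of Python. This is only an illustration of the idea, not the DictionaryFilter implementation; the function name `filter_tokens`, the `keep_matches` flag, and the lower-casing of tokens are assumptions made for this example.

```python
def filter_tokens(tokens, dictionary, keep_matches):
    """Keep tokens found in `dictionary` (keep_matches=True) or drop them (False).

    Hypothetical helper for illustration; matching is case-insensitive
    because tokens are lower-cased before the lookup.
    """
    return [t for t in tokens if (t.lower() in dictionary) == keep_matches]


tokens = "The cat and the hat".split()

# Stop word removal: drop every token that is in the stop word list.
without_stop_words = filter_tokens(tokens, {"the", "and", "a", "of"}, keep_matches=False)

# Dictionary filter: keep only tokens that are in the dictionary.
only_dictionary = filter_tokens(tokens, {"cat", "hat"}, keep_matches=True)
```

Both modes share one lookup, which may be why a single operator can cover both features.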

TF-IDF

TF-IDF is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. It needs a trained model. The training can run in Streams using the IdfCorpusBuilder operator, but the trained model can also come from external sources. The TF-IDF functionality is covered by the TfIdfWeight operator. (C++ operator)
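For readers unfamiliar with the measure, here is a minimal Python sketch of the two steps: building the IDF model from a corpus, then weighting the terms of one document. It assumes the common textbook variant (relative term frequency times log(N / document frequency)) and gives terms unknown to the model a weight of 0; the weighting actually used by IdfCorpusBuilder and TfIdfWeight may differ. The names `build_idf` and `tf_idf` are hypothetical.

```python
import math
from collections import Counter


def build_idf(corpus):
    """Training step: corpus is a list of token lists; returns {term: idf}."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(doc, idf):
    """Scoring step: weight each term of one tokenized document."""
    counts = Counter(doc)
    total = len(doc)
    # Assumption: terms missing from the trained model get weight 0.
    return {t: (c / total) * idf.get(t, 0.0) for t, c in counts.items()}


idf_model = build_idf([["a", "b"], ["a", "c"]])
weights = tf_idf(["a", "b", "b"], idf_model)
```

A term that occurs in every training document ("a" here) ends up with weight 0, while a rarer term ("b") gets a positive weight.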

N-Grams

N-grams of a text are the sequences of N co-occurring words within a given window. When N=1 they are called unigrams, which are essentially the individual words in a sentence. When N=2 they are called bigrams, and when N=3 trigrams. When N>3 they are usually referred to as four-grams, five-grams, and so on. This functionality is covered by the Ngram operator. (C++ operator and native function)
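A short Python sketch of word n-grams as defined above, assuming whitespace tokenization and n >= 1. The function name `word_ngrams` is made up for this example and says nothing about the Ngram operator's interface.

```python
def word_ngrams(text, n):
    """Return all runs of n consecutive words in text, in order of appearance."""
    words = text.split()
    # If the text has fewer than n words, the range is empty and so is the result.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


unigrams = word_ngrams("the quick brown fox", 1)
bigrams = word_ngrams("the quick brown fox", 2)
trigrams = word_ngrams("the quick brown fox", 3)
```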

Rule-based Text Annotation

Support text annotation and analysis using Apache UIMA Ruta scripts or UIMA Analysis Engines coming in a .pear file. This functionality is covered with operators UimaText, UimaCase, RutaText and RutaCas. (Java operators)

Classification

Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An example would be assigning a given email to the "spam" or "non-spam" class. The training can run in Streams using the LinearClassificationModelBuilder operator, but the trained model can also come from external sources. The classification is covered by the LinearClassification operator. (SPL composites and Python scripts)
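To make the train-then-classify split concrete, here is a small Python sketch of a linear classifier over bag-of-words features, trained with the classic perceptron rule. The perceptron is chosen here only because it is short. Nothing in this proposal says which training algorithm LinearClassificationModelBuilder uses, and the names `train_perceptron` and `classify` are hypothetical.

```python
from collections import Counter


def train_perceptron(samples, epochs=10):
    """samples: list of (tokens, label) with label +1 or -1. Returns (weights, bias)."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for tokens, label in samples:
            score = bias + sum(weights[t] for t in tokens)
            if label * score <= 0:  # misclassified (or undecided): adjust the model
                for t in tokens:
                    weights[t] += label
                bias += label
    return weights, bias


def classify(tokens, weights, bias):
    """Apply the trained linear model: +1 (spam) if the score is positive, else -1."""
    return 1 if bias + sum(weights.get(t, 0.0) for t in tokens) > 0 else -1


training = [
    (["buy", "cheap", "pills"], 1),    # spam
    (["meeting", "at", "noon"], -1),   # non-spam
]
weights, bias = train_perceptron(training)
```

Because the model is just a weight table plus a bias, it could equally be produced by an external tool and loaded for scoring.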

Content Ranking

Content ranking tries to determine the intent of a text in relation to a field of interest. This requires a model to be trained beforehand. The training can run in Streams using the ContentRankingModelBuilder operator, but the trained model can also come from external sources. Content ranking is covered by the ContentRanking operator. (SPL composites and Python scripts)

leongor commented 7 years ago

+1

Best regards, Leonid Gorelik.

ddebrunner commented 7 years ago

Would there be any relation to https://github.com/IBMStreams/streamsx.ngrams ?

markheger commented 7 years ago

The name "n-grams" is the only similarity. The algorithms are quite different. The streamsx.ngrams toolkit implements a rolling hash, but our Ngram operator is based on the logic described here: http://text-analytics101.rxnlp.com/2014/11/what-are-n-grams.html.
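For context on the difference, a rolling hash computes a hash for every fixed-length window of a string and updates it in constant time as the window slides, instead of materializing each n-gram. The Python sketch below shows a generic polynomial rolling hash over character windows. It is not taken from streamsx.ngrams; the function name and the `base` and `mod` values are assumptions for illustration.

```python
def rolling_hashes(text, n, base=257, mod=(1 << 61) - 1):
    """Return the polynomial hash of every length-n character window of text."""
    if n <= 0 or len(text) < n:
        return []
    high = pow(base, n - 1, mod)  # weight of the character that leaves the window
    h = 0
    for ch in text[:n]:           # hash of the first window
        h = (h * base + ord(ch)) % mod
    hashes = [h]
    for i in range(n, len(text)):
        h = (h - ord(text[i - n]) * high) % mod  # drop the oldest character
        h = (h * base + ord(text[i])) % mod      # append the newest character
        hashes.append(h)
    return hashes


hashes = rolling_hashes("abcab", 2)  # windows: "ab", "bc", "ca", "ab"
```

Equal windows produce equal hashes, so repeated n-grams can be detected by comparing integers.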

ddebrunner commented 7 years ago

But should we merge the two toolkits, bringing the streamsx.ngrams functionality into the text toolkit?

vdogaru commented 7 years ago

+1

mikespicer commented 7 years ago

+1 Can you please let us know what the plan would be for the N-Gram toolkit once this text toolkit is created. Would you bring the NGram toolkit functionality into this text toolkit and deprecate/delete the N-Gram Toolkit? I don't think we should delay creating the text toolkit until that is done but think it would be good to create an issue to track how the overlap will be handled at the time the toolkit is created.

markheger commented 7 years ago

I would propose to rename the com.ibm.streamsx.ngrams::Ngrams operator to RollingHash and then move it from streamsx.ngrams to text toolkit. @leongor: Would you agree?

leongor commented 7 years ago

@markheger I agree to merge it. I'm not sure RollingHash operator name is the right one. It describes the algorithm used to find n-grams, but I don't think it's very informative for the end user (SPL developer), because its purpose is still to extract all n-grams from the string.

markheger commented 7 years ago

Proposal for merging streamsx.ngrams into the streamsx.text toolkit:

a) Rename com.ibm.streamsx.ngrams::Ngrams to com.ibm.streamsx.text::Ngrams.

b) Rename com.ibm.streamsx.text::Ngram to com.ibm.streamsx.text::NgramsBasic.

c) After the merge, the streamsx.ngrams repository should be marked as deprecated and should link to the new location.

d) @leongor: A precondition for the merge is that a sample application be created to demonstrate the usage of the Ngrams operator.

For the streamsx.text repository, the following users should be set as "initial committers": markheger joergboe hleuschner leongor

hleuschner commented 7 years ago

@leongor: Would you agree to the latest proposal from Mark?

chanskw commented 7 years ago

Set up repository.

leongor commented 7 years ago

@hleuschner Yes.