Stop words - Ignoring case

angeloskath / php-nlp-tools

Natural Language Processing Tools in PHP

Do What The F*ck You Want To Public License

Stop words - Ignoring case #36

Closed. JulienMalige closed this issue 8 years ago

JulienMalige commented 8 years ago

Hey,

I don't know if this project is still maintained, but the topic is very interesting and the work is awesome.

I'm using the stop words transformation. My words are loaded from an external file, which is a list of lower-case words (e.g. "are, you, a..."). When I apply the transformation to text that contains an upper-case letter (e.g. "Are you"), the word "Are" is not treated as a stop word (because "Are" != "are").

Don't you think it would be better to apply strtolower during the applyTransformation work? My original words can contain upper-case letters (even if I use the Normalizers).

Thanks.

#NlpTools/Documents/TokensDocument.php

function applyTransformation(TransformationInterface $transform)
{
    // array_values for re-indexing
    $this->tokens = array_values(
        array_filter(
            array_map(
                array($transform, 'transform'),
                // My change: lower-case every token before transforming it
                array_map('strtolower', $this->tokens)
            ),
            // drop the tokens that the transformation removed (returned null)
            function ($token) {
                return $token !== null;
            }
        )
    );
}
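
To make the mismatch concrete, here is a minimal standalone sketch. It does not use the library: `filterStopWords` is a hypothetical helper that only assumes the stop word check is a case-sensitive key lookup, as described above. It also shows the side effect of lower-casing inside the filtering step: the tokens that survive lose their original case.

```php
<?php
// Hypothetical stand-in for a stop word filter: a case-sensitive key lookup.
function filterStopWords(array $tokens, array $stopwords): array
{
    $lookup = array_fill_keys($stopwords, true);

    // keep the tokens that are not in the lookup, then re-index
    return array_values(array_filter($tokens, function ($token) use ($lookup) {
        return !isset($lookup[$token]);
    }));
}

$stopwords = ['are', 'you', 'a'];
$tokens = ['Are', 'you', 'reading', 'a', 'Book'];

// "Are" survives because it does not match the lower-case entry "are".
$caseSensitive = filterStopWords($tokens, $stopwords);

// Lower-casing first removes "Are", but "Book" becomes "book" as well.
$lowercased = filterStopWords(array_map('strtolower', $tokens), $stopwords);

print_r($caseSensitive); // Are, reading, Book
print_r($lowercased);    // reading, book
```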

angeloskath commented 8 years ago

Hi Julien,

Thanks for the kind words. The project is not heavily maintained, but I'm still here.

Regarding your issue, you should use one of the NlpTools\Utils\Normalizers to normalize the tokens first. In fact for English all it does is apply the mb_strtolower function to make your tokens lower case.

Now, in case we need to keep the document unchanged but still have the StopWords filter work on the normalized words, we should add an optional Transformation to StopWords. StopWords::transform would apply it to the token before checking whether it is a stop word, and then return the original token (if it isn't a stop word, of course).

So I would change #37, because in general we do not want the transformation to be applied to lower-cased tokens, and because strtolower only works with ASCII characters.

I would be glad to merge a PR that adds an optional transformation to the StopWords though.
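
The ASCII limitation matters for languages such as French. The sketch below is only an illustration under two assumptions: the mbstring extension is loaded and the file is saved as UTF-8. The stop word list and the token are made up for the example.

```php
<?php
// Assumes the mbstring extension is loaded and this file is saved as UTF-8.
$stopwords = array_fill_keys(['été', 'à'], true);
$token = 'ÉTÉ';

// strtolower works byte by byte and is not UTF-8 aware,
// so the accented letters are not converted to "é".
$ascii = strtolower($token);

// mb_strtolower takes the encoding into account and yields "été".
$multibyte = mb_strtolower($token, 'UTF-8');

var_dump(isset($stopwords[$ascii]));     // the lookup misses
var_dump(isset($stopwords[$multibyte])); // the lookup succeeds
```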

JulienMalige commented 8 years ago

You are right, NlpTools\Utils\Normalizers did the job. Sorry, but I work with French, so I did not dare to use Normalizer::factory("English"). Later, I will try to implement Stemmers and a Normalizer for the French language.

I'm not sure I have the skill to write the optional transformation. Did you think about something like:

    public function transform($token, $options)
    {
        if (array_key_exists('mb_strtolower', $options)) {
            $token = mb_strtolower($token, $options['mb_strtolower']);
        }

        if (!isset($this->stopwords[$token])){
            return $token;
        }
        return null;
    }

Where I suppose that $options = [ 'mb_strtolower' => 'utf-8'].

angeloskath commented 8 years ago

No, we wouldn't want to change the TransformationInterface, which declares the transform method.

What I meant is to give another TransformationInterface to the StopWords class, which it can use to transform the token before checking whether it exists in the stop words list. Something like the following

    $tocheck = $this->inner_transform->transform($token);

    return isset($this->stopwords[$tocheck]) ? null : $token;

allows us, for instance, to have a stop words list of stemmed words but keep the original tokens.
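
Put together, the idea could look roughly like the standalone sketch below. It is not the code that was merged: `StopWordsSketch` and `LowerCase` are hypothetical names, the interface is re-declared locally so the example runs without the library, and the nullable type hint assumes PHP 7.1 or later with mbstring loaded.

```php
<?php
// Stand-in for the library's interface, declared here so the sketch runs alone.
interface TransformationInterface
{
    public function transform($token);
}

// Hypothetical inner transformation: lower-cases a token (needs mbstring).
class LowerCase implements TransformationInterface
{
    public function transform($token)
    {
        return mb_strtolower($token, 'UTF-8');
    }
}

// Sketch of a stop word filter with an optional inner transformation.
class StopWordsSketch implements TransformationInterface
{
    private $stopwords;
    private $inner_transform;

    public function __construct(array $stopwords, ?TransformationInterface $inner = null)
    {
        // store the words as keys so that isset() can be used for the lookup
        $this->stopwords = array_fill_keys($stopwords, true);
        $this->inner_transform = $inner;
    }

    public function transform($token)
    {
        // check the transformed token, but return the original one
        $tocheck = $this->inner_transform
            ? $this->inner_transform->transform($token)
            : $token;

        return isset($this->stopwords[$tocheck]) ? null : $token;
    }
}

$filter = new StopWordsSketch(['are', 'you', 'a'], new LowerCase());
$tokens = ['Are', 'you', 'reading', 'a', 'Book'];

// same filtering steps as applyTransformation: map, drop nulls, re-index
$kept = array_values(array_filter(
    array_map([$filter, 'transform'], $tokens),
    function ($token) {
        return $token !== null;
    }
));

print_r($kept); // reading, Book
```

Because the check runs on the transformed copy only, "Book" keeps its capital letter while "Are" is still removed. The transform signature stays the same, so the interface does not need to change.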

angeloskath commented 8 years ago

Julien, I am really sorry for forgetting this for soooo long.

I just merged your pull request. You should have pinged me again. Anyway, sorry.

JulienMalige commented 8 years ago

@angeloskath No problem, it was not an emergency. Thanks for the merge 👍