DavidBelicza / PHP-Science-TextRank

:zap: :elephant: TextRank (resource-efficient and low-cost automatic text summarisation) for PHP
https://php.science/textrank/
MIT License
243 stars 40 forks

In some cases with Russian stop words last char of keyword is "?" #7

Closed demogorgorn closed 5 years ago

demogorgorn commented 6 years ago

In some cases, when using the Russian stop words, the last character of a keyword is "?".

demogorgorn commented 6 years ago

Often the affected character is the Russian letter EL (л).

demogorgorn commented 6 years ago
Array
(
    [geographic] => 1
    [traveler] => 0.73333333333333
    [national] => 0.66666666666667
    [получивший] => 0.46666666666667
    [название] => 0.46666666666667
    [прове?] => 0.33333333333333
    [конкурс] => 0.33333333333333
    [азербайджан] => 0.33333333333333
    [пользователи] => 0.33333333333333
    [номинаций] => 0.33333333333333
    [зрительских] => 0.26666666666667
    [принимали] => 0.26666666666667
    [отбирали] => 0.26666666666667
    [онлайн] => 0.26666666666667
    [основных] => 0.26666666666667
    [путем] => 0.26666666666667
    [подобрано] => 0.26666666666667
    [awards] => 0.26666666666667
    [участие] => 0.2
    [процента] => 0.2
    [самсами] => 0.2
    [гонке] => 0.2
    [лидерство] => 0.2
    [результатом] => 0.2
    [италию] => 0.2
    [обогнала] => 0.2
    [пловом] => 0.2
    [набравшую] => 0.2
    [голосов] => 0.2
    [включало] => 0.2
    [октября] => 0.2
    [июня] => 0.2
    [определении] => 0.2
    [журна?] => 0.2
    [туризм] => 0.2
    [открытого] => 0.2
    [отрывом] => 0.2
    [большим] => 0.2
    [победу] => 0.2
    [гастрономический] => 0.2
    [большинстве] => 0.2
    [своем] => 0.2
    [лучших] => 0.2
    [лучшие] => 0.2
    [режиме] => 0.2
    [голосования] => 0.2
    [стран] => 0.13333333333333
    [путешественникам] => 0.13333333333333
    [отменным] => 0.13333333333333
    [проходило] => 0.13333333333333
    [номинации] => 0.13333333333333
    [заня?] => 0.13333333333333
    [туризма] => 0.13333333333333
    [снг] => 0.066666666666667
    [узбекистан] => 0.066666666666667
    [голосование] => 0.066666666666667
    [симпатий] => 0.066666666666667
    [знаменитая] => 0.066666666666667
    [вкусной] => 0.066666666666667
    [страны] => 0.066666666666667
    [помощь] => 0.066666666666667
    [2018] => 0.066666666666667
    [источник] => 0
    [news-asia] => 0
)
DavidBelicza commented 6 years ago

Thanks for the report. Could you post a sample of raw text where this issue occurs?

rernesto commented 5 years ago

I don't speak Russian. I am Hispanic and I have had this issue before when parsing texts, not with this library, but I think it is related to encoding. Where is the text coming from? A database, a text file? Either way, you should check the encoding of the connection and of the text files.

mvcaaa commented 5 years ago

@rernesto I don't see any extra or wrong non-Cyrillic characters in the source files.

You can check it here: https://github.com/DavidBelicza/PHP-Science-TextRank/pull/5/files

PS: there is no stop word containing the прове part. It looks like it comes from your source text.

rernesto commented 5 years ago

You’re right. I have the same issue. Also, my text contains HTML tags and the parser includes the tags in the keywords array. So I believe the problem happens when TextRank parses the text. That was today at 1 am and I went to sleep. I will check it in detail in the afternoon and let you know.

rernesto commented 5 years ago

@DavidBelicza Solved... Two birds with one shot (this also fixes the HTML tags issue). The issue is in Parser.php. So far my solution works with Cyrillic characters too. Here is my solution (sorry, I have no time to fork and open a pull request; your original code is commented out):

...
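    protected function getWords(string $subText): array
    {
        // Original splitting logic, kept commented out for comparison.
//        $words = preg_split(
//            '/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/',
//            $subText,
//            -1,
//            PREG_SPLIT_NO_EMPTY
//        );

        // Replacement: collect runs of two or more word characters in UTF-8 mode.
        $words = [];
        preg_match_all('/\w\w+/u', $subText, $words);

        // $words[0] holds the full-pattern matches.
        $words = array_values(
            array_filter(
                array_map(
                    [$this, 'cleanWord'],
                    $words[0]
//                    $words
                )
            )
        );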
...
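
A likely explanation for why this works, offered as a sketch and not confirmed anywhere in this thread: the original pattern has no `u` modifier, so PCRE reads the text byte by byte. The letter л is stored in UTF-8 as the bytes 0xD0 0xBB, and a lone 0xBB byte can be read as "»", which counts as punctuation for `\p{P}`. The pattern would then strip that byte from the end of a word and leave half a character behind, which is displayed as "?". The sample text below is hypothetical, and the comparison only adds `u` to the original pattern:

```php
<?php
// The splitting pattern from the commented-out code above (no "u" modifier).
$pattern = '/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/';

// Hypothetical sample text; "л" is encoded in UTF-8 as the bytes 0xD0 0xBB.
$text = 'журнал провел конкурс';

// Byte mode: a trailing 0xBB byte may be treated as punctuation and cut off.
$byteMode = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

// UTF-8 mode: the same pattern with "u" treats "л" as one letter.
$utf8Mode = preg_split($pattern . 'u', $text, -1, PREG_SPLIT_NO_EMPTY);

foreach (['byte mode' => $byteMode, 'UTF-8 mode' => $utf8Mode] as $label => $words) {
    foreach ($words as $word) {
        // An empty pattern with "u" only matches when the subject is valid UTF-8.
        $status = preg_match('//u', $word) === 1 ? 'valid' : 'broken';
        echo $label, ': ', $word, ' (', $status, ')', PHP_EOL;
    }
}
```

If this explanation holds, adding `u` to the original pattern would be a smaller change than switching to `/\w\w+/u`, which also drops one-character words and splits hyphenated tokens such as news-asia.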
rernesto commented 5 years ago

@DavidBelicza Anyway, I need to reuse some of your code (copy and paste). Your name will be in my class comments. The vector norm must be calculated in the right way (I am using the L2 norm) and I need to use a TF-IDF score. I am working on a text mining project for the Christian community. It is available at Bible Miner.

DavidBelicza commented 5 years ago

@rernesto This word extraction process is designed to extract words from plain text, so it can't contain any HTML parsing. If HTML parsing were handled here, then what about other data formats like XML, JSON, CSV, etc.?

I suggest using an HTML parser library for PHP for that purpose and passing clean raw text to the TextRankFacade once the HTML parser has extracted the text.
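
As a rough illustration of that separation, here is a minimal sketch that cleans HTML before any keyword extraction. It swaps in PHP's built-in `strip_tags()` for a full HTML parser library, so it is only suitable for simple markup. The function name `htmlToPlainText` and the sample HTML are hypothetical, and the call into TextRankFacade is deliberately left out:

```php
<?php
// Hypothetical helper: reduce simple HTML to plain text before keyword extraction.
function htmlToPlainText(string $html): string
{
    // Put a space before every tag so adjacent blocks do not fuse into one word.
    $spaced = str_replace('<', ' <', $html);

    // Drop the tags, then turn entities such as &amp; back into characters.
    $text = strip_tags($spaced);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse whitespace runs; fall back to the uncollapsed text if PCRE fails.
    $collapsed = preg_replace('/\s+/u', ' ', $text);

    return trim($collapsed ?? $text);
}

$html = '<p>Журнал провел конкурс.</p><p>Источник: news-asia</p>';
echo htmlToPlainText($html), PHP_EOL;
```

The resulting string is what would be handed to the library. Text that contains a literal "<" outside of tags, or content inside script and style elements, needs a real parser.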

rernesto commented 5 years ago

@DavidBelicza Check PHP-ML. Basically I copied and pasted your code and replaced the code of some methods with PHP-ML methods. Take your Score class as an example:

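    // Assumes the class file imports the PHP-ML class:
    // use Phpml\Preprocessing\Normalizer;
    protected function normalizeAndSortScores(array &$scores): array
    {
        // Original normalization, kept commented out for comparison.
//        foreach ($scores as $key => $value) {
//            $v = $this->normalize(
//                $value,
//                $this->minimumValue,
//                $this->maximumValue
//            );
//
//            $scores[$key] = $v;
//        }

        // The normalizer works per sample (row), so all scores go into one row.
        $normalizer = new Normalizer();
        $values = [array_values($scores)];
        $keys = array_keys($scores);
        $normalizer->transform($values);
        $scores = array_combine($keys, $values[0]);

        // Sort by score, highest first, keeping the keyword keys.
        arsort($scores);

        return $scores;
    }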

You could contribute to that project and reuse some of their code. If you don't want to, I eventually will, and as I told you before, I'll keep your author notice in the source files.
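
For readers without PHP-ML installed, the L2 step can be sketched in plain PHP: each score is divided by the square root of the sum of squared scores. The function name `l2NormalizeAndSort` is hypothetical, and leaving an all-zero vector unchanged is a choice made for this sketch, not necessarily what PHP-ML does:

```php
<?php
// Hypothetical helper: L2-normalize keyword scores and sort them, highest first.
function l2NormalizeAndSort(array $scores): array
{
    // Euclidean (L2) norm: square root of the sum of squared scores.
    $sumOfSquares = 0.0;
    foreach ($scores as $value) {
        $sumOfSquares += $value * $value;
    }
    $norm = sqrt($sumOfSquares);

    // Assumption of this sketch: an all-zero vector is left as it is.
    if ($norm > 0.0) {
        foreach ($scores as $key => $value) {
            $scores[$key] = $value / $norm;
        }
    }

    // arsort() keeps the keyword keys attached to their scores.
    arsort($scores);

    return $scores;
}

// 3 and 4 have an L2 norm of 5, so the scores become 0.6 and 0.8.
print_r(l2NormalizeAndSort(['national' => 3.0, 'geographic' => 4.0]));
```

Unlike a rescaling to the range 0 to 1, which is what the keyword dump earlier in this thread appears to show, L2 normalization does not force the top score to 1 or the lowest to 0.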

rernesto commented 5 years ago

@DavidBelicza I did not see your answer... I want to thank you for your work on this library. Anyway, the PHP-ML word tokenizer extracts word tokens regardless of the format of the source file.