Closed demogorgorn closed 5 years ago
Often this character is the Russian letter EL (л).
Array
(
[geographic] => 1
[traveler] => 0.73333333333333
[national] => 0.66666666666667
[получивший] => 0.46666666666667
[название] => 0.46666666666667
[прове?] => 0.33333333333333
[конкурс] => 0.33333333333333
[азербайджан] => 0.33333333333333
[пользователи] => 0.33333333333333
[номинаций] => 0.33333333333333
[зрительских] => 0.26666666666667
[принимали] => 0.26666666666667
[отбирали] => 0.26666666666667
[онлайн] => 0.26666666666667
[основных] => 0.26666666666667
[путем] => 0.26666666666667
[подобрано] => 0.26666666666667
[awards] => 0.26666666666667
[участие] => 0.2
[процента] => 0.2
[самсами] => 0.2
[гонке] => 0.2
[лидерство] => 0.2
[результатом] => 0.2
[италию] => 0.2
[обогнала] => 0.2
[пловом] => 0.2
[набравшую] => 0.2
[голосов] => 0.2
[включало] => 0.2
[октября] => 0.2
[июня] => 0.2
[определении] => 0.2
[журна?] => 0.2
[туризм] => 0.2
[открытого] => 0.2
[отрывом] => 0.2
[большим] => 0.2
[победу] => 0.2
[гастрономический] => 0.2
[большинстве] => 0.2
[своем] => 0.2
[лучших] => 0.2
[лучшие] => 0.2
[режиме] => 0.2
[голосования] => 0.2
[стран] => 0.13333333333333
[путешественникам] => 0.13333333333333
[отменным] => 0.13333333333333
[проходило] => 0.13333333333333
[номинации] => 0.13333333333333
[заня?] => 0.13333333333333
[туризма] => 0.13333333333333
[снг] => 0.066666666666667
[узбекистан] => 0.066666666666667
[голосование] => 0.066666666666667
[симпатий] => 0.066666666666667
[знаменитая] => 0.066666666666667
[вкусной] => 0.066666666666667
[страны] => 0.066666666666667
[помощь] => 0.066666666666667
[2018] => 0.066666666666667
[источник] => 0
[news-asia] => 0
)
Thanks for the report. Could you post a sample of the raw text where this issue occurs?
I don't speak Russian; I am Hispanic, and I have had this issue before when parsing texts. Not with this library, but I think it is related to encoding. Where is the text coming from? A database, a text file? Either way, you should check the encoding of the connection and of the text files.
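A quick way to rule encoding in or out is to check whether the input is valid UTF-8 before it reaches the parser. The sketch below is only an illustration: `findInvalidUtf8Lines` is a hypothetical helper, not part of the library, and it assumes the mbstring extension is loaded and that the text is supposed to be UTF-8.

```php
<?php
// Hypothetical helper (not part of PHP-Science-TextRank): returns the 1-based
// numbers of the lines that are not valid UTF-8. Assumes ext-mbstring is loaded.
function findInvalidUtf8Lines(string $text): array
{
    $badLines = [];
    foreach (explode("\n", $text) as $index => $line) {
        if (!mb_check_encoding($line, 'UTF-8')) {
            $badLines[] = $index + 1;
        }
    }
    return $badLines;
}
```

An empty result means the input itself is valid UTF-8, so a stray "?" would have to be introduced later, during processing.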
@rernesto I don't see any extra or wrong non-Cyrillic characters in the source files.
You can check it here: https://github.com/DavidBelicza/PHP-Science-TextRank/pull/5/files
PS: there is no stop word containing the "прове" part. Looks like it's in your source.
You're right. I have the same issue. Also, my text contains HTML tags and the parser includes the tags in the keywords array. So I believe the problem happens when TextRank parses the text. That was today at 1 am, so I went to sleep. I will check it in detail in the afternoon. I'll let you know.
@DavidBelicza Solved... two birds with one stone (this also fixes the HTML tags issue). The issue is in Parser.php. So far my solution works with Cyrillic characters too. Here is my solution (sorry, I have no time to fork and open a pull request; your original code is commented out):
    ...
    protected function getWords(string $subText): array
    {
        // $words = preg_split(
        //     '/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/',
        //     $subText,
        //     -1,
        //     PREG_SPLIT_NO_EMPTY
        // );
        $words = [];
        preg_match_all('/\w\w+/u', $subText, $words);
        $words = array_values(
            array_filter(
                array_map(
                    [$this, 'cleanWord'],
                    $words[0]
                    // $words
                )
            )
        );
    ...
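One plausible explanation for the trailing "?" (an assumption, not something confirmed in this thread) is that the commented-out pattern has no `/u` modifier, so PCRE works on single bytes. The letter "л" is encoded in UTF-8 as the bytes D0 BB, and the byte BB on its own is read as "»", which counts as punctuation for `\p{P}`. The split then eats that byte and leaves a dangling D0. The sketch below compares both approaches on a two-word sample; `splitBytewise` and `matchUnicodeWords` are hypothetical names used only for this illustration.

```php
<?php
// The pattern quoted in the commented-out code above, run WITHOUT the /u
// modifier, so PCRE sees individual bytes instead of UTF-8 code points.
function splitBytewise(string $text): array
{
    $parts = preg_split(
        '/(?:(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/',
        $text,
        -1,
        PREG_SPLIT_NO_EMPTY
    );
    return $parts ?: [];
}

// The replacement from the comment above: with /u the subject is read as
// UTF-8, so \w matches whole Cyrillic letters.
function matchUnicodeWords(string $text): array
{
    preg_match_all('/\w\w+/u', $text, $matches);
    return $matches[0];
}

$sample = 'провел конкурс';
foreach (splitBytewise($sample) as $piece) {
    // Show the raw bytes and whether the piece is still valid UTF-8 (needs ext-mbstring).
    echo bin2hex($piece), ' valid=', var_export(mb_check_encoding($piece, 'UTF-8'), true), PHP_EOL;
}
print_r(matchUnicodeWords($sample));
```

Note that `\w\w+` also changes behavior in other ways: it drops one-character words, splits on hyphens and apostrophes, and treats digits and underscores as word characters.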
@DavidBelicza Anyway, I need to reuse some of your code (copy and paste). Your name is going to be in my class comments. The vector norm must be calculated the right way (I am using the L2 norm), and I need to use a TF-IDF score. I am working on a text mining project for the Christian community. It is available at Bible Miner.
@rernesto This word extraction step is designed to extract words from plain text, so it shouldn't contain any HTML parsing. If HTML parsing were handled here, then what about other data formats like XML, JSON, CSV, etc.?
I suggest using an HTML parser PHP library for that purpose and passing clean raw text to the TextRankFacade after the HTML parser has retrieved the text.
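As a rough illustration of that separation, the sketch below strips markup with PHP built-ins before any keyword extraction happens. `htmlToPlainText` is a hypothetical helper, not part of the library, and `strip_tags` is only a stand-in for a real HTML parser: it is good enough for simple fragments, but malformed or complex markup calls for a proper parsing library.

```php
<?php
// Hypothetical helper (not part of PHP-Science-TextRank): reduces an HTML
// fragment to plain text using only PHP built-ins.
function htmlToPlainText(string $html): string
{
    // Remove <script> and <style> elements together with their contents.
    $text = preg_replace('~<(script|style)\b[^>]*>.*?</\1\s*>~is', ' ', $html) ?? $html;
    // Add a space before each tag so words in adjacent elements do not merge.
    $text = str_replace('<', ' <', $text);
    $text = strip_tags($text);
    // Turn entities such as &amp; or &nbsp; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace (including non-breaking spaces) into one space.
    $text = preg_replace('~\s+~u', ' ', $text) ?? $text;
    return trim($text);
}

$plain = htmlToPlainText('<p>Лучшие блюда</p><p>Узбекистан &amp; плов</p>');
// $plain can now be handed to the keyword extractor instead of the raw HTML.
echo $plain, PHP_EOL;
```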
@DavidBelicza Check out PHP-ML. Basically, I copied and pasted your code and replaced the bodies of some methods with php-ml calls. Here is your Score class as an example:
    // Requires at the top of the file: use Phpml\Preprocessing\Normalizer;
    protected function normalizeAndSortScores(array &$scores): array
    {
        // foreach ($scores as $key => $value) {
        //     $v = $this->normalize(
        //         $value,
        //         $this->minimumValue,
        //         $this->maximumValue
        //     );
        //
        //     $scores[$key] = $v;
        // }

        // php-ml's Normalizer works on a list of samples, so the scores are
        // wrapped as a single sample and unwrapped again afterwards.
        $normalizer = new Normalizer();
        $values = [array_values($scores)];
        $keys = array_keys($scores);
        $normalizer->transform($values);
        $scores = array_combine($keys, $values[0]);
        arsort($scores);
        return $scores;
    }
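For readers without php-ml installed, the L2 normalization mentioned earlier can be sketched in plain PHP. This is an illustration under stated assumptions rather than the library's or php-ml's implementation: `l2NormalizeScores` is a hypothetical name, and leaving an all-zero map unchanged is a choice made here for simplicity.

```php
<?php
// Plain-PHP sketch of L2 normalization for a word => score map.
// Hypothetical helper; it does not use php-ml or the library's Score class.
function l2NormalizeScores(array $scores): array
{
    $sumOfSquares = 0.0;
    foreach ($scores as $value) {
        $sumOfSquares += $value * $value;
    }
    $norm = sqrt($sumOfSquares);

    // Assumption: leave an all-zero (or empty) map untouched to avoid dividing by zero.
    if ($norm > 0.0) {
        foreach ($scores as $word => $value) {
            $scores[$word] = $value / $norm;
        }
    }

    arsort($scores); // highest score first, keys preserved
    return $scores;
}

print_r(l2NormalizeScores(['traveler' => 3.0, 'geographic' => 4.0]));
```

Here the vector (3, 4) has length 5, so the scores become 0.6 and 0.8. Unlike min-max scaling, the top keyword is not forced to exactly 1.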
You could contribute to that project and reuse some of their code. If you don't want to do it, eventually I will, and as I told you before, I'll keep your author notice in the source files.
@DavidBelicza I did not see your answer... I want to thank you for your work on this library. Anyway, the PHP-ML word tokenizer extracts word tokens regardless of the format of the source file.
In some cases with Russian stop words, the last character of a keyword is "?"