bopoda / robots-txt-parser

PHP class for parsing all directives from robots.txt files according to the specifications
http://robots.jeka.by
MIT License

Problems with iconv() #43

Closed LeMoussel closed 7 years ago

LeMoussel commented 7 years ago

On a server, I have Alpine Linux installed. Alpine Linux is a security-oriented, lightweight Linux distribution based on musl libc and busybox.

There are functional differences between Debian (glibc) and Alpine (musl libc), among others with iconv.

In some cases with Alpine Linux, you may encounter problems in PHP with iconv() and the //IGNORE option. That means you can't currently use this function to filter invalid characters: it silently fails and returns an empty string (or you'll get a notice, but only if you have E_NOTICE enabled). Ref: PHP manual, iconv()
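A defensive workaround for this is to check iconv()'s return value and fall back to mbstring when it fails. This is only a sketch, not code from the library; the helper name toUtf8 is hypothetical:

```php
<?php
// Hypothetical helper (not part of robots-txt-parser): convert input to
// UTF-8, falling back to mb_convert_encoding() when iconv() with
// //IGNORE silently fails, as it can on musl-based systems like Alpine.
function toUtf8(string $content): string
{
    $encoding = mb_detect_encoding($content, mb_detect_order(), false);
    $converted = @iconv($encoding, 'UTF-8//IGNORE', $content);

    // On musl libc, iconv() with //IGNORE may return false or an empty
    // string instead of the filtered input.
    if ($converted === false || ($converted === '' && $content !== '')) {
        $converted = mb_convert_encoding($content, 'UTF-8', $encoding);
    }

    return $converted;
}

echo toUtf8("User-agent: *\nDisallow: /bd/\n");
```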

As described in the iconv() PHP manual, in RobotsTxtParser::__construct(), Line 63

// set content
$this->content = iconv(mb_detect_encoding($content, mb_detect_order(), false), "UTF-8//IGNORE", $content);

can be replaced by

$this->content = mb_convert_encoding($content, 'UTF-8', $encoding);

So I propose this:

// Strip invalid characters from UTF-8 strings
ini_set('mbstring.substitute_character', "none");

/**
* @param string $content Robots.txt content
* @param string $encoding Encoding
* @return RobotsTxtParser
*/
public function __construct($content, $encoding = self::DEFAULT_ENCODING)
{
   // convert encoding
   $encoding = !empty($encoding) ? $encoding : mb_detect_encoding($content, mb_detect_order(), false);

   // set content
   $this->content = mb_convert_encoding($content, 'UTF-8', $encoding);

   $this->prepareRules();
} 
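The ini_set above matters because mb_convert_encoding() does not drop invalid bytes by default. A minimal sketch (not code from the library) of what the "none" substitute character changes:

```php
<?php
// Sketch: why mbstring.substitute_character matters for this proposal.
// mb_convert_encoding() replaces bytes that are invalid in the source
// encoding with a substitute character ('?' by default); setting the
// substitute to "none" drops them instead, mimicking iconv's //IGNORE.
mb_substitute_character('none');

$dirty = "Disallow: /caf\xFF/";  // \xFF is never a valid UTF-8 byte

// "Converting" UTF-8 to UTF-8 validates the string and drops the
// invalid byte:
echo mb_convert_encoding($dirty, 'UTF-8', 'UTF-8');
```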
bopoda commented 7 years ago

@LeMoussel thanks for feedback!

ini_set('mbstring.substitute_character', "none");

is it really necessary?

I tried your code and almost all is great except this one: https://github.com/bopoda/robots-txt-parser/blob/master/tests/data/robots9/robots.txt https://github.com/bopoda/robots-txt-parser/blob/master/tests/data/robots9/expectedAllow

/allowpath should be allowed. http://joxi.ru/1A5QGRdTnkqVPA

The latest rules are transformed to "/*" (all strange characters trimmed), which means all urls will be disallowed in this case.

ini_set('mbstring.substitute_character', "none");

without this line everything is good. You can create PR if you want :)

LeMoussel commented 7 years ago

I think it's necessary to set the substitute character to "none" to strip invalid characters from UTF-8 strings.

The robots9/robots.txt file is encoded in ANSI, not UTF-8. Therefore it's normal that the latest rules are transformed to "/*": é, Ã, é, ... are invalid UTF-8 characters. #51
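This can be reproduced with mbstring alone. Assuming the file bytes are Windows-1252 (ANSI), where "é" is the single byte \xE9, treating them as UTF-8 and dropping invalid sequences mangles the rule, while converting from the real source encoding preserves it (the rule below is a made-up example, not one from robots9):

```php
<?php
// Sketch of the robots9 failure: ANSI bytes fed through a UTF-8 filter.
mb_substitute_character('none');

$ansi = "Allow: /caf\xE9";  // hypothetical Windows-1252-encoded rule

// Treated as UTF-8, \xE9 is an invalid sequence and gets dropped,
// mangling the path:
$mangled = mb_convert_encoding($ansi, 'UTF-8', 'UTF-8');

// Converting from the actual source encoding keeps the character:
$clean = mb_convert_encoding($ansi, 'UTF-8', 'Windows-1252');
```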

I made the PR. It fails with robots9/robots.txt :)

1) CommonTest::testIsUrlAllow with data set #20 ('User-Agent: \nDisallow: /bd/\n...ow: /�\n', '/allowPath')

Failed asserting that false is true.