gabordemooij / redbean

ORM layer that creates models, config and database on the fly
https://www.redbeanphp.com
2.3k stars 279 forks source link

SQLSTATE[22007]: Invalid datetime format: 1366 Incorrect string value: '\xE2\x98\xBA\xEF\xB8\x8E...' #916

Closed hashborgir closed 1 year ago

hashborgir commented 1 year ago

SQLSTATE[22007]: Invalid datetime format: 1366 Incorrect string value: '\xE2\x98\xBA\xEF\xB8\x8E...' for columnnews.article.htmlat row 1Processing https://www.healthline.com/health-news/magic-mushroom-compound-called-psilocybin-may-help-treat-depression

PHP 8.2

Latest RB

I'm scraping some websites and saving the entire html output into a db field called html. It fails if it detects UTF8 or another ASCII type backslash codes like \xE2 etc.

Lynesth commented 1 year ago

You seem to be trying to save that as a datetime? How is that supposed to work? Are you sure your parsing is working properly?

hashborgir commented 1 year ago

What are you talking about? Please elaborate. I'm not saving anything other than scraped HTML. All I'm doing is grabbing the entire page HTML and save it to a field called html. Why is redbean talking about datetime? What does this have anything to do with datetime? The HTML content could contain any text. Why does redbean complain when saving that text to a text field?

<?php

    ini_set('display_errors', 1);
    error_reporting(E_ERROR| E_PARSE);

    require_once 'rb-mysql.php';
    R::setup('mysql:host=127.0.0.1;dbname=news', 'user', 'pass');

    function dd($var){
        var_dump($var);
        die();
    }

    function get_html($url) {
        $ua = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:110.0) Gecko/20100101 Firefox/110.0";
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_USERAGENT, $ua);
        curl_setopt($ch, CURLOPT_TIMEOUT, 5);
        $response = curl_exec($ch);
        curl_close($ch);

        return $response;
    }

    function get_header($url) {
        $ua = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:110.0) Gecko/20100101 Firefox/110.0";
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_HEADER, 1);
        curl_setopt($ch, CURLOPT_NOBODY, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_USERAGENT, $ua);
        curl_setopt($ch, CURLOPT_VERBOSE, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 5);
        $response = curl_exec($ch);
        curl_close($ch);
        return $response;
    }

    $total = R::getCell('SELECT count(*) FROM article');

    $i = 0;
    while ($i <= $total) {
        $article = R::findOne('article', 'scraped=0');
        echo "Processing {$article->link}" . PHP_EOL;

        try {
            if (!$headers = get_header($article->link)) {
                echo "TRASHED: ". $article->link;
                R::trash($article);
                throw new Exception('Could not get headers');

            }
        }
        catch (Exception $e) {
            echo $e->getMessage();
        }

        if ($headers['http_code'] == 403){
            echo "TRASHED: ". $article->link;
            R::trash($article);
        } else {
            // scrape data here
            try {
                if (!$html = get_html($article->link)){
                    echo "TRASHED: ". $article->link;
                    R::trash($article);
                    throw new Exception('Could not get HTML');

                } else if ($html = get_html($article->link)){
                    $enc = mb_detect_encoding($html);
                    $article->html = trim($html);
                    if (empty($enc)){
                        echo "TRASHED: ". $article->link;
                        R::trash($article);
                    } else {
                        echo "OKAY: ". $article->link . PHP_EOL;
                        $article->scraped = 1;
                        R::store($article);
                    }
                }
            }
            catch (Exception $e) {
                echo $e->getMessage();
            }
        }
        $i++;
    }
hashborgir commented 1 year ago
                    $article->html = trim($html);
                    $article->setMeta('cast.html', "longtext");

I even tried this, cast .html field to longtext explicity, but it gives the same error.

I am very confused by this. Could you elaborate upon what you're talking about? Thank you :)

Lynesth commented 1 year ago

I was just reading your initial post. The error says

SQLSTATE[22007]: Invalid datetime format: 1366 Incorrect string value

Which means you're trying to fit a value in a datetime column that isn't a proper datetime format. So I was wondering what your code was doing.

Can you show me what your table looks like?

hashborgir commented 1 year ago
mysql> desc article;
+-------------+------------------+------+-----+---------+----------------+
| Field       | Type             | Null | Key | Default | Extra          |
+-------------+------------------+------+-----+---------+----------------+
| id          | int(11) unsigned | NO   | PRI | NULL    | auto_increment |
| title       | varchar(255)     | YES  |     | NULL    |                |
| description | varchar(255)     | YES  |     | NULL    |                |
| link        | text             | YES  |     | NULL    |                |
| published   | datetime         | YES  |     | NULL    |                |
| category    | varchar(191)     | YES  |     | NULL    |                |
| image_link  | text             | YES  |     | NULL    |                |
| image_file  | varchar(191)     | YES  |     | NULL    |                |
| verified    | tinyint(4)       | YES  |     | 0       |                |
| scraped     | tinyint(4)       | YES  |     | 0       |                |
+-------------+------------------+------+-----+---------+----------------+
hashborgir commented 1 year ago

Here's another version of the script. Here's I'm using php-goose to extract article text and save to a field called goose which should be longtext based on the length of the textual data, and that also gives me the same SQL insert error, something about datetime... when I can't understand why it has anything to do with datetime.

If there are any \xE9\xB8 type chars or \u0002 type chars it seems, in the string, or text, it starts giving this error about DateTime format.

I know that redbean tries to determine the type based on the content. If it is complaining about DateTime format when i'm trying to insert a string with these special chars, then it misunderstands it as DateTime instead of longtext? But if I even explicitly cast the field using bean's meta.cast feature, it still complains.

<?php

    ini_set('display_errors', 1);
    error_reporting(E_ERROR| E_PARSE);

    require_once 'rb-mysql.php';
    R::setup('mysql:host=127.0.0.1;dbname=news', 'user', 'pass');
    require __DIR__ . '/vendor/autoload.php';
    use \Goose\Client as GooseClient;
    function dd($var){
        var_dump($var);
        die();
    }

    function get_html($url) {
        $ua = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:110.0) Gecko/20100101 Firefox/110.0";
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_USERAGENT, $ua);
        curl_setopt($ch, CURLOPT_TIMEOUT, 5);
        $response = curl_exec($ch);
        curl_close($ch);

        return $response;
    }

    function get_header($url) {
        $ua = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:110.0) Gecko/20100101 Firefox/110.0";
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_HEADER, 1);
        curl_setopt($ch, CURLOPT_NOBODY, 1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_USERAGENT, $ua);
        curl_setopt($ch, CURLOPT_VERBOSE, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 5);
        $response = curl_exec($ch);
        curl_close($ch);
        return $response;
    }

    $total = R::getCell('SELECT count(*) FROM article');

    $i = 0;
    while ($i <= $total) {
        $article = R::findOne('article', 'scraped=0');
        echo "Processing {$article->link}" . PHP_EOL;

        try {
            if (!$headers = get_header($article->link)) {
                echo "TRASHED: ". $article->link;
                R::trash($article);
                throw new Exception('Could not get headers');

            }
        }
        catch (Exception $e) {
            echo $e->getMessage();
        }

        if ($headers['http_code'] == 403){
            echo "TRASHED: ". $article->link;
            R::trash($article);
        } else {
            // scrape data here
            try {
                if (!$html = get_html($article->link)){
                    echo "TRASHED: ". $article->link;
                    R::trash($article);
                    throw new Exception('Could not get HTML');

                } else if ($html = get_html($article->link)){
                    $enc = mb_detect_encoding($html);
                    if (empty($enc)){
                        echo "TRASHED: ". $article->link;
                        R::trash($article);
                    } else {
                        echo "OKAY: ". $article->link . PHP_EOL;
                        $goose = new GooseClient();

                        if ($article_goose = $goose->extractContent($article->link, $html)) {
                            $article->goose = $article_goose->getCleanedArticleText();
                            $article->extracted = 1;
                            var_dump($article->goose);
                            R::store($article);
                        }
                        $article->scraped = 1;
                        R::store($article);
                    }

                }
            }
            catch (Exception $e) {
                echo $e->getMessage();
            }

        }
        $i++;
    }

And here's some log of the execution:

< HTTP/2 200 
HTTP/2 200 
< content-type: text/html; charset=utf-8
content-type: text/html; charset=utf-8
< server: openresty
server: openresty
< etag: W/"5503b-WZsMgj/fSjHhp8ufpR8+ATxZb4s"
etag: W/"5503b-WZsMgj/fSjHhp8ufpR8+ATxZb4s"
< last-modified: Sun, 12 Feb 2023 07:04:55 GMT
last-modified: Sun, 12 Feb 2023 07:04:55 GMT
< x-arc-pb-request-id: 8476ad9c-50c6-458e-88d8-f773a674aefc
x-arc-pb-request-id: 8476ad9c-50c6-458e-88d8-f773a674aefc
< akamai-true-ttl: 130
akamai-true-ttl: 130
< cache-control: private, max-age=60
cache-control: private, max-age=60
< expires: Sun, 12 Feb 2023 07:06:04 GMT
expires: Sun, 12 Feb 2023 07:06:04 GMT
< date: Sun, 12 Feb 2023 07:05:04 GMT
date: Sun, 12 Feb 2023 07:05:04 GMT
< content-security-policy: upgrade-insecure-requests
content-security-policy: upgrade-insecure-requests
< strict-transport-security: max-age=31536000
strict-transport-security: max-age=31536000
< content-security-policy: frame-ancestors 'self' https://*.nzherald.co.nz https://*.apnnz.co.nz https://nzme.coral.coralproject.net/;
content-security-policy: frame-ancestors 'self' https://*.nzherald.co.nz https://*.apnnz.co.nz https://nzme.coral.coralproject.net/;
< x-arc-request-id: 0.05063017.1676185504.1b222c8
x-arc-request-id: 0.05063017.1676185504.1b222c8
< akamai-true-ttl: 130
akamai-true-ttl: 130

< 
* Connection #0 to host www.nzherald.co.nz left intact
OKAY: https://www.nzherald.co.nz/nz/cannabis-referendum-government-needs-to-be-courageous-on-drug-reform-criminologist/D4MPA35ZXHUJXQU7S762DHI4OA/
string(2229) "Victoria University associate professor of criminology Fiona Hutton says the Government needs to be courageous and "get on board with global reform". Photo / 123rf

A criminologist says various states in the US have made progressive moves on drug laws and decriminalisation, and the New Zealand Government needs to be courageous.

The final results of the cannabis referendum have not changed the result, but narrow the margin to 50.7 per cent against legalising the drug for personal recreational use and 48.4 per cent in support.

Victoria University associate professor of criminology Fiona Hutton said seeing lots of change around drug law reform across the United States was positive.

"Two more states have legalised, Oregon has legalised psychadelics, psilocybin [magic mushrooms] and also descriminalised everything else.

"So I think it's really really positive and I think it's another reason our Government just needs to stand up to have courage and to get on board with the global reform."

Hutton said the Government should start working towards decriminalisation of cannabis.

Students are disappointed special votes were not enough to change the result on the referendum on cannabis legalisation.

An advocate for cannabis legalisation says despite the referendum decision, change is still needed.

Justice reform advocate Awatea Mita said Māori already bore the brunt of cannabis harms, and the referendum result meant that would continue.

"If not now then it's really just a matter of when because we can't continue to have these detrimental outcomes for Māori communities because it will affect everyone in our community, and so a change in legislation will be a move towards more equitable outcomes."

Mita said people will continue to flout the laws.

After the election, Prime Minister Jacinda Ardern said she had voted "yes" in the referendum on the legalisation of recreational cannabis.

She has committed to progressing legislation "in line with the will of the people following the release of the final results".

Changes had been made to the Misuse of Drugs Act regarding people in possession of cannabis to ensure it was treated as a health matter and this would need to be reinforced, Ardern said."
SQLSTATE[22007]: Invalid datetime format: 1366 Incorrect string value: '\xC4\x81ori ...' for column `news`.`article`.`goose` at row 1Processing https://www.nzherald.co.nz/nz/cannabis-referendum-government-needs-to-be-courageous-on-drug-reform-criminologist/D4MPA35ZXHUJXQU7S762DHI4OA/
hashborgir commented 1 year ago

I have figured the following solution. I compress the text, then get a base64 string for it, then save that into the field, and it's saved

$article->goose = base64_encode(gzencode($article_goose->getCleanedArticleText()));

And goose is created as | goose | longtext |

https://3v4l.org/ZY9Uc

So far this solves my problem.

Rayraz commented 1 year ago

This may have something to do with some utf-8 encoding issues. You could try to make sure your table columns are configured correctly. I won't pretend to understand the full details, but I ran into something similar a while ago and for me the solution was to change the encoding on the table column. There are several suggestions around this type of topic on stack overflow, perhaps one of the suggestions there suits your case.

gabordemooij commented 1 year ago

Can you give a simple example? The following code seems to work fine over here:

$a = R::dispense('article');
$a->html = "\xE9\xB8\u{0002}";
$id = R::store($a);
$a = R::load('article', $id);
dmp($a);
print_r(R::inspect('article'));

Output:

{"id":"1","html":"??\u0002"}

Array
(
    [id] => int(11) unsigned
    [html] => varchar(191)
)