Norconex / committer-sql

Implementation of Norconex Committer for SQL (JDBC) databases.
https://opensource.norconex.com/committers/sql/
Apache License 2.0
1 stars 6 forks source link

Incorrect String Value while committing to MYSQL #5

Closed HappyCustomers closed 5 years ago

HappyCustomers commented 5 years ago

Dear Mr. Paul, I am geting the following error while committing to MYSQL 5.7 version

Caused by: java.sql.SQLException: Incorrect string value: '\xF0\x9F\x91\x89 I...' for column 'content' at row 1 Query: INSERT INTO bangalore_email_test(starturl,document_reference,Server,id,title,email,Date,content)

I had raised this issue earlier https://github.com/Norconex/committer-sql/issues/3 and U had suggested to use ReplaceTagger.

I tried the following config


<transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer" caseSensitive="false">
<replace>
<fromValue>[\xF0\x9F\x91\x89]{1,}</fromValue>
<toValue>newyork</toValue>
</replace>
</transformer>

And this configuration

    <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
                        <replace replaceAll="true">
                            <fromValue>[\xF0\x9F\x91\x89]{1,}</fromValue>
                            <toValue>orange</toValue>
                        </replace>
    </tagger>

I don't need this Hexadecimal data. Please advice how to use ReplaceTransfomer or ReplaceTagger

this is the special character I am getting from the webpage

👉

Regards Balu

essiembre commented 5 years ago

A few people have this problem with MySQL. Seems to be when dealing with supplementary characters (like emoticons/emojis). You have to tell your MySQL to use utf8mb4. Either as part of your createTableSQL or createFieldSQL, you can make sure to have the following (change the beginning):

yourfield varchar(999) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL

You can find other examples online.

To really replace those instead, you will have to find their unicode equivalent (as those are hex right now). For instance, it looks like \xF0\x9F\x91\x89 translates to \u1F449 (pointing finger icon 👉).

HappyCustomers commented 5 years ago

As a stop gap solution we had set the charset to utf8mb4 at the server level setting in my.ini and for the content column and we were able to continue the crawling.

I will try the \u1F449in the replaceTagger

Thank you and will close this question as the settings in mysql will suffice now for our project