EnterpriseDB / mysql_fdw

PostgreSQL foreign data wrapper for MySQL
Other
531 stars 162 forks source link

REGEX doesn't work #77

Closed frankgard closed 8 years ago

frankgard commented 8 years ago

Thanks to Git-Commit 9756b7070… I was able to compile mysql_fdw against 9.5b2 on my Debian GNU/Linux 8.x ("Jessie") using PGDG-PostgreSQL-Packages and a current version of MariaDB (compilation against libmysqlclient-dev). Great and thx a lot!

Unfortunately I ran into a problem trying to use RegExpressions. The MySQL like syntax doesn't work as the PostgreSQL like syntax, too. Here what I tried:

  1. The MySQL part works fine:

    MariaDB [testdb]> SELECT text_val,text_locale FROM geodb_textdata
    WHERE text_val NOT REGEXP '^\\d+$' LIMIT 10
    ;
    +-------------+-------------+
    | text_val    | text_locale |
    +-------------+-------------+
    | Österreich  | de          |
    | OESTERREICH | de          |
    | A           | NULL        |
    | A           | NULL        |
    | Bundesland  | de          |
    | Burgenland  | de          |
    | BURGENLAND  | de          |
    | Bezirk      | de          |
    | Eisenstadt  | de          |
    | EISENSTADT  | de          |
    +-------------+-------------+
    10 rows in set (0.00 sec)
    
  2. Using a mysql_fdw foreign table and the MySQL-style syntax doesn't work:

    testdb=> SELECT * FROM maria_geodb_textdata WHERE text_val NOT REGEX '^\d+$';
    FEHLER:  Syntaxfehler bei „NOT“
    ZEILE 1: SELECT * FROM maria_geodb_textdata WHERE text_val NOT REGEX ...
                                                              ^
    testdb=> SELECT * FROM maria_geodb_textdata WHERE NOT text_val REGEX '^\d+$';
    FEHLER:  Syntaxfehler bei „REGEX“
    ZEILE 1: ...CT * FROM maria_geodb_textdata WHERE NOT text_val REGEX '^\d...
                                                                 ^
    
  3. Using a mysql_fdw foreign table and the PostgreSQL-style syntax doesn't work:

    testdb=> SELECT * FROM maria_geodb_textdata WHERE text_val !~ '^\d+$';
    FEHLER:  failed to prepare the MySQL query:
    You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near ''^\\d+$'))' at line 1
    Zeit: 1,406 ms
    
  4. Access to the foreign table using a less special syntax, e.g. LIKE, works as expected:

    testdb=> SELECT text_val, text_locale FROM maria_geodb_textdata
    WHERE text_val LIKE 'A_%' LIMIT 10
    ;
    text_val | text_locale
    ----------+-------------
    Antau    | de
    ANTAU    | de
    Antau    | de
    ANTAU    | de
    Andau    | de
    ANDAU    | de
    Andau    | de
    ANDAU    | de
    Apetlon  | de
    APETLON  | de
    (10 Zeilen)
    
ahsanhadi commented 8 years ago

Can you please send your complete test case so we can verify our fix with your test case before committing the fix?

frankgard commented 8 years ago

Please find attached the Dumpfile of the MySQL-/MariaDB-table "geodb_textdata", and the SQL Statements for creating the foreign table. I hope these help. Please let me know, if you need anything else! Thx.

geodb_textdata.zip mysql_fdw.sql.txt

ahsanhadi commented 8 years ago

The following query is not longer throwing an error after the fix that Ibrar committed last week.

SELECT * FROM maria_geodb_textdata WHERE text_val !~ '^\d+#39;;

I also ran PG style reg expression query with your database and it seem to be returning correct results :

edb=# SELECT loc_id,text_val FROM maria_geodb_textdata WHERE text_val ~ '^kulm$'; LOG: statement: SELECT loc_id,text_val FROM maria_geodb_textdata WHERE text_val ~\ '^kulm$'; loc_id | text_val --------+---------- 60051 | KULM 61130 | Kulm 61130 | KULM 61330 | Kulm 61330 | KULM 62691 | Kulm 62691 | KULM 63083 | Kulm 63083 | KULM 65590 | Kulm 65590 | KULM 72602 | Kulm 72602 | KULM 76929 | KULM 76930 | KULM 77279 | KULM 151252 | Kulm 151252 | KULM 78626 | KULM (19 rows)

However i am not sure if it is returning correct results,

edb=# SELECT count(_) FROM maria_geodb_textdata WHERE textval !~ '^\d+#39;; LOG: statement: SELECT count() FROM maria_geodb_textdata WHERE text_val !~ '^\d+#39;;

count

663923 (1 row)

what does it return for you?

On Sun, Jan 3, 2016 at 6:18 PM, frankgard notifications@github.com wrote:

Please find attached the Dumpfile of the MySQL-/MariaDB-table "geodb_textdata", and the SQL Statements for creating the foreign table. I hope these help. Please let me know, if you need anything else! Thx.

geodb_textdata.zip https://github.com/EnterpriseDB/mysql_fdw/files/76591/geodb_textdata.zip mysql_fdw.sql.txt https://github.com/EnterpriseDB/mysql_fdw/files/76605/mysql_fdw.sql.txt

— Reply to this email directly or view it on GitHub https://github.com/EnterpriseDB/mysql_fdw/issues/77#issuecomment-168497105 .

Ahsan Hadi Snr Director Product Development EnterpriseDB Corporation The Enterprise Postgres Company

Phone: +92-51-8358874 Mobile: +92-333-5162114

Website: www.enterprisedb.com EnterpriseDB Blog: http://blogs.enterprisedb.com/ Follow us on Twitter: http://www.twitter.com/enterprisedb

This e-mail message (and any attachment) is intended for the use of the individual or entity to whom it is addressed. This message contains information from EnterpriseDB Corporation that may be privileged, confidential, or exempt from disclosure under applicable law. If you are not the intended recipient or authorized to receive this for the intended recipient, any use, dissemination, distribution, retention, archiving, or copying of this communication is strictly prohibited. If you have received this e-mail in error, please notify the sender immediately by reply e-mail and delete this message.

frankgard commented 8 years ago

Thank you! There are subtle differences in REGEX-syntax of PostgreSQL and MySQL/MariaDB, especially concerning the number of backslashes ('\d' vs. '\d'):

MariaDB [exacfg]> SELECT COUNT(*) FROM geodb_textdata WHERE text_val NOT REGEXP '^\\d+$';
+----------+
| COUNT(*) |
+----------+
|   297117 |
+----------+
1 row in set (0.48 sec)
MariaDB [exacfg]> SELECT COUNT(*) FROM geodb_textdata WHERE text_val NOT REGEXP '^\d+$';
+----------+
| COUNT(*) |
+----------+
|   663848 |
+----------+
1 row in set (0.43 sec)

I.e. the count I get using just a single backslash (which is not(!) what is meant by the identical PostgreSQL-RegEx!) is very similar (but not exactly equal to) your result. At the moment I can't imagine where the difference comes from, so further investigation is required…

I have to recompile newest mysql_fdw code against Pg9.5rc1 to check your result and find out what could be the reason for the deviation. Unfortunately this has to wait a moment :( . Hopefully I can do this tomorrow… Cheers, Frank.

frankgard commented 8 years ago

Yeah!!!!!! I managed to compile current mysql_fdw against PGDG's PostgreSQL 9.5rc1 on my Debian 8 ("Jessie") box, and fortunately RegExpressions now seem to work using PostgreSQL-syntax:

exacfg=> SELECT COUNT(*) FROM maria_geodb_textdata WHERE text_val !~ '^\d+$';
 count  
--------
 297117
(1 Zeile)
exacfg=> SELECT count(*) FROM maria_geodb_textdata WHERE text_val !~ '^\\d+$';
 count  
--------
 663923
(1 Zeile)

The first result is the most important one. The latter is simply "all rows", because none of the "text_val"s contains a literal backslash at the beginning followed by at least one digit at the end:

MariaDB [exacfg]> SELECT COUNT(*) FROM geodb_textdata WHERE text_val NOT REGEXP '\\\\d+$';
+----------+
| COUNT(*) |
+----------+
|   663923 |
+----------+
1 row in set (0.39 sec)
MariaDB [exacfg]> SELECT COUNT(*) FROM geodb_textdata;
+----------+
| COUNT(*) |
+----------+
|   663923 |
+----------+
1 row in set (0.23 sec)

Now my explanation of the difference:

MariaDB [exacfg]> SELECT text_val,COUNT(*) AS anzahl FROM geodb_textdata WHERE text_val REGEXP '^\d+$' GROUP BY text_val;
+----------+--------+
| text_val | anzahl |
+----------+--------+
| D        |     40 |
| DD       |     35 |
+----------+--------+
2 rows in set (0.47 sec)
exacfg=> SELECT count(*) FROM maria_geodb_textdata WHERE text_val ~ '^d+$';
 count 
-------
     0
(1 Zeile)
exacfg=> SELECT count(*) FROM maria_geodb_textdata WHERE text_val ~* '^d+$';
 count 
-------
    75
(1 Zeile)
exacfg=> SELECT text_val,count(*) FROM maria_geodb_textdata WHERE text_val ~* '^d+$' GROUP BY text_val;
 text_val | count 
----------+-------
 D        |    40
 DD       |    35
(2 Zeilen)

Obviously, the deviation has to do with the different case (upper/lower) default behaviour of RegExs in PostgreSQL (case sensitive) and MariaDB/MySQL (case insensitive). Very probably this can be adjusted by MariaDB-configuration, but I'm no MySQL expert and don't want to spend too much time with these details :) .

Summary: Thank you very, very much! Great work!!!! Frank.