anthonydb / practical-sql

Code and Data for the First Edition of "Practical SQL" by Anthony DeBarros, published by No Starch Press (2018).
Other
650 stars 399 forks source link

Regex match to newline characters on windows10 #16

Open ChandlerLutz opened 5 years ago

ChandlerLutz commented 5 years ago

Hi Anthony,

Thank you for the wonderful book.

I'm working through Ch. 13 on regular expressions on Windows10 using postgreSQL10 in pgAdmin4

I was having trouble getting the regular expressions to work for example in code listing 13-7.

I believe that the issue is related to the way new line characters are handled on windows.

This also may be related to the following issues (I am using a clean version of the imported csv file from the crime data):

https://github.com/anthonydb/practical-sql/issues/4 and https://github.com/anthonydb/practical-sql/issues/10

I was able to solve this issue with this SO answer: https://stackoverflow.com/a/20056634. Apparently windows may match newlines to \r\n

Here is my sql code for the crime time and the output, where crime_type_orig is the original from the book and the other crime_type2 and crime_type3 are based on the above SO answer:

select 
    regexp_match(original_text, '\n(?:\w+ \w+|\w+)\n(.*):') as crime_type_orig,
    -- See https://stackoverflow.com/a/20056634
    regexp_match(original_text, '\r\n(?:\w+ \w+|\w+)\r\n(.*):') as crime_type2,
    -- Based on https://stackoverflow.com/a/20056634
    regexp_match(original_text, '(?:\r\n|\r|\n)(?:\w+ \w+|\w+)(?:\r\n|\r|\n)(.*):') as crime_type3
from crime_reports;

Here is the output from pgAdmin

image

anthonydb commented 5 years ago

Hi, @ChandlerLutz! Thanks for sharing this and glad you got it working.

Did you experience a similar issue with any of the other regexes, or just the one for the crime_type?

I tried just now with a new install of PostgreSQL 11 and pgAdmin 4.1 on a on a Windows 7 laptop and a fresh download of the data from this repo, and the book's code worked as expected. So, I'm a bit puzzled as to what's occurring on your side. It sounds to me like you did everything as expected, including a clean download of the file.

Please let me know your thoughts, and we'll leave this issue open in case others run into the same.

ChandlerLutz commented 5 years ago

Hi Anthony,

So, I essentially ran in to the problem every time that a regex pattern contain the newline character \n. It might be some setting on my Windows10 machine (If I figure it out, I'll let you know). In the end I just replaced every \n with the negated capture group (?:\r\n|\r|\n). This worked well and gave me a little more regex practice!