amphi-ai / amphi-etl

Visual Data Transformation with Python Code Generation. Low-Code Python-based ETL.
https://amphi.ai

Parse & Extract : wrong regex for email #192

Open simonaubertbd opened 2 hours ago

simonaubertbd commented 2 hours ago

Hello,

I ran a small test on a string that may contain email addresses. Here is the result: [image]

As you can see, one of the results is incorrect: it ends in ..com, with two dots.

I think you should use this instead:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

According to https://emailregex.com/, it's the RFC 5322 Official Standard.
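
To illustrate the difference, here is a minimal Python sketch comparing a permissive pattern with the one published on emailregex.com. The `LOOSE_EMAIL` pattern is purely hypothetical: it is not taken from Amphi's source and only stands in for any pattern that allows consecutive dots in the domain. The sample string, the `extract_emails` helper, and the `re.IGNORECASE` flag are likewise assumptions made for this example.

```python
import re

# Hypothetical permissive pattern (NOT Amphi's actual regex): the domain
# character class includes ".", so consecutive dots are accepted.
LOOSE_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Pattern listed on emailregex.com as "RFC 5322 Official Standard".
# It is written in lowercase, so IGNORECASE is added here as an assumption.
RFC5322_EMAIL = re.compile(
    r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"""
    r"""|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]"""
    r"""|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")"""
    r"""@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"""
    r"""|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"""
    r"""(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:"""
    r"""(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]"""
    r"""|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])""",
    re.IGNORECASE,
)


def extract_emails(text, pattern):
    """Return every non-overlapping match of `pattern` found in `text`."""
    return [m.group(0) for m in pattern.finditer(text)]


# Made-up sample containing one malformed and one well-formed address.
sample = "Contact john.doe@example..com or jane@example.org"

print(extract_emails(sample, LOOSE_EMAIL))
# ['john.doe@example..com', 'jane@example.org']
print(extract_emails(sample, RFC5322_EMAIL))
# ['jane@example.org']
```

In the second pattern, every domain label must start and end with a letter or digit, so an empty label between two dots cannot match and the `..com` string is skipped.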

(It would be even better if you changed the label to "Email - RFC 5322"; it gives it street cred ;) )

Best regards,

Simon

tgourdel commented 2 hours ago

Thanks Simon, good catch, I'll try your suggestion!