coleifer / micawber

a small library for extracting rich content from urls
http://micawber.readthedocs.org/
MIT License

bootstrap_basic raw strings / escapes #86

Closed (jaap3 closed this issue 5 years ago)

jaap3 commented 5 years ago

I noticed that many of the regular expression patterns in bootstrap_basic don't escape dots. An unescaped dot matches any character, so a fair number of these patterns will match more than intended.
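To illustrate the over-matching, here is a minimal sketch using the soundcloud pattern quoted later in this issue. The test URL is hypothetical, and the escaped variant is only one possible fix, not necessarily the pattern that ended up in the library.

```python
import re

# Pattern as quoted in this issue: the dot before "com" is not escaped.
# Backslashes are doubled here only so this snippet itself has valid escapes.
loose = re.compile("https://\\S*?soundcloud.com/\\S+")

# Possible fix (an assumption, not the library's actual pattern):
# a raw string with the dot escaped.
strict = re.compile(r"https://\S*?soundcloud\.com/\S+")

# Hypothetical URL that should not be treated as a soundcloud link.
url = "https://soundcloudxcom/track"

print(bool(loose.match(url)))   # True: the bare "." matched the "x"
print(bool(strict.match(url)))  # False: a literal "." is required
```

Both patterns still match a genuine URL such as https://soundcloud.com/artist/track; only the unintended match goes away.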

In addition, most patterns aren't marked as raw strings and therefore contain invalid escape sequences. This isn't directly noticeable, but could cause issues in a future Python version.

For an example of the latter:

python -W always -c '"https://\S*?soundcloud.com/\S+"'
<string>:1: DeprecationWarning: invalid escape sequence \S
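The same check can be scripted. The sketch below compiles the literal's source text and records any warnings; the helper name escape_warnings is made up for illustration. The warning category and message wording differ between Python versions, so the code only checks whether a warning was emitted.

```python
import warnings

# Source text of the string literal as it would appear in a .py file.
non_raw_source = '"https://\\S*?soundcloud.com/\\S+"'
# The same literal with an r prefix, i.e. a raw string.
raw_source = 'r"https://\\S*?soundcloud.com/\\S+"'


def escape_warnings(source):
    """Compile a literal and return the messages of any warnings emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<string>", "eval")
    return [str(w.message) for w in caught]


# One or more messages; exact wording depends on the Python version.
print(escape_warnings(non_raw_source))
# The raw string compiles cleanly.
print(escape_warnings(raw_source))  # []
```

Python currently keeps an unrecognized escape such as \S as a backslash followed by S, so both literals produce the same string today. That is why the problem goes unnoticed unless warnings are enabled.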

coleifer commented 5 years ago

Yes, you are right. This is from me merging patches without reviewing them closely enough. I will fix these regexes.

jaap3 commented 5 years ago

I can also work on a PR if you'd like. It'll save you some work.


coleifer commented 5 years ago

Think we should be good to go with the last few commits, but thank you for the offer.

coleifer commented 5 years ago

Pushed a new release, 0.4.0