Tjatse / node-readability

Scrape/Crawl article from any site automatically. Make any web page readable, no matter Chinese or English.
341 stars 36 forks

regexps for images - add webp and remove $ #50

Open natocTo opened 6 years ago

natocTo commented 6 years ago

Is it necessary to have $ at the end of the regex? (I do not understand regex very well.) Images with a src like "/images/for.png?1716226" (some kind of cache busting) are not recognized now. I also suggest adding webp because it is quite popular nowadays.

natocTo commented 6 years ago

Today I saw that I can override this setting. That is clever. But maybe this PR is still worth a look.

Tjatse commented 6 years ago

This sounds reasonable, especially for webpack-based projects (vue.js, react...) or some nginx configurations (the /images/for.png?1716226 style is very common now).

/\.(gif|jpe?g|png|webp)/i is okay but not safe: some URLs, e.g. /path/to/a.gif.cropper/1202, will be treated as images.
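A minimal sketch of that false positive, using the unanchored pattern exactly as quoted. The helper name looksLikeImageLoose is hypothetical and is not part of the library:

```javascript
// Unanchored pattern from the comment: matches the extension anywhere in the string.
const looseImagePattern = /\.(gif|jpe?g|png|webp)/i;

// Hypothetical helper, for illustration only.
function looksLikeImageLoose(src) {
  return looseImagePattern.test(src);
}

// Cache-busted image URL: matched, as intended.
console.log(looksLikeImageLoose('/images/for.png?1716226'));     // true

// Not an image, but ".gif" appears in the middle of the path, so it still matches.
console.log(looksLikeImageLoose('/path/to/a.gif.cropper/1202')); // true
```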

/\.(gif|jpe?g|png|webp)(\?[^?]+)?$/i is better. It should pass these tests: /path/to/some-image.png, /path/to/some-image.jpg, /path/to/some-image.png?1231312, /path/to/some-image.webp?a=c&d=1
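The listed cases can be checked with a short script. This is only a sketch: the pattern is copied from the comment, while the helper name isImageUrl and the array names are assumptions made for illustration and do not reflect how the library wires the pattern in.

```javascript
// Anchored pattern from the comment: extension, optional single query string, end of input.
const strictImagePattern = /\.(gif|jpe?g|png|webp)(\?[^?]+)?$/i;

// Hypothetical helper, for illustration only.
function isImageUrl(src) {
  return strictImagePattern.test(src);
}

// Paths that the comment says must be accepted.
const shouldMatch = [
  '/path/to/some-image.png',
  '/path/to/some-image.jpg',
  '/path/to/some-image.png?1231312',
  '/path/to/some-image.webp?a=c&d=1',
];

// Path mentioned earlier in the thread as one that must not count as an image.
const shouldNotMatch = ['/path/to/a.gif.cropper/1202'];

for (const src of shouldMatch) {
  console.log(src, isImageUrl(src)); // true for each
}
for (const src of shouldNotMatch) {
  console.log(src, isImageUrl(src)); // false
}
```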

I'll commit this after some necessary tests with my spiders, and release it ASAP!

Thanks a lot.