alltheplaces / alltheplaces

A set of spiders and scrapers to extract location information from places that post their location on the internet.
https://www.alltheplaces.xyz

Pipeline: image is a marketing cdn #10623

Open CloCkWeRX opened 2 weeks ago

CloCkWeRX commented 2 weeks ago

In #10621 there are many instances of:

https://dynl.mktgcdn.com/p/vfVdaxyP_jDZdsuJUp6zjXXz7jkJ87XeGPMuAnvDNYU/150x150.png

Blacklisting that CDN or URL pattern seems like it would be a good catchall.
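A minimal sketch of what such a host-based check might look like, assuming a hypothetical blocklist set and helper name (neither exists in the repository as far as this thread shows). It matches the domain and its subdomains rather than a raw substring, so unrelated hosts that merely contain the text are not caught:

```python
from urllib.parse import urlsplit

# Hypothetical blocklist; mktgcdn.com is the only CDN named in this issue.
BLOCKED_IMAGE_HOSTS = {"mktgcdn.com"}


def is_blocked_image_host(url: str) -> bool:
    """Return True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in BLOCKED_IMAGE_HOSTS)
```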

matkoniecz commented 2 weeks ago

I suspect that real venue-specific images could also be delivered via CDN :(

matkoniecz commented 2 weeks ago

Though https://dynl.mktgcdn.com does repeat an awful lot in #10630

davidhicks commented 2 weeks ago

I've yet to find a useful picture hosted at mktgcdn.com -- do any exist?

We could create a simple ImageValidationPipeline class that raises a warning in the following conditions:

Cj-Malone commented 2 weeks ago

https://dynl.mktgcdn.com/p/ECzcUdFRYZh4eYh4BKvyLPfmBTaV6uuMBv0PACHqIno/619x614.png

CloCkWeRX commented 2 weeks ago

Perhaps some kind of images_checked = True flag to suppress the banned-CDN check, given the low hit rate?

Alternatively, for mktgcdn, ban all square pictures under 300 x 300 px?
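A rough sketch of that size rule, under two assumptions: that mktgcdn URLs end in "<width>x<height>.<ext>" as the examples in this thread do, and that images_checked is passed in as a plain boolean. The function name and the way the flag is supplied are hypothetical:

```python
import re
from urllib.parse import urlsplit

# Assumption: the last path segment looks like "150x150.png".
SIZE_RE = re.compile(r"/(\d+)x(\d+)\.\w+$")
MIN_SIDE = 300  # threshold suggested in this thread


def is_small_square_mktgcdn(url: str, images_checked: bool = False) -> bool:
    """Return True for square mktgcdn images smaller than MIN_SIDE, unless overridden."""
    if images_checked:
        # The spider author has vouched for the images, so skip the check.
        return False
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not (host == "mktgcdn.com" or host.endswith(".mktgcdn.com")):
        return False
    match = SIZE_RE.search(parts.path)
    if match is None:
        # No size in the URL, so this heuristic cannot decide.
        return False
    width, height = int(match.group(1)), int(match.group(2))
    return width == height and width < MIN_SIDE
```

With this rule the 150x150 URL from #10621 would be flagged, while the 619x614 example above would pass.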

I think blacklisting data: URLs might be worthwhile as well. They could be legitimate pictures, but they are likely content you wouldn't publish to OSM.
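Detecting those could be as small as a scheme check. The helper name below is hypothetical; the comparison is case-insensitive because URI schemes are:

```python
def is_data_url(url: str) -> bool:
    """Return True if the value is an inline data: URL rather than a link to a hosted image."""
    return url.strip().lower().startswith("data:")
```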