alltheplaces / alltheplaces

A set of spiders and scrapers to extract location information from places that post their location on the internet.
https://www.alltheplaces.xyz
Other

Maybe mention in README that ATP takes data from first-party websites? #8790

Open matkoniecz opened 2 months ago

matkoniecz commented 2 months ago

And that a spider taking data from OSM/Google/Yelp etc. would not be OK?

It seems to be not specified clearly anywhere and it is relevant for potential users of this data.

iandees commented 2 months ago

This distinction shouldn't matter to data consumers. The data produced by All the Places is licensed CC-0 and is free to use by whomever.

I think the spirit of the project is fairly well defined with existing documentation that says stuff like "A project to generate point of interest (POI) data sourced from websites with 'store location' pages." and "A set of spiders and scrapers to extract location information from places that post their location on the internet."

In practice, we won't merge a PR that scrapes location data from Google, Yelp, or other commercial data aggregators.

matkoniecz commented 2 months ago

This distinction shouldn't matter to data consumers. The data produced by All the Places is licensed CC-0 and is free to use by whomever.

A spider extracting data from a noncommercial aggregator that does not license its data under CC-0 (like OpenStreetMap), with the output then labelled CC-0, would in fact be problematic in at least some jurisdictions.

matkoniecz commented 2 months ago

In practice, we won't merge a PR that scrapes location data from Google, Yelp, or other commercial data aggregators.

That is good news, but it is not necessarily obvious to potential data users and some clear promise that this status would continue would be nice

davidhicks commented 1 month ago

In practice, we won't merge a PR that scrapes location data from Google, Yelp, or other commercial data aggregators.

That is good news, but it is not necessarily obvious to potential data users and some clear promise that this status would continue would be nice

ATP already adds dataset_attributes = {"source": "api", "api": "example.net"} for data obtained from the store finder APIs that many brands use. ATP consumers in jurisdictions that care about https://en.wikipedia.org/wiki/Database_right could use dataset_attributes for more granular information on where the data for each spider was sourced from. However, I would have thought that, for anyone who cares about them, these "database rights" would apply to all spiders, regardless of whether an API was used, a sitemap was crawled and each page contains structured data, etc. Would these ATP consumers then need to contact 2500+ brands around the world and sign legal agreements with each to use location data published publicly by the brand on their websites?
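A minimal sketch of how a consumer might use this, assuming they already have a mapping of spider name to its dataset_attributes. The function name, the sample spider names and the way that mapping is obtained are all hypothetical and not part of ATP:

```python
from collections import defaultdict


def group_spiders_by_source(spider_attrs):
    """Group spider names by the "source" value of their dataset_attributes.

    spider_attrs is a hypothetical dict of {spider_name: dataset_attributes}.
    Spiders without a "source" key are grouped under "unknown".
    """
    groups = defaultdict(list)
    for spider_name, attrs in spider_attrs.items():
        source = attrs.get("source", "unknown")
        groups[source].append(spider_name)
    return dict(groups)


# Made-up sample input; only the dataset_attributes shape comes from the comment above.
SAMPLE = {
    "example_brand": {"source": "api", "api": "example.net"},
    "another_brand": {"source": "api", "api": "example.org"},
    "third_brand": {},
}

if __name__ == "__main__":
    for source, spiders in group_spiders_by_source(SAMPLE).items():
        print(source, spiders)
```

A consumer could then review each group separately, for example looking more closely at the "api" group to see which third-party hosts are involved.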

Copyright licensing of crawl results generated by ATP is not applicable because there is no creative input in the factual data ATP captures and outputs.

davidhicks commented 1 month ago

This distinction shouldn't matter to data consumers. The data produced by All the Places is licensed CC-0 and is free to use by whomever.

A spider extracting data from a noncommercial aggregator that does not license its data under CC-0 (like OpenStreetMap), with the output then labelled CC-0, would in fact be problematic in at least some jurisdictions.

The question is whether such data required any creative process to generate, or whether it's just factual data. OSM is probably considered creative when you start looking at the aggregation of shapes and lines which have been creatively drawn to depict physical features on the ground.

1000x "Somewhere close to these coordinates X,Y is a McDonalds restaurant" is a fact not requiring creative process to generate, and copyright is therefore not applicable?

1000x "This combination of shapes and lines depicting a McDonalds restaurant including drive-through area, car park, landscaping, building footprint, etc" is probably creative and therefore copyright is applicable?

davidhicks commented 1 month ago

In practice, we won't merge a PR that scrapes location data from Google, Yelp, or other commercial data aggregators.

Some ATP spiders do take data from Google/Yelp/etc APIs and web pages, but only where the data is authoritative (primary source). For example, some brands outsource their store finder pages to Google via Google Places API, or use Yext APIs for their store finder pages. It's the brand who have supplied their own location data to Google or Yext or whoever to display to users using an off-the-shelf mapping/store finder system.

Some ATP spiders also scrape aggregate data (e.g. worldwide airports, money transfer/ATM locations), although this type of scraping is done more rarely because of the risk of secondary/tertiary data sources being inaccurate.

matkoniecz commented 1 month ago

However I would have thought these "database rights" to anyone that cares would apply to all spiders, regardless of whether an API was used, a sitemap was crawled and each page contains structured data, etc.

OSM LWG analysis indicates that database rights do not apply when collecting a shop's own data, but would apply to, for example, OSM data, Yelp data or Google data.

1000x "Somewhere close to these coordinates X,Y is a McDonalds restaurant" is a fact not requiring creative process to generate, and copyright is therefore not applicable?

Database rights still apply if they exist in the relevant jurisdiction, and some jurisdictions also have the "sweat of the brow" doctrine (see the UK, where both apply).

matkoniecz commented 1 month ago

Some ATP spiders do take data from Google/Yelp/etc

Is there some way to find them? I hope these are also usable in OSM but I would want to double check

matkoniecz commented 1 month ago

(More or less, I want to triple check that it has no problem like the OpenAddresses project, where at least some data has various usage/compatibility issues which are not stated up front, like that "no spamming" Australia dataset which, while made with good intentions, is incompatible with many datasets, including OpenStreetMap.

And I prefer to check it fully before I start high-volume data imports and enable ATP data use directly in OSM editors. I prefer to avoid the kind of strategy on copyright and copyright-like rights run by the Internet Archive.)

davidhicks commented 1 month ago

Some ATP spiders do take data from Google/Yelp/etc

Is there some way to find them? I hope these are also usable in OSM but I would want to double check

Uses of Google can generally be found by running grep "google\.com" *.py in the spiders directory.

Most of these parse Google URLs matching the regex ^https:\/\/www\.google\.com\/maps\/d\/embed\?mid=\w+ where the resulting pages have an embedded JavaScript blob of simple point data that is extracted. In these instances, the brand is uploading their own data to Google Maps and using Google Maps to host a map viewer for the brand's website. Google is then possibly trying to claim copyright over the brand's uploaded data, which in jurisdictions such as the US, where Google is headquartered, appears to be a nonsense claim per https://en.wikipedia.org/wiki/Feist_Publications,_Inc.,_v._Rural_Telephone_Service_Co. (amongst other cases). I don't see anything special about these spiders versus the hundreds that use the numerous storefinder classes in /locations/storefinders/* involving third-party hosted storefinder APIs.
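For anyone who wants something a bit more structured than grep, here is a rough sketch that lists spider files mentioning google.com and any embedded-map URLs they contain. The function name and the directory argument are my own for illustration, and the pattern is the regex above with the ^ anchor dropped so it can match inside source code:

```python
import re
from pathlib import Path

# The regex quoted above, unanchored so it matches URLs embedded in Python source.
EMBED_URL = re.compile(r"https://www\.google\.com/maps/d/embed\?mid=\w+")


def find_google_references(spiders_dir):
    """Return {filename: [embed URLs found]} for *.py files mentioning google.com.

    A file that mentions google.com without an embed URL maps to an empty
    list, which flags it for manual review.
    """
    results = {}
    for path in sorted(Path(spiders_dir).glob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if "google.com" not in text:
            continue
        results[path.name] = EMBED_URL.findall(text)
    return results


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for name, urls in find_google_references("locations/spiders").items():
        print(name, urls or "(no embed URL, check manually)")
```

Files that come back with an empty list are the ones worth reading by hand, since they reference Google in some other way.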

Yelp I got mixed up with Yext. I can't find any mention of Yelp APIs being used for any spider.

davidhicks commented 1 month ago

However I would have thought these "database rights" to anyone that cares would apply to all spiders, regardless of whether an API was used, a sitemap was crawled and each page contains structured data, etc.

OSM LWG analysis indicates that database rights do not apply when collecting a shop's own data, but would apply to, for example, OSM data, Yelp data or Google data.

1000x "Somewhere close to these coordinates X,Y is a McDonalds restaurant" is a fact not requiring creative process to generate, and copyright is therefore not applicable?

Database rights still apply if they exist in the relevant jurisdiction, and some jurisdictions also have the "sweat of the brow" doctrine (see the UK, where both apply).

Not that I think the distinction matters for point data/address data that is factual, but ATP's limited use of Google Maps storefinder pages for brands is different from what OSM folk may be assuming happens when some ATP spiders extract some data from "Google Maps". The ATP spiders aren't hitting up the Google Maps API with a brand's API key and extracting point data directly. ATP spiders are instead browsing to a URL set up specifically by the brand for a Google-hosted map view (iframe) of the brand's own supplied data. The spider then just reads the results embedded within the page that is returned.

I thought most of OSM's concern with relying on Google Maps location information, rather than being a concern of copyright or database rights for European users, was that the data is generally not very accurate because it's often address data crawled off the Internet by Google and then geocoded to coordinate points with questionable accuracy, with the data often being stale with locations long closed appearing on Google Maps?

For the UK there is case law https://en.wikipedia.org/wiki/Football_DataCo#Court_case_results and https://www.vennershipley.com/insights-events/originality-in-copyright-a-review-of-thj-v-sheridan/ which seems to align the UK more with countries such as the US and Australia in not recognising that copyright exists for factual data, other than perhaps the creative display of said data (e.g. particular grouping/ordering/formatting of location data in a map or table).

Here in Australia we still get companies and even government departments trying to claim copyright over compilations of factual data, despite case law demonstrating a history of these claims not being enforceable when courts rule on them. Hence companies/government departments have instead tried to rely upon people signing contracts before being given access to data, so that general contract law applies and the customer for the data can be pursued for damages in civil proceedings. Some companies/government departments still want to display data and make it available publicly AND also pretend they can fully control it and charge some people but not others for access to the data made publicly available. Hence some have also tried to claim people have accepted a contract just by browsing to a website, which again is nonsense, with case law demonstrating that a contract needs a user to actually read it and perform an action to accept it.

matkoniecz commented 1 month ago

ATP spiders are instead browsing to a URL set up specifically by the brand for a Google-hosted map view (iframe) of the brand's own supplied data. The spider then just reads the results embedded within the page that is returned.

How can we check, confirm or guess whether, for example, https://www.google.com/maps/d/u/0/kml?mid=17KgQXKUbt-foi_HwjRewevjQtKwwkz1d&lid=fbhMYyAMWfQ&forcekml=1 (used by ATP in https://github.com/alltheplaces/alltheplaces/blob/master/locations/spiders/terrible_herbst.py ) is the "brand's own supplied data" rather than Google's own data?

(Maybe it has a good answer, but right now if I were asked "why do your ATP-based tools take Google Maps data" I would have no good answer, and I prefer to have one, also for myself...)

I thought most of OSM's concern with relying on Google Maps location information, rather than being a concern of copyright or database rights for European users, was that the data is generally not very accurate because it's often address data crawled off the Internet by Google and then geocoded to coordinate points with questionable accuracy, with the data often being stale with locations long closed appearing on Google Maps?

Copyright and copyright-like restrictions such as database rights are a major problem here; it is not primarily about quality.

POI data quality in Google is not ideal, but overall it is quite good, and if we were allowed to mass import it, I think mappers in many areas, if not all of Europe, would be really happy to do so.

I bet that in many areas, given a choice between Google Maps quality POI data and OSM quality POI data, the first would be chosen nearly universally.

matkoniecz commented 1 month ago

To clarify what I am worrying about here, let's present a theoretical case.

There is a shop chain in the USA. Some UK company, entirely separate from the one that operates the shops, created a database of their locations.

As I understand it, the following may be true:

In such a case, the following can happen:

But note "this database is protected by database rights in UK and cannot be just copied"! I would want to avoid such a scenario, without dropping my plan of making an ATP-OSM matcher and making it available to OSM mappers.

(the same can happen in multiple variations)

davidhicks commented 1 month ago

@Cj-Malone has proposed PR #9075 which records the domain from which a POI was extracted. This could be used to check for domains of ".co.uk" (or whatever country of hosting is desired to be filtered out by users of the extracted data). Does PR #9075 address this feature request?
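As a rough sketch of the kind of filtering meant here, assuming the output is GeoJSON-like features and the recorded domain ends up in a property. The key name "source_domain" below is purely hypothetical, since I am not asserting what PR #9075 actually calls it:

```python
def split_by_domain_suffix(features, suffix, domain_key="source_domain"):
    """Split features into (kept, flagged) by the suffix of their source domain.

    domain_key is a hypothetical property name; adjust it to whatever the
    extracted data really uses. Features without the property are kept.
    """
    kept, flagged = [], []
    for feature in features:
        domain = (feature.get("properties", {}).get(domain_key) or "").lower()
        if domain.endswith(suffix):
            flagged.append(feature)
        else:
            kept.append(feature)
    return kept, flagged


# Made-up sample features for illustration only.
SAMPLE_FEATURES = [
    {"type": "Feature", "properties": {"name": "A", "source_domain": "shop.example.co.uk"}},
    {"type": "Feature", "properties": {"name": "B", "source_domain": "example.com"}},
    {"type": "Feature", "properties": {"name": "C"}},
]

if __name__ == "__main__":
    kept, flagged = split_by_domain_suffix(SAMPLE_FEATURES, ".co.uk")
    print(len(kept), "kept,", len(flagged), "flagged")
```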

matkoniecz commented 1 month ago

It would definitely help in monitoring/detecting of problematic data but...

1) It is normal for, say, a Polish company or organisation to use a .com domain rather than a .pl domain, and I expect this to hold elsewhere, so filtering based on domain names is not enough

2) It does not limit ATP to sources covered by the LWG decision, so people wishing to import ATP data into OSM would need to continuously monitor/filter it

3) It may also have false positives - see the terrible_herbst case in https://github.com/alltheplaces/alltheplaces/issues/8790#issuecomment-2254154644 (is it using Google data or OK data? If this data is actually acceptable then it would be overzealously skipped)