Open mal-tee opened 1 year ago
Thank you for opening this issue (based on my email).
Should we turn this into a test? @baltpeter
I haven't looked into that particular case yet. Are we sure that that is a mistake?
But, either way, we can't generally forbid two records having identical `runs` entries. There are already valid records where that is the case, e.g. the Amazon records for different companies:
https://github.com/datenanfragen/data/blob/master/companies/amazon-de.json
https://github.com/datenanfragen/data/blob/master/companies/amazon-es.json
> I haven't looked into that particular case yet. Are we sure that that is a mistake?
Haven't looked either. :sweat_smile:
> But, either way, we can't generally forbid two records having identical `runs` entries. There are already valid records where that is the case, e.g. the Amazon records for different companies:
> https://github.com/datenanfragen/data/blob/master/companies/amazon-de.json
> https://github.com/datenanfragen/data/blob/master/companies/amazon-es.json
Yeah, we should only do that test if there is no overlap in the countries. :thinking:
> Yeah, we should only do that test if there is no overlap in the countries. :thinking:
If there is overlap in the countries, you mean, right?
But even then, I'm not sure whether there can never be a case where that is valid…
> If there is overlap in the countries, you mean, right?
Yes, oops.
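For reference, the condition discussed above (only compare records whose countries overlap) could be sketched roughly like this. The function name `countries_overlap` is made up for illustration, and treating `"all"` as overlapping with every country is an assumption about how `relevant-countries` should be read, not something the schema is confirmed to guarantee:

```python
def countries_overlap(countries_a, countries_b):
    """Return True if two `relevant-countries` lists share at least one country.

    Assumption: the special value "all" overlaps with every country.
    """
    a, b = set(countries_a), set(countries_b)
    if "all" in a or "all" in b:
        return True
    # Otherwise the lists overlap only if they have a country code in common.
    return bool(a & b)
```

With that, the Amazon records for `de` and `es` would not be compared at all, while a record marked `all` would be compared against everything.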
I wrote a little script to implement this:
from collections import defaultdict
import os
import json

# Map every company name and every `runs` entry to the slugs of the records using it.
hashmap = defaultdict(list)
for file in os.listdir("companies/"):
    with open("companies/" + file, "r") as f:
        company = json.load(f)
    slug = company["slug"]
    hashmap[company["name"]].append(slug)
    if "runs" in company:
        for run in company["runs"]:
            hashmap[run].append(slug)

# Names that appear in more than one place, regardless of country.
simple_overlap = {k: v for k, v in hashmap.items() if len(v) > 1}
print("simple", len(simple_overlap.keys()))

for name, slugs in simple_overlap.items():
    used_rvs = defaultdict(list)  # country code -> slugs relevant for that country
    alls = set()  # contains `name` if any of its records is relevant for "all" countries
    for slug in slugs:
        with open("companies/" + slug + ".json", "r") as f:
            company = json.load(f)
        if "relevant-countries" in company:
            if company["relevant-countries"] == ["all"]:
                alls.add(name)
            else:
                for rv in company["relevant-countries"]:
                    used_rvs[rv].append(slug)
    # NOTE: `> 2` only reports a country once three or more slugs share it;
    # `> 1` would also catch pairs.
    filtered_overlap = {k: v for k, v in used_rvs.items() if len(v) > 2 or name in alls}
    if filtered_overlap:
        print(name, filtered_overlap, alls)
simple 38
REWE Markt GmbH {'de': ['rewe-shop']} {'REWE Markt GmbH'}
Ideawise Limited {'de': ['gay-de', 'fetisch-de', 'poppen-de', 'kaufmich-com']} set()
Seven.One Entertainment Group GmbH {'de': ['sat1gold', 'prosieben', 'kabeleinsdoku', 'kabeleins']} set()
cpx online active AG {'de': ['optivel'], 'ch': ['optivel'], 'fr': ['optivel'], 'at': ['optivel']} {'cpx online active AG'}
Ingenico Payment Services GmbH {'de': ['ingenico-de']} {'Ingenico Payment Services GmbH'}
Ingenico Healthcare GmbH {'de': ['ingenico-de']} {'Ingenico Healthcare GmbH'}
Yeah, we'd also have to check if the websites are different. And probably every other key as well.
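A rough sketch of what such a website comparison could look like, in case it helps. It assumes the records store their website under a `web` key containing a full URL with a scheme; the helper names `normalized_host` and `same_website` are hypothetical and not part of the existing tooling:

```python
from urllib.parse import urlsplit


def normalized_host(url):
    """Lower-cased host name of a URL without a leading "www." ("" if missing)."""
    host = urlsplit(url or "").hostname or ""
    return host[4:] if host.startswith("www.") else host


def same_website(record_a, record_b, key="web"):
    """True if both records have the (assumed) `web` key pointing to the same host."""
    host_a = normalized_host(record_a.get(key))
    host_b = normalized_host(record_b.get(key))
    # Records without a usable URL are never treated as the same website.
    return host_a != "" and host_a == host_b
```

Two records sharing a name and a country would then only be reported when `same_website` is also true. Comparing only the host is a deliberate simplification; two records on the same host but with different paths would still be flagged.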
However, we can close this issue: the REWE group collision is okay, since the websites are different.
I see my original concern as unresolved. The database currently shows two responsible entities for REWE Markt GmbH.
As I understand it, that cannot be correct, because the assignment is then ambiguous. Which sources indicate that REWE Zentralfinanz eG is also responsible for REWE Markt GmbH? I have not been able to verify this so far.
Both have "Rewe Markt GmbH" in the `runs` array. Seems like a mistake we should resolve?