Potential NEW datasets - GitHub Issues

Remake: maybe, adding via email domain?

Eco-stylist: no, we would have to scrape; there doesn’t seem to be an API.

Ethical.net: could easily be scraped, as it all seems to be on one page and nicely structured, but it looks sort of out of date.

Ethicalconsumer: looks like it would take quite a bit of effort for little gain.

Change.org: almost certainly not.

Iffy Quotient: I can’t see a way to actually get the data, only a white paper on implementing it. It also feels a bit like the Botometer in that it works best in aggregate (if you consider it working at all).

We can use AllSides data as it’s under CC-BY-NC. It would be good to put in a request to get it as CSV/JSON, and perhaps API access if possible. A quick browse shows that a lot of entries aren’t consistently linked to the actual site.

Ad Fontes Media’s bias chart is very interesting. I got the data; there is a lot of junk and it is mostly article-wise. It can probably be implemented, but it will need some work to get it to filter right.

BVDinfo just offers me a job; I can’t see any info.

ZoomInfo: at first I couldn’t see how to use it without agreeing to terms / signing up. Ah, I got in. It will need extensive adversarial scraping, like Glassdoor; I got stopped 3 times to ask if I was a robot.

DNB is absolutely nuts with how well protected its data is; however, the data is beautiful, so it’s understandable. Basically, I think it is going to take quite a lot to get anything out of it.

Powrbot looks like a frontend to something else. Pages often have wiki-style reference markers but no references, and similar titles to Wikipedia, lol: https://powrbot.com/companies/profile/mediacom/ It also requires a login for API access. Probably scrapeable, but honestly not the best quality.

GlobalData uses some proprietary IDs (“repId” and “companyId”) and limits page views: https://www.globaldata.com/TreeFilter/GetCompanySnapshotFinancials/?repId=A249A&companyId=10196 This is almost certainly going to be a problem to scrape. You need an account to have “unlimited page views”. We could potentially scrape slowly with an account and then reuse those IDs if they are unchanging, but it’s going to be hard to connect them back to a URL (unless that gets added by having a login).
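If we did end up storing those IDs, pulling them out of a snapshot URL is straightforward. A minimal sketch, assuming the two query parameters always appear as in the example URL above; the function name `globaldata_ids` is made up for illustration and nothing here touches the network:

```python
from urllib.parse import urlsplit, parse_qs


def globaldata_ids(url):
    """Extract the repId and companyId query parameters from a GlobalData URL.

    Returns None for a parameter that is missing rather than raising,
    since we do not know how consistent their URLs are.
    """
    query = parse_qs(urlsplit(url).query)
    return {
        "repId": query.get("repId", [None])[0],
        "companyId": query.get("companyId", [None])[0],
    }


# The example URL from the note above.
example_url = (
    "https://www.globaldata.com/TreeFilter/GetCompanySnapshotFinancials/"
    "?repId=A249A&companyId=10196"
)
example_ids = globaldata_ids(example_url)
print(example_ids)
```

Whether the IDs stay stable over time is still the open question; this only covers the extraction step.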

Wikirate isn’t organised URL-wise, but it does tag Wikipedia articles (although by ID number, not by name). If the data were more concise, or could be shown interestingly and usefully, it would be worth the effort to scrape.

Companies House data is probably useless to us, and it’s not easily linked unless the company already has its company number listed in Wikidata. Not sure what data we would display; perhaps the mortgages data? Not super sure that’s even useful. Officers/partners would be interesting, but if Wikidata has the Companies House number it will likely have that data already (let’s be honest about that).

Offshore Leaks is super interesting, but it’s not really company- or website-wise. I think if we implement it, it should be person-wise and cumulative (i.e. this company is associated with X people involved in Y offshore leaks). But honestly, even then I struggle to see a good way of presenting it.
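To make the person-wise, cumulative idea a bit more concrete, here is a rough sketch of the aggregation. Everything in it is hypothetical: the two input mappings, the placeholder company, person and leak names, and the wording of the summary line. It only shows the counting, not how we would get or link the data.

```python
from collections import defaultdict


def leak_summary(company_people, person_leaks):
    """Count, per company and per leak, how many associated people appear in that leak."""
    summary = {}
    for company, people in company_people.items():
        counts = defaultdict(int)
        for person in people:
            # A person with no known leak entries contributes nothing.
            for leak in person_leaks.get(person, ()):
                counts[leak] += 1
        summary[company] = dict(counts)
    return summary


def describe(company, counts):
    """Render one line per leak, sorted by leak name for stable output."""
    return [
        f"{company} is associated with {n} people named in {leak}"
        for leak, n in sorted(counts.items())
    ]


# Placeholder data, purely for illustration.
company_people = {"Example Corp": ["Person A", "Person B", "Person C"]}
person_leaks = {"Person A": {"Leak 1"}, "Person B": {"Leak 1", "Leak 2"}}

summary = leak_summary(company_people, person_leaks)
for line in describe("Example Corp", summary["Example Corp"]):
    print(line)
```

This still leaves the presentation problem open; it only shows that the cumulative count itself is cheap once people are linked to companies.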

Bloomsbury fashion central, I’m not super sure what the data would be that we’d take, let alone how we would link it.

OpenSanctions: holy shit, I am so so so so so impressed by how clean, well constructed and open the site is. Most of it is unusable because it can’t be linked; however, the Wikidata-linked people could probably be used for something, particularly showing in aggregate whether people have been involved in something (see “Topics” on https://www.opensanctions.org/search/?scope=wikidata&schema=Person). How often this data will come up is uncertain. It will probably be useful at a certain scale of linking (i.e. the more data is linked to Wikidata, the more likely you’d see a hit), but I’m still uncertain. Super fucking cool though.

OpenCorporates could probably give us a better graph (in terms of number of subsidiaries, or for augmentation and getting hierarchies). “Open” doesn’t mean free, however: bulk data is behind a paywall, so anything interesting is going to either take a while to scrape or cost money for access. We should carefully consider what to use and how.

Journalisted doesn’t exist anymore, which SUCKS because it sounds really cool for us, but REALLY BAD for journalists (it sounds like it could fuel harassment and mass-reporting campaigns).

Not sure what I’m looking for on HathiTrust.

HAH! Crunchbase’s extension works almost as if the designer had made it for us! The extension looks up a domain and gets a “score”, which I think is probably some form of match confidence:

    curl 'https://www.crunchbase.com/v4/data/bulk_matches' \
      --data-raw '{"matching_entities":[{"domain":"www.facebook.com"}]}' \
      --compressed

    {
      "matched": [
        {
          "input_domain": "www.facebook.com",
          "cb_url": "https://www.crunchbase.com/organization/facebook",
          "score": 2.5,
          "input_name": null,
          "uuid": "df662812-7f97-0b43-9d3e-12f64f504fbb"
        }
      ],
      "unmatched": []
    }

The uuid is then used to grab the data we’d want:

    curl 'https://www.crunchbase.com/v4/data/entities/organizations/df662812-7f97-0b43-9d3e-12f64f504fbb?field_ids=%5B%22location_identifiers%22,%22num_employees_enum%22,%22last_funding_type%22,%22funding_total%22,%22ipo_status%22,%22rank_org_company%22,%22identifier%22%5D&card_ids=%5B%22about_short_description%22,%22social_fields%22,%22overview_timeline%22,%22org_similarity_list%22%5D' \
      --compressed
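The two-step lookup (domain to uuid, then uuid to entity data) can be sketched offline as below. This only builds the request body and URL and parses the sample response already shown; it makes no network calls. The helper names and the `min_score` threshold are assumptions for illustration, the meaning of “score” is still a guess, and these look like internal endpoints used by the extension, so they could change or be blocked at any time.

```python
import json
from urllib.parse import quote

BASE = "https://www.crunchbase.com/v4/data"


def bulk_match_body(domains):
    """Build the JSON body posted to /bulk_matches for a list of domains."""
    payload = {"matching_entities": [{"domain": d} for d in domains]}
    return json.dumps(payload, separators=(",", ":"))


def matched_uuids(response, min_score=0.0):
    """Map each matched input domain to its uuid, dropping low-scoring matches.

    min_score is a hypothetical cut-off; we do not know the score's scale.
    """
    return {
        m["input_domain"]: m["uuid"]
        for m in response.get("matched", [])
        if m.get("score", 0) >= min_score
    }


def entity_url(uuid, field_ids, card_ids):
    """Build the entity URL, encoding the ID lists the way the curl example does."""
    def encode(ids):
        # JSON array, percent-encoded, with commas left as-is.
        return quote(json.dumps(ids, separators=(",", ":")), safe=",")

    return (
        f"{BASE}/entities/organizations/{uuid}"
        f"?field_ids={encode(field_ids)}&card_ids={encode(card_ids)}"
    )


# The sample response from the note above.
sample = json.loads("""
{"matched": [{"input_domain": "www.facebook.com",
              "cb_url": "https://www.crunchbase.com/organization/facebook",
              "score": 2.5, "input_name": null,
              "uuid": "df662812-7f97-0b43-9d3e-12f64f504fbb"}],
 "unmatched": []}
""")

uuids = matched_uuids(sample)
url = entity_url(
    uuids["www.facebook.com"],
    ["num_employees_enum", "funding_total"],
    ["about_short_description"],
)
print(url)
```

Here only the employee-count and funding fields are requested, since those are the ones we are most likely to want.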

We would probably like the employee numbers and funding total. On the organisation page, the investments could be somewhat interesting as well, as could the “Hub tags”.

FMD is basically just contact info plus sometimes a rating from MARS (a defunct rating site).

Politifact seems to work on a fact-by-fact basis and totals up those facts. Somewhat useful, but SUPER US-centric. I imagine a lot of the data will be spread quite evenly across the scorecard, with the file-drawer effect in play a lot more than in other sources.

EU Transparency Register is VERY interesting, but in an interesting-to-me-only sort of way: https://ec.europa.eu/transparencyregister/public/consultation/displaylobbyist.do?id=498852846811-94 Not sure what data to present from it, BUT it would be good to link out to the registrations. The “Goals/remit of organisation” is probably the most relevant section, followed by fields of interest and annual costs (depending on what that actually means).

OpenSecrets is crazy! Its data is hosted on Google Sheets!!! Oh, that just seems to be the featured datasets, phew. The API looks interesting but super US-centric, and probably more useful if we were person-wise rather than company-wise. If we have an orgID we could get a few interesting bits of info: PACs, contributions, lobbying $s, party leaning etc. https://www.opensecrets.org/api/?method=orgSummary&output=doc It is limited to 200 calls a day, though I can’t imagine we would need to update it often. We can apply for bulk data access, but they might have an issue with republishing and with the fact we aren’t US-based: https://www.opensecrets.org/open-data/bulk-data-documentation https://www.opensecrets.org/bulk-data/signup
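For a sense of what the 200-calls-a-day limit means in practice, here is a back-of-the-envelope sketch. The one-call-per-organisation default and the helper names are assumptions; only the limit of 200 comes from the note above.

```python
import math

DAILY_CALL_LIMIT = 200  # limit quoted in the note above


def days_to_refresh(num_orgs, calls_per_org=1, daily_limit=DAILY_CALL_LIMIT):
    """Number of days needed to refresh every organisation once."""
    return math.ceil(num_orgs * calls_per_org / daily_limit)


def daily_batches(org_ids, calls_per_org=1, daily_limit=DAILY_CALL_LIMIT):
    """Split organisation IDs into per-day batches that stay within the limit."""
    per_day = daily_limit // calls_per_org
    if per_day < 1:
        raise ValueError("calls_per_org exceeds the daily limit")
    return [org_ids[i:i + per_day] for i in range(0, len(org_ids), per_day)]


# Hypothetical workload: 1000 organisations, one call each.
org_ids = [f"org-{n}" for n in range(1000)]
print(days_to_refresh(len(org_ids)))
print(len(daily_batches(org_ids)))
```

With a workload of that size a full refresh takes about a working week, which seems fine if the data rarely needs updating.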

Crossref is interesting. It would probably work best linking out, with a total of studies funded (if possible). I can’t get the search to work right now, but I imagine I’m just using it incorrectly (or the internet is messing with me).

God I love LittleSis, it’s so ugly and functional. Another good source if we reengineered the graph to not be based on Wikidata.

MAVISE EAO: not useful. OpenDOAR: I don’t think it is useful either. IANA and WHOIS aren’t useful beyond “Wow, these 10 companies have a top-level domain; what’s a top-level domain?”.

MNOPEDIA, i.e. Invisible Voice the Minnesota edition.

Pitchbook appears to be the same as DNB?????? Just reformatted and using more interesting and usable IDs with less useful information available without login :( https://pitchbook.com/profiles/company/11919-79

In conclusion: surprised at how many "open" things aren't open. Only a couple of the usable ones are low-hanging fruit for us, there are a few we could probably request access to, and nothing is super juicy unless we reengineer the graph.

ixt commented 1 year ago

http://caligraph.org/explore.html <- might be a more useful front to dbpedia

ixt commented 1 year ago

https://clearbit.com/logo <- logos by domain. Also of interest from them: https://dashboard.clearbit.com/docs#name-to-domain-api and https://dashboard.clearbit.com/docs#enrichment-api

InvisiblePlatform / rosetta

Potential NEW datasets #76