Closed jcshea closed 4 years ago
Thanks for starting the issue. Let's move on this. I think it's mainly an issue of defining a nice set of interfaces that we can work with, and then filling them with functionality for the individual enrichment APIs.
Our goal should be to expose this functionality both as a gRPC service for Aleph and as a command-line utility (which would likely run against either local CSV files or a simple backend like SQLite) for use in investigations like the Laundromat.
Let's think about the API for this. We can even do it in a test-driven way, with a fake backend and a round of tests for it. Here's a little proposal:
a) We expect an `entity` to be a dict. Inside the dict is an `id` (which is a sha1 of something that uniquely identifies that entity), a `schema` (which defines the FtM type of the entity) and a dict of `properties`. Each entry in the `properties` is a valid FtM model property (cf. `followthemoney/schema/*.yaml`). Its value is a list of values.
b) Using this definition of an `entity`, we can define some API classes:
    class EnricherResult(object):
        def __init__(self, score, entities):
            self.score = score
            self.entities = entities


    class Enricher(object):
        def enrich_entity(self, entity):
            raise NotImplementedError()

        def enrich_query(self, query_string):
            raise NotImplementedError()


    class OpenCorporatesEnricher(Enricher):
        def enrich_entity(self, entity):
            # query open corporates
            if entity['schema'] in ['Person', 'Company', 'LegalEntity', '...']:
                # do a directors' search
                for result in self.enrich_director(entity):
                    yield result
            if entity['schema'] in ['Company', 'LegalEntity']:
                # do a company search
                for result in self.enrich_company(entity):
                    yield result

        def enrich_company(self, entity):
            # make a query from entity['properties']['name'], entity['jurisdictions'] etc.
            ...
            for res in response['companies']:
                result = EnricherResult(res['score'], [])
                # align the raw OC record (res), not the still-empty result wrapper
                ftm_company = self.align_oc_company(res)
                result.entities.append(ftm_company)
                for director in res['directors']:
                    ftm_director = self.align_oc_director(director)
                    result.entities.append(ftm_director)
                    ftm_directorship = self.align_oc_directorship(director)
                    result.entities.append(ftm_directorship)
                yield result
Does that make sense as an API?
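To make the test-driven idea concrete, here is a minimal sketch of an entity in the proposed shape plus a fake backend to test against. The `make_entity` helper, the `FakeEnricher` class, and the exact-name matching rule are assumptions for illustration only; they are not part of any existing code.

```python
import hashlib


def make_entity(schema, key, **properties):
    """Build an entity dict in the proposed shape (hypothetical helper)."""
    return {
        # sha1 of something that uniquely identifies the entity
        "id": hashlib.sha1(key.encode("utf-8")).hexdigest(),
        "schema": schema,
        # every property value is a list of values
        "properties": {name: list(values) for name, values in properties.items()},
    }


class EnricherResult(object):
    def __init__(self, score, entities):
        self.score = score
        self.entities = entities


class FakeEnricher(object):
    """Stand-in backend for tests: matches on exact, case-insensitive name."""

    def __init__(self, records):
        self.records = records

    def enrich_entity(self, entity):
        names = {n.lower() for n in entity["properties"].get("name", [])}
        for record in self.records:
            record_names = {n.lower() for n in record["properties"].get("name", [])}
            if names & record_names:
                yield EnricherResult(1.0, [record])


acme = make_entity("Company", "fake:acme", name=["ACME Ltd"], jurisdiction=["gb"])
query = make_entity("Company", "query:1", name=["Acme Ltd"])
results = list(FakeEnricher([acme]).enrich_entity(query))
```

A real enricher would replace the in-memory `records` list with calls to the remote API, while tests could keep running against the fake.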
focusing on the other stuff at present, but took a break to push a stable version of the BvD class set for commentary. OC is similar.
note that proposed API->FTM mappings are in an excel file in the base dir.
Ok, thanks for pushing this! Bunch of feedback:
`['Match_Score'] > 0.2`, not sure this is the right place to make that call.
`Directorship`, `Ownership` etc. Check out one of the mappings for companies registry to get a sense of how these models work.
Enricher 1.1 pushed.
newness is the first version of the OC enricher class and BvD results are now dicts that should align with FtM. Shareholders and GuO's in BvD are returned in separate dict lists and are mapped to entity/interval pairs. After looking at existing mappings and speaking to Jenya, I think this is the right play? else advise.
next steps: integrate datavault / FtM API results directly; move matching code out; get officers via OC (is this functionality restricted or something?... my company returns all lack an 'Officers' section in the json); draft code to write results to datavault.
re commentary:
Enricher 1.2 pushed. cancel the above question re OC Officer results -got that figured & incorporated...as well as corporate groupings... note that i need to find some results with guo not None to build that piece out.
as a general prioritization thought, i'm sort of assuming getting an enrichment run on the Entities Of The Hour and getting that written to DB is the priority?.. that as opposed to building the proper API / micro-service architecture. so if we can agree /stabilize an output format, we could run a round of enrichment for immediate use, no?
other misc thoughts: any appropriate spot for OC company filings?... seems like they'd be interesting where they exist.
Enricher 1.3 pushed.
building with the goal of enabling enrichment for structured DB entries of interest, so incorporated DB I/O as well as some table-structure prototyping for results.
been playing with some scheduling options as well. should be good for lowish-volume field testing into DB quite soon.
Enricher 1.4 pretty well rdy. BvD testing opportunities have been a little scarce, and as soon as i can iron out one more little wrinkle it'll be good to go and will push a stable bvd class (hopefully today).
As you can see from the align class functions and/or the example results (zzenrich*) my output proposal is to write FtM-aligned dicts separately into tables (or csv's for table output) for each retrieved schema, which are joinable by query_id.
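A rough sketch of that output layout, assuming SQLite as the store: one table per retrieved schema, each row tagged with the `query_id` that produced it. The `enrich_` table prefix, the column names, and the `write_results` function are hypothetical, chosen only to illustrate the join.

```python
import json
import sqlite3


def write_results(conn, query_id, entities):
    """Write each entity into a table named after its schema, tagged with query_id."""
    for entity in entities:
        schema = entity["schema"]
        # table names cannot be bound as parameters, so validate before formatting
        if not schema.isalnum():
            raise ValueError("unexpected schema name: %r" % schema)
        table = "enrich_" + schema.lower()
        conn.execute(
            "CREATE TABLE IF NOT EXISTS %s "
            "(query_id TEXT, entity_id TEXT, properties TEXT)" % table
        )
        conn.execute(
            "INSERT INTO %s (query_id, entity_id, properties) VALUES (?, ?, ?)" % table,
            (query_id, entity["id"], json.dumps(entity["properties"])),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
write_results(conn, "q1", [
    {"id": "c1", "schema": "Company", "properties": {"name": ["ACME Ltd"]}},
    {"id": "d1", "schema": "Directorship", "properties": {"role": ["director"]}},
])
# the per-schema tables join back together on query_id
rows = conn.execute(
    "SELECT c.entity_id, d.entity_id FROM enrich_company AS c "
    "JOIN enrich_directorship AS d ON c.query_id = d.query_id"
).fetchall()
```

Storing `properties` as a JSON string keeps the sketch short; flattening each property into its own column would suit CSV output better.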
Next steps: adapt current 'link' architecture to accept arbitrary matching models and write scores into db. the goal here would be to make the models modular and accept any matching model or models that would input selected entity=>match attributes (ie name & jurisdiction for each) and output a match probability
the idea for Pybossa would be to in turn read the links and scores and submit the links for judgement, using the model scores to prioritize judgement (very high probability links should simply be a series of confirmations; queries with no results above a probability threshold can be sent to the back of the queue)
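One way the modular scoring and queue ordering could fit together, as a sketch only: any object with a `score(entity, match)` method returning a probability-like number counts as a model. The `NameJurisdictionModel` class, the `prioritize` function, and the 0.5 threshold are made up for illustration, and the toy model uses plain string similarity from `difflib` rather than a trained matcher.

```python
from difflib import SequenceMatcher


class NameJurisdictionModel(object):
    """Toy matching model: name similarity, halved if jurisdictions differ."""

    def score(self, entity, match):
        name_a = entity.get("name", "").lower()
        name_b = match.get("name", "").lower()
        similarity = SequenceMatcher(None, name_a, name_b).ratio()
        if entity.get("jurisdiction") != match.get("jurisdiction"):
            similarity *= 0.5
        return similarity


def prioritize(links, model, threshold=0.5):
    """Order queries for judgement; links are (query_id, entity, match) tuples."""
    by_query = {}
    for query_id, entity, match in links:
        by_query.setdefault(query_id, []).append((model.score(entity, match), match))

    def best(query_id):
        return max(score for score, _ in by_query[query_id])

    # queries with at least one result above the threshold go first, best score first
    front = sorted((q for q in by_query if best(q) >= threshold), key=best, reverse=True)
    # queries with nothing above the threshold go to the back of the queue
    back = [q for q in by_query if best(q) < threshold]
    return [
        (q, sorted(by_query[q], key=lambda pair: pair[0], reverse=True))
        for q in front + back
    ]


links = [
    ("q1", {"name": "Acme Ltd", "jurisdiction": "gb"},
           {"name": "Zebra Holdings", "jurisdiction": "pa"}),
    ("q2", {"name": "Acme Ltd", "jurisdiction": "gb"},
           {"name": "ACME Ltd", "jurisdiction": "gb"}),
]
queue = prioritize(links, NameJurisdictionModel())
```

Swapping in a different model only requires another class with the same `score` method, which is the modularity the plan above is after.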
notes on re-organizing the repo below:
corpint/extract : keep / incorporate as is
corpint/export: keep / incorporate as is
corpint/enrich:
corpint/model:
webui : beyond my expertise
cli.py:
core.py :
Just for reference, this is now being developed here:
https://github.com/alephdata/followthemoney/tree/master/enrich
It's relatively stable, working both as a library and from the command-line utility, so this may be a good time to think about how we'd want to include it in the platform.
re platform integration, here's my pitch (some elements previously discussed):
Build a revised version of corpint data enrichment, designed to automagically build company profiles for laundromat companies (as a first target). Features include: