Data Enrichment Strikes Back

jcshea commented 6 years ago

Build a revised version of corpint data enrichment designed to automagically build company profiles for laundromat companies (as a first target). Features include:

Alignment of BvD & OC returns, where available (thinking complementary not overwriting)
Alignment with FTM schema, where not already in place (q for Pudo : is current corpint aligned?.. mostly?)
Model selection/filtering of primary-company api returns
Incorporation of corporate group api calls where available
Batch run scheduling with prioritization option for research requests (scheduled for off hours in the case of BvD) (work with Sunu to set this up?)
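To make the scheduling item concrete, here is a rough sketch of a priority queue where research requests jump ahead of batch jobs and BvD jobs are held until off hours. Everything in it is an assumption for illustration: the `EnrichmentQueue` class, the `RESEARCH`/`BATCH` levels, the `'bvd'`/`'oc'` source labels and the 20:00 to 06:00 window are hypothetical, not anything that exists in corpint.

```python
import heapq
from datetime import time

RESEARCH, BATCH = 0, 1  # hypothetical priority levels; lower is served first


def is_off_hours(now, start=time(20, 0), end=time(6, 0)):
    """True when `now` (a datetime.time) falls in the assumed overnight window."""
    return now >= start or now < end


class EnrichmentQueue(object):
    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker keeps FIFO order within a priority

    def add(self, priority, source, entity_id):
        heapq.heappush(self._heap, (priority, self._counter, source, entity_id))
        self._counter += 1

    def pop_runnable(self, now):
        """Pop the best job allowed to run at `now`, or None if there is none."""
        deferred, job = [], None
        while self._heap:
            item = heapq.heappop(self._heap)
            if item[2] == 'bvd' and not is_off_hours(now):
                deferred.append(item)  # BvD waits for the off-hours window
                continue
            job = item
            break
        for item in deferred:
            heapq.heappush(self._heap, item)
        return None if job is None else (job[2], job[3])


queue = EnrichmentQueue()
queue.add(BATCH, 'bvd', 'company-a')
queue.add(BATCH, 'oc', 'company-b')
queue.add(RESEARCH, 'oc', 'company-c')
first = queue.pop_runnable(time(12, 0))  # the research request is served first
```

During the day the BvD job stays in the queue; it only becomes runnable once `pop_runnable` is called with a time inside the off-hours window.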

pudo commented 6 years ago

Thanks for starting the issue. Let's move on this. I think it's mainly an issue of defining a nice set of interfaces that we can work with, and then filling them with functionality for the individual enrichment APIs.

Our goal should be to expose this functionality both as a gRPC service for Aleph, and as a command-line utility (which would likely run either against local CSV files, or a simple backend like SQLite) for use in investigations like the Laundromat.

Let's think about the API for this. We can even do it in a test-driven way, with a fake backend and a round of tests for it. Here's a little proposal:

a) We expect an entity to be a dict. Inside the dict is an id (which is a sha1 of something that uniquely identifies that entity), a schema (which defines the FtM type of the entity) and a dict of properties. Each key in the properties is a valid FtM model property (cf. followthemoney/schema/*.yaml). Its value is a list of values.

b) Using this definition of an entity, we can define some API classes:


class EnricherResult(object):
    """One scored API hit, with the FtM entities derived from it."""

    def __init__(self, score, entities):
        self.score = score
        self.entities = entities


class Enricher(object):
    """Interface that each enrichment API backend implements."""

    def enrich_entity(self, entity):
        raise NotImplementedError()

    def enrich_query(self, query_string):
        raise NotImplementedError()


class OpenCorporatesEnricher(Enricher):

    def enrich_entity(self, entity):
        # query open corporates
        if entity['schema'] in ['Person', 'Company', 'LegalEntity', '...']:
            # do a directors' search
            for result in self.enrich_director(entity):
                yield result
        if entity['schema'] in ['Company', 'LegalEntity']:
            # do a company search
            for result in self.enrich_company(entity):
                yield result

    def enrich_company(self, entity):
        # make a query from entity['properties']['name'], entity['jurisdictions'] etc.
        # (the request itself is elided here; `response` is the parsed API reply)
        ...
        for res in response['companies']:
            result = EnricherResult(res['score'], [])
            # align the raw API record, not the (still empty) result object
            ftm_company = self.align_oc_company(res)
            result.entities.append(ftm_company)
            for director in res['directors']:
                ftm_director = self.align_oc_director(director)
                result.entities.append(ftm_director)
                ftm_directorship = self.align_oc_directorship(director)
                result.entities.append(ftm_directorship)
            yield result

Does that make sense as an API?
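For the test-driven route, a fake backend that never touches the network could look roughly like the sketch below. This is only an illustration of the entity shape from (a) and the result class from (b): `make_entity`, `FakeEnricher`, the `'fake:'` key prefix and the canned score of 1.0 are hypothetical, not part of corpint or followthemoney.

```python
import hashlib


def make_entity(schema, key, **properties):
    """Build an entity dict in the shape described in (a).

    `key` is whatever uniquely identifies the entity; it is hashed with SHA-1.
    Every property value is stored as a list of values.
    """
    return {
        'id': hashlib.sha1(key.encode('utf-8')).hexdigest(),
        'schema': schema,
        'properties': {name: list(values) for name, values in properties.items()},
    }


class EnricherResult(object):
    def __init__(self, score, entities):
        self.score = score
        self.entities = entities


class FakeEnricher(object):
    """Hypothetical stand-in backend: echoes a company back as a single match."""

    def enrich_entity(self, entity):
        if entity['schema'] not in ('Company', 'LegalEntity'):
            return  # nothing to yield for other schemata
        names = entity['properties'].get('name', [])
        match = make_entity('Company', 'fake:' + entity['id'], name=names)
        yield EnricherResult(1.0, [match])


company = make_entity('Company', 'gb/01234567', name=['Example Ltd'])
results = list(FakeEnricher().enrich_entity(company))
```

Tests written against `FakeEnricher` would then pin down the interface (generator of results, each with a score and a list of entity dicts) before any real API client is filled in.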

jcshea commented 6 years ago

focusing on the other stuff at present, but took a break to push a stable version of the BvD class set for commentary. OC is similar.

note that proposed API->FTM mappings are in an excel file in the base dir.

pudo commented 6 years ago

Ok, thanks for pushing this! Bunch of feedback:

If we're gonna use this repo, I wonder if we should move the old corpint code out of the way, e.g. into an "old/" folder. Just so we're clear which of the bits are part of the re-vamp, and what can go away in the end. I guess we want to keep the CLI in some form.
The client-side match scoring code is not going to live here. @Rinatius is currently making a stub home for it in aleph. Let's just return the score results from the respective APIs as they come in. This is just about reliably doing the API requests and translating stuff into FtM.
I also wouldn't make a random cut-off on ['Match_Score'] > 0.2; I'm not sure this is the right place to make that call.
Regarding the FtM Mapping:
- I think company_type for OC goes into LegalEntity:legalForm
- Let's make new model fields for OC URI and BvD ID on LegalEntity
- There's a section down in the spreadsheet of FtM unmapped entities. Those would all be separate FtM entities, i.e. Directorship, Ownership etc. Check out one of the mappings for a company registry to get a sense of how these models work.
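As a rough illustration of the mapping feedback, the sketch below puts company_type into legalForm and emits the directorship as its own entity rather than as a field on the company. The input keys, the FtM property names other than legalForm, the `opencorporatesUrl` field (standing in for the proposed new OC URI field) and the helper names are all assumptions made for this example; check them against the actual API returns and the FtM schema files.

```python
import hashlib


def entity_id(*parts):
    """SHA-1 over the identifying parts, giving the entity's id."""
    return hashlib.sha1('|'.join(parts).encode('utf-8')).hexdigest()


def align_oc_company(record):
    """Map one OpenCorporates-style company record to an FtM-style Company dict."""
    mapping = {
        'name': 'name',
        'company_number': 'registrationNumber',
        'jurisdiction_code': 'jurisdiction',
        'company_type': 'legalForm',  # per the feedback above
        'opencorporates_url': 'opencorporatesUrl',  # hypothetical new model field
    }
    properties = {}
    for source_key, ftm_prop in mapping.items():
        value = record.get(source_key)
        if value:  # skip missing or empty values
            properties[ftm_prop] = [value]
    return {
        'id': entity_id('oc', record['jurisdiction_code'], record['company_number']),
        'schema': 'Company',
        'properties': properties,
    }


def align_oc_directorship(company, director):
    """Link a director to a company as a separate Directorship entity."""
    return {
        'id': entity_id('directorship', director['id'], company['id']),
        'schema': 'Directorship',
        'properties': {
            'director': [director['id']],
            'organization': [company['id']],
        },
    }


record = {
    'name': 'Example Ltd',
    'company_number': '01234567',
    'jurisdiction_code': 'gb',
    'company_type': 'Private Limited Company',
    'opencorporates_url': None,
}
company = align_oc_company(record)
director = {
    'id': entity_id('oc-officer', '42'),
    'schema': 'Person',
    'properties': {'name': ['Jane Doe']},
}
link = align_oc_directorship(company, director)
```

The point of the separate `Directorship` dict is that the relationship carries its own id and properties, so it can later hold dates or roles without touching either endpoint.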

jcshea commented 6 years ago

Enricher 1.1 pushed.
New in this one: the first version of the OC enricher class, and BvD results are now dicts that should align with FtM. Shareholders and GUOs in BvD are returned in separate dict lists and are mapped to entity/interval pairs. After looking at existing mappings and speaking to Jenya, I think this is the right play? else advise.

next steps: integrate datavault / FtM API results directly; move matching code out; get officers via OC (is this functionality restricted or something?... my company returns all lack an 'Officers' section in the json); draft code to write results to datavault.

re commentary:

reorganizing repo: willdo, once i get a little more of a handle on what everything does...
matching code: left it in for now... where does the stub live?
score cutoff: nbd but the cutscore there is the result of some misclassification-rate analysis that lives elsewhere... will update when the code has a proper place to live...
FtM Mappings : spreadsheet is updated. propose a 'thirdpartyId' field for OC_URI/BvD_ID

jcshea commented 6 years ago

Enricher 1.2 pushed. cancel the above question re OC Officer results -got that figured & incorporated...as well as corporate groupings... note that i need to find some results with guo not None to build that piece out.

as a general prioritization thought, i'm sort of assuming getting an enrichment run on the Entities Of The Hour and getting that written to DB is the priority?.. that as opposed to building the proper API / micro-service architecture. so if we can agree /stabilize an output format, we could run a round of enrichment for immediate use, no?

other misc thoughts: any appropriate spot for OC company filings?... seems like they'd be interesting where they exist.

jcshea commented 6 years ago

Enricher 1.3 pushed. building with the goal of enabling enrichment for structured DB entries of interest, so incorporated DB I/O as well as some table-structure prototyping for results.
been playing with some scheduling options as well. should be good for lowish-volume field testing into DB quite soon.

jcshea commented 6 years ago

Enricher 1.4 pretty well rdy. BvD testing opportunities have been a little scarce, and as soon as i can iron out one more little wrinkle it'll be good to go and will push a stable bvd class (hopefully today).
As you can see from the align class functions and/or the example results (zzenrich*), my output proposal is to write FtM-aligned dicts separately into tables (or CSVs for table output) for each retrieved schema, which are joinable by query_id. Next steps: adapt the current 'link' architecture to accept arbitrary matching models and write scores into the db. The goal here would be to make the models modular, accepting any matching model or models that take selected entity=>match attributes as input (i.e. name & jurisdiction for each) and output a match probability. The idea for Pybossa would be, in turn, to read the links and scores and submit the links for judgement, using the model scores to prioritize judgement (very high probabilities should simply be a series of confirmations; queries with no results above a probability threshold can be sent to the back of the queue).
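A minimal sketch of that output layout, assuming SQLite as the store: one table per retrieved schema, each row carrying the query_id so results from the same enrichment query can be joined back together. The `enrich_` table prefix, the column names and the JSON-encoded properties column are assumptions for illustration, not the layout of the actual zzenrich* output.

```python
import json
import sqlite3


def write_results(conn, query_id, entities):
    """Write each entity to a table named after its schema, keyed by query_id."""
    for entity in entities:
        table = 'enrich_' + entity['schema'].lower()
        # table names cannot be bound as parameters, so validate before formatting
        if not table.isidentifier():
            raise ValueError('unexpected schema name: %r' % entity['schema'])
        conn.execute(
            'CREATE TABLE IF NOT EXISTS %s '
            '(query_id TEXT, entity_id TEXT, properties TEXT)' % table
        )
        conn.execute(
            'INSERT INTO %s (query_id, entity_id, properties) '
            'VALUES (?, ?, ?)' % table,
            (query_id, entity['id'], json.dumps(entity['properties'])),
        )
    conn.commit()


conn = sqlite3.connect(':memory:')
write_results(conn, 'q1', [
    {'id': 'c1', 'schema': 'Company', 'properties': {'name': ['Example Ltd']}},
    {'id': 'p1', 'schema': 'Person', 'properties': {'name': ['Jane Doe']}},
])
# everything retrieved for one query can be joined on query_id
rows = conn.execute(
    'SELECT c.entity_id, p.entity_id FROM enrich_company c '
    'JOIN enrich_person p ON c.query_id = p.query_id'
).fetchall()
```

Storing properties as a JSON blob keeps the sketch short; flattening each FtM property into its own column would make the tables easier to query but ties the table definition to the schema.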

notes on re-organizing the repo below:

corpint/extract : keep / incorporate as is

loads csv & google sheets as input
add sql extract (already written test version in 'get_seed_companies.py')

corpint/export: keep / incorporate as is

exports to table (unwritten!)
exports to neo4j == todo: experiment with test entities
add sql export (already written test version in class definitions func write_to_db : will xfer here once restructure arranged)

corpint/enrich:

aleph.py : keep? ; n/a for spin enrichment.
gmaps.py : seems like it's probably unfinished? never used?
wikidata & wikipedia.py : not terribly useful in the majority of our context, i assume; never used?
opencorporates.py : obsolete; never used?
bvdorbis.py: obsolete

corpint/model:

mapping & project : structure for entity linking. keep / adapt.
emitter: generate db entries for *. adapt?
--- seems to me some of the functionality here is a bit superfluous?.. i think of the entities themselves owning the 'Origin' level associations, which will be recorded in the entity-tables, or the like, so these parameters aren't needed at the enricher level? similarly, i don't think generation of uid's at enrichment should be needed as entities should already have id's and api results will come with identifiers (bvd_id; OC_url; etc)
the rest : pre-FtM, so retire?

webui : beyond my expertise

cli.py:

works with webui i'm assuming?

core.py :

config stuff : keep / adapt

pudo commented 6 years ago

Just for reference, this is now being developed here:

https://github.com/alephdata/followthemoney/tree/master/enrich

It's relatively stable working as a library or from the command line utility, so this may be a good time to think about how we'd want to include it in the platform.

jcshea commented 6 years ago

re platform integration, here's my pitch (some elements previously discussed):

rank user aleph queries from 'audit', grouped by a time window N (say a week)
alternately give users an 'enrich' option for their company/person query or case file
queue from one or both of the above mechanisms for enrichment, first running the person/company classifier to infer schemas
notify users of enrichment results and prompt them to identify matches when viewing results, writing decisions to DB (only notify when a match is probable?)
once decided, enrichment results can be folded into the aleph entity (though i assume this will first be added as a function of the flask-research interface)
decisions can also be plugged directly into a match-model retraining loop, reducing the need (or at least the complexity/number of options) for human decisioning over time
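The first and fourth steps of the pitch could be sketched roughly as follows. The shape of the audit rows (query text plus timestamp), the case-insensitive grouping, and the 0.8 notification threshold are all placeholders chosen for illustration; the real 'audit' data and any sensible threshold would need to be checked.

```python
from collections import Counter
from datetime import datetime, timedelta


def rank_queries(audit_rows, now, window=timedelta(days=7)):
    """Return query strings seen within `window` of `now`, most frequent first.

    `audit_rows` is an iterable of (query_text, timestamp) pairs (assumed layout).
    """
    cutoff = now - window
    counts = Counter(
        text.strip().lower() for text, ts in audit_rows if ts >= cutoff
    )
    return [text for text, _ in counts.most_common()]


def probable_matches(scored_links, threshold=0.8):
    """Keep only (link, score) pairs worth notifying a user about, best first."""
    keep = [link for link in scored_links if link[1] >= threshold]
    return sorted(keep, key=lambda link: link[1], reverse=True)


now = datetime(2018, 6, 15)
audit = [
    ('Acme Ltd', datetime(2018, 6, 14)),
    ('acme ltd', datetime(2018, 6, 13)),
    ('Foo Corp', datetime(2018, 6, 12)),
    ('Old Query', datetime(2018, 5, 1)),  # outside the one-week window
]
to_enrich = rank_queries(audit, now)
to_notify = probable_matches([('a', 0.5), ('b', 0.95), ('c', 0.85)])
```

User decisions on the notified links would then be the labelled data for the retraining loop mentioned in the last step.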

alephdata / followthemoney

Data Enrichment Strikes Back #33