alephdata / followthemoney

Data model and processing tools for investigative entity data
https://followthemoney.tech
MIT License

Data Enrichment Strikes Back #33

Closed jcshea closed 4 years ago

jcshea commented 6 years ago

Build a revised version of corpint data enrichment, designed to automagically build company profiles for laundromat companies (as a first target). Features include:

pudo commented 6 years ago

Thanks for starting the issue. Let's move on this. I think it's mainly an issue of defining a nice set of interfaces that we can work with, and then filling them with functionality for the individual enrichment APIs.

Our goal should be to expose this functionality both as a gRPC service for Aleph, and as a command-line utility (which would likely run either against local CSV files, or a simple backend like SQLite) for use in investigations like the Laundromat.

Let's think about the API for this. We can even do it in a test-driven way, where we have a fake backend and a round of tests for it. Here's a little proposal:

a) We expect an entity to be a dict. Inside the dict is an id (which is a sha1 of something that uniquely identifies that entity), a schema (which defines the FtM type of the entity) and a dict of properties. Each entry in the properties is a valid FtM model property (cf. followthemoney/schema/*.yaml). Its value is a list of values.

b) Using this definition of an entity, we can define some API classes:


class EnricherResult(object):

    def __init__(self, score, entities):
        self.score = score
        self.entities = entities

class Enricher(object):

    def enrich_entity(self, entity):
        raise NotImplementedError()

    def enrich_query(self, query_string):
        raise NotImplementedError()

class OpenCorporatesEnricher(Enricher):

    def enrich_entity(self, entity):
        # query open corporates
        if entity['schema'] in ['Person', 'Company', 'LegalEntity', '...']:
            # do a directors' search
            for result in self.enrich_director(entity):
                yield result
        if entity['schema'] in ['Company', 'LegalEntity']:
            # do a company search
            for result in self.enrich_company(entity):
                yield result

    def enrich_company(self, entity):
        # make a query from entity['properties']['name'],
        # entity['properties']['jurisdiction'] etc.
        ...
        for res in response['companies']:
            result = EnricherResult(res['score'], [])
            # align the raw API record (not the empty result wrapper) to FtM
            ftm_company = self.align_oc_company(res)
            result.entities.append(ftm_company)
            for director in res['directors']:
                ftm_director = self.align_oc_director(director)
                result.entities.append(ftm_director)
                ftm_directorship = self.align_oc_directorship(director)
                result.entities.append(ftm_directorship)
            yield result

Does that make sense as an API?
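To make the test-driven idea a bit more concrete, here is a minimal sketch of an entity dict as described in (a), plus a fake backend that returns canned results. `make_entity_id`, `make_entity` and `FakeEnricher` are hypothetical names, and hashing the schema plus the first name is only an assumption about what "uniquely identifies" an entity.

```python
import hashlib


def make_entity_id(*parts):
    """SHA-1 over identifying parts; which parts to hash is an assumption."""
    key = "|".join(str(p).strip().lower() for p in parts)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def make_entity(schema, **properties):
    """Build an entity dict: id, schema, and properties as lists of values."""
    props = {}
    for name, value in properties.items():
        props[name] = list(value) if isinstance(value, (list, tuple)) else [value]
    first_name = props.get("name", [""])[0]
    return {
        "id": make_entity_id(schema, first_name),
        "schema": schema,
        "properties": props,
    }


class EnricherResult(object):
    def __init__(self, score, entities):
        self.score = score
        self.entities = entities


class FakeEnricher(object):
    """Hypothetical test backend: returns canned results, no network calls."""

    def __init__(self, canned):
        # canned maps a lower-cased name to a list of (score, entity) pairs
        self.canned = canned

    def enrich_entity(self, entity):
        for name in entity["properties"].get("name", []):
            for score, match in self.canned.get(name.lower(), []):
                yield EnricherResult(score, [match])


query = make_entity("Company", name="Example Holdings Ltd", jurisdiction="gb")
match = make_entity("Company", name="EXAMPLE HOLDINGS LIMITED", jurisdiction="gb")
enricher = FakeEnricher({"example holdings ltd": [(0.9, match)]})
results = list(enricher.enrich_entity(query))
```

A round of tests could then assert on `results` without ever touching a real enrichment API; a real backend would only need to honour the same `enrich_entity` contract.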

jcshea commented 6 years ago

focusing on the other stuff at present, but took a break to push a stable version of the BvD class set for commentary. OC is similar.

note that proposed API->FTM mappings are in an excel file in the base dir.

pudo commented 6 years ago

Ok, thanks for pushing this! Bunch of feedback:

jcshea commented 6 years ago

Enricher 1.1 pushed.
New in this version: the first version of the OC enricher class, and BvD results are now dicts that should align with FtM. Shareholders and GUOs in BvD are returned in separate dict lists and are mapped to entity/interval pairs. After looking at existing mappings and speaking to Jenya, I think this is the right play? If not, please advise.
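For concreteness, an entity/interval pair for one shareholder could look roughly like the sketch below. The record fields (`name`, `country`, `direct_pct`) and the `align_shareholder` helper are hypothetical, not the actual BvD response format, and the `Ownership` properties used here (`owner`, `asset`, `percentage`) should be checked against the FtM schema files.

```python
import hashlib


def make_id(*parts):
    """SHA-1 over the identifying parts (an assumed ID scheme)."""
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


def align_shareholder(company_id, record):
    """Map one shareholder record to an (entity, interval) pair of dicts."""
    owner = {
        "id": make_id("bvd", record["name"], record.get("country", "")),
        "schema": "LegalEntity",
        "properties": {"name": [record["name"]]},
    }
    if record.get("country"):
        owner["properties"]["country"] = [record["country"]]

    # The interval links the owner to the company it holds shares in.
    ownership = {
        "id": make_id("ownership", owner["id"], company_id),
        "schema": "Ownership",
        "properties": {"owner": [owner["id"]], "asset": [company_id]},
    }
    if record.get("direct_pct") is not None:
        ownership["properties"]["percentage"] = [str(record["direct_pct"])]
    return owner, ownership


company_id = make_id("bvd", "Example Holdings Ltd", "gb")
shareholders = [
    {"name": "Jane Doe", "country": "gb", "direct_pct": 60.0},
    {"name": "Acme Nominees", "direct_pct": None},
]
pairs = [align_shareholder(company_id, s) for s in shareholders]
```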

next steps: integrate datavault / FtM API results directly; move matching code out; get officers via OC (is this functionality restricted or something?... all my company results lack an 'Officers' section in the JSON); draft code to write results to datavault.

re commentary:

jcshea commented 6 years ago

Enricher 1.2 pushed. cancel the above question re OC Officer results: got that figured out & incorporated, as well as corporate groupings. note that i need to find some results with guo not None to build that piece out.

as a general prioritization thought, i'm sort of assuming that getting an enrichment run on the Entities Of The Hour and getting that written to DB is the priority, as opposed to building the proper API / micro-service architecture? so if we can agree on / stabilize an output format, we could run a round of enrichment for immediate use, no?

other misc thoughts: any appropriate spot for OC company filings?... seems like they'd be interesting where they exist.

jcshea commented 6 years ago

Enricher 1.3 pushed. building with the goal of enabling enrichment for structured DB entries of interest, so incorporated DB I/O as well as some table-structure prototyping for results.
been playing with some scheduling options as well. should be good for lowish-volume field testing into DB quite soon.

jcshea commented 6 years ago

Enricher 1.4 pretty well rdy. BvD testing opportunities have been a little scarce, and as soon as i can iron out one more little wrinkle it'll be good to go and will push a stable bvd class (hopefully today).
As you can see from the align class functions and/or the example results (zzenrich*), my output proposal is to write FtM-aligned dicts into separate tables (or CSVs for table output), one per retrieved schema, joinable by query_id. Next steps: adapt the current 'link' architecture to accept arbitrary matching models and write scores into the DB. The goal here would be to make the models modular: accept any matching model or models that take selected entity=>match attributes (i.e. name & jurisdiction for each) as input and output a match probability. The idea for Pybossa would be to in turn read the links and scores and submit the links for judgement, using the model scores to prioritize judgement (very high probability links should simply be a series of confirmations; queries with no results above a probability threshold can be sent to the back of the queue).
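A rough sketch of what a pluggable matching model and the judgement ordering could look like. Everything here is hypothetical (`NameJurisdictionMatcher`, `score_links`, `judgement_queue`, the 0.5 penalty, averaging across models); the similarity measure is just the standard library's `difflib.SequenceMatcher`, standing in for whatever model actually gets used.

```python
from difflib import SequenceMatcher


class NameJurisdictionMatcher(object):
    """Toy model: name similarity, discounted when jurisdictions differ."""

    def score(self, entity, match):
        a = entity["properties"].get("name", [""])[0].lower()
        b = match["properties"].get("name", [""])[0].lower()
        similarity = SequenceMatcher(None, a, b).ratio()
        ja = set(entity["properties"].get("jurisdiction", []))
        jb = set(match["properties"].get("jurisdiction", []))
        if ja and jb and not (ja & jb):
            similarity *= 0.5  # arbitrary penalty, for illustration only
        return similarity


def score_links(links, models):
    """Average all model scores for each (query_id, match_id, entity, match)."""
    scored = []
    for query_id, match_id, entity, match in links:
        scores = [m.score(entity, match) for m in models]
        scored.append((query_id, match_id, sum(scores) / len(scores)))
    return scored


def judgement_queue(scored, threshold=0.6):
    """Best links first; queries with no link at or above the threshold go last."""
    best = {}
    for query_id, _, score in scored:
        best[query_id] = max(score, best.get(query_id, 0.0))
    # False sorts before True, so queries with a strong candidate come first.
    return sorted(scored, key=lambda s: (best[s[0]] < threshold, -s[2]))


def ent(name, jurisdiction):
    props = {"name": [name], "jurisdiction": [jurisdiction]}
    return {"schema": "Company", "properties": props}


links = [
    ("q1", "m1", ent("Example Holdings Ltd", "gb"), ent("Example Holdings Ltd", "gb")),
    ("q1", "m2", ent("Example Holdings Ltd", "gb"), ent("Zebra Quarry", "pa")),
    ("q2", "m3", ent("Acme Trading LLP", "gb"), ent("Acme Trading LLP", "cy")),
]
queue = judgement_queue(score_links(links, [NameJurisdictionMatcher()]))
```

Any object with a `score(entity, match)` method can be dropped into the `models` list, which is the modularity the paragraph above is asking for; the scored tuples are what would be written to the DB alongside query_id.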

notes on re-organizing the repo below:

corpint/extract : keep / incorporate as is

corpint/export: keep / incorporate as is

corpint/enrich:

corpint/model:

webui : beyond my expertise

cli.py:

core.py :

pudo commented 6 years ago

Just for reference, this is now being developed here:

https://github.com/alephdata/followthemoney/tree/master/enrich

It's relatively stable working as a library or from the command line utility, so this may be a good time to think about how we'd want to include it in the platform.

jcshea commented 6 years ago

re platform integration, here's my pitch (some elements previously discussed):