alephdata / followthemoney

Data model and processing tools for investigative entity data
https://followthemoney.tech
MIT License

Topics #163

Closed pudo closed 4 years ago

pudo commented 4 years ago

We've been discussing adding topics as a controlled vocabulary to FtM for a while. The idea is that while we have schemata (Person, Company), these are very neutral and often don't capture the investigative semantics of an entity. In that sense, topics would be like adjectives: grammatical qualifiers on the knowledge graph.

This draft list is based on the categorisation systems from a few leaked due diligence lists, with personal flourishes:

CRIME
    FRAUD
    CYBER
    FINANCIAL
    THEFT
    WAR
    TERROR
    TRAFFICKING
        DRUGS
        HUMANS
        SMALL THINGS MADE FROM STRAW
CORPORATE
    OFFSHORE
GOVERNMENT
    NATIONAL
    STATE
    MUNICIPAL
    SOE (STATE OWNED)
    PARTY
FINANCIAL SERVICES
    BANK
    FUND
ROLES
    POLITICIAN
    ASSOCIATE
    JUDGE
    DIPLOMAT
    LAWYER
    SPY
    JOURNALIST
    ACTIVIST
UNION
MILITARY
RELIGION
SANCTIONING
    FROZEN
    BANNED
    INVESTIGATED
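
To make the hierarchy concrete, here is a minimal sketch of how part of this draft list could be held as a nested structure and flattened into dotted codes. The dotted lower-case naming, the `DRAFT_TOPICS` constant and both helper functions are assumptions for illustration, not anything that exists in FtM, and only a subset of the list is encoded.

```python
# Hypothetical encoding of a subset of the draft list above.
# The dotted, lower-case codes are an assumption, not an FtM convention.
DRAFT_TOPICS = {
    "CRIME": {
        "FRAUD": {},
        "CYBER": {},
        "TRAFFICKING": {"DRUGS": {}, "HUMANS": {}},
    },
    "CORPORATE": {"OFFSHORE": {}},
    "SANCTIONING": {"FROZEN": {}, "BANNED": {}, "INVESTIGATED": {}},
}


def flatten(tree, prefix=""):
    """Yield a dotted, lower-case code for every node in the tree."""
    for name, children in tree.items():
        code = f"{prefix}.{name.lower()}" if prefix else name.lower()
        yield code
        yield from flatten(children, code)


# The closed set of allowed values, i.e. the controlled vocabulary.
TOPIC_CODES = set(flatten(DRAFT_TOPICS))


def ancestors(code):
    """Return the code and all of its parent codes, most specific first."""
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]
```

With codes like these, an entity tagged `crime.trafficking.drugs` could also be matched by a filter on `crime`, because `ancestors()` recovers the broader topics.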

zufanka commented 4 years ago

Sounds like a very valuable thing to have! I have missed this conversation, but these are a couple of things that come to mind:

  1. Who would choose which topics to use? Would this be a manual process?
  2. What purpose or use cases would they serve? (Not doubting there is one.)
  3. Would the topics be updated somehow? Or would they be a point in time (have a date)?

In CRIME, I'm missing something that would encompass murder (I'm deep in my own topic right now).

I also wonder whether we are missing a neutral category such as CIVILIAN (?) and categories for 'our people': ACTIVIST and/or JOURNALIST. This also depends on the use case, but it leads me to:

  4. If we are to assign these topics, is it ok for entities to have none?

Doing a test with real data would surely lead to more questions.

pudo commented 4 years ago

@zufanka Great feedback! Let me respond point by point:

  1. I think in personal datasets, it'd be the user -- otherwise they'd be part of the mapping.
  2. Just adding meaningful tagging to the system; topics would make for great facets to filter the data.
  3. If we have that info, we can usually model it more specifically (e.g. as Memberships etc.).
  4. Absolutely, that would probably be 99% of entities.
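
As a rough illustration of point 2, here is a sketch of facet counts and a topic filter over a handful of made-up entities. The dict layout, the topic codes and both function names are hypothetical and not the FtM API; note that most entities can carry no topics at all, as in point 4.

```python
from collections import Counter

# Made-up sample data; real entities would come from a mapping or a user.
entities = [
    {"id": "a", "schema": "Person", "topics": ["roles.politician"]},
    {"id": "b", "schema": "Company", "topics": ["corporate.offshore"]},
    {"id": "c", "schema": "Person", "topics": []},  # untagged, the common case
    {"id": "d", "schema": "Person", "topics": ["roles.politician", "sanctioning.frozen"]},
]


def topic_facets(entities):
    """Count how many entities carry each topic."""
    return Counter(t for e in entities for t in e.get("topics", []))


def filter_by_topic(entities, topic):
    """Return the ids of all entities tagged with the given topic."""
    return [e["id"] for e in entities if topic in e.get("topics", [])]
```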

I've added journalists and activists and grouped things a bit more.

augusto-herrmann commented 4 years ago

Would it make sense to use an existing controlled vocabulary that already contains topics, such as Eurovoc, as a default topic list?

Or are you thinking of using some kind of machine learning technique (such as Latent Dirichlet Allocation for topic modelling) to detect topics and build the list automatically from the text?

pudo commented 4 years ago

I think Eurovoc is an interesting reference for this, thanks! I don't think it covers the same topics we're interested in perfectly, but it feels like it could be a valuable reference.

Regarding auto-detecting the topics, I am not sure what's feasible. There might be some heuristics-based stuff we could pull off quite easily, such as tagging all companies incorporated in a specific set of jurisdictions as offshore companies (easy for BVI, more controversial for the Netherlands or Delaware). But in general I would like to capture the domain expertise of our reporters more than trying to retroactively label some k-means result or such...
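
The jurisdiction heuristic could look something like the sketch below. It assumes entities are plain dicts with hypothetical `schema`, `jurisdiction` and `topics` keys, and a hypothetical `corporate.offshore` code; none of this is FtM's actual API. Which countries go into the set is exactly the editorial call described above, so only the BVI is listed.

```python
# Hypothetical jurisdiction list; which countries belong here is an
# editorial decision. "vg" is the ISO 3166-1 code for the British Virgin Islands.
OFFSHORE_JURISDICTIONS = {"vg"}


def tag_offshore(entity, jurisdictions=OFFSHORE_JURISDICTIONS):
    """Return the entity's topics, plus the offshore tag if the heuristic matches."""
    topics = set(entity.get("topics", []))
    is_company = entity.get("schema") == "Company"
    in_listed_place = entity.get("jurisdiction", "").lower() in jurisdictions
    if is_company and in_listed_place:
        topics.add("corporate.offshore")  # assumed topic code
    return topics
```

Passing a different `jurisdictions` set per dataset would let reporters decide case by case whether, say, Delaware counts.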

pudo commented 4 years ago

Let's close this now and add more topics as needed.