pudo closed this 4 years ago
Sounds like a very valuable thing to have! I have missed this conversation, but these are a couple of things that come to mind:
In CRIME I miss something that would encompass murder (deep in my own topic now).
I also wonder if we aren't missing a neutral category such as CIVILIAN (?) and categories for 'our people': ACTIVIST and/or JOURNALIST. This also depends on the use case, which leads me to the next point:
Doing a test with real data would surely lead to more questions.
@zufanka Great feedback! Let me respond point by point:
I've added journalists and activists and grouped things a bit more.
Would it make sense to use an existing controlled vocabulary that already does contain topics, such as Eurovoc, as a default topic list?
Or are you thinking of using some kind of machine learning technique (such as Latent Dirichlet Allocation for topic modelling) to detect topics and build the list automatically from the text?
I think Eurovoc is an interesting reference for this, thanks! I don't think it covers the same topics we're interested in perfectly, but it feels like it could be a valuable reference.
Regarding auto-detecting the topics, I am not sure what's feasible. There might be some heuristics-based stuff we could pull off quite easily, such as tagging all companies incorporated in a specific set of jurisdictions as offshore companies (easy for BVI, more controversial for the Netherlands or Delaware). But in general I would like to capture the domain expertise of our reporters more than trying to retroactively label some k-means result or such...
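As a rough illustration of the jurisdiction heuristic mentioned above, here is a minimal sketch. It assumes entities are plain dicts with a `schema` and multi-valued `properties`; the `OFFSHORE_JURISDICTIONS` set, the `tag_offshore` helper and the `offshore` topic name are all hypothetical and not part of any existing FtM API.

```python
# Hypothetical set of jurisdiction codes treated as offshore. Which codes
# belong here is an editorial decision (BVI is easy, others are contested).
OFFSHORE_JURISDICTIONS = {"vg", "ky", "pa"}


def tag_offshore(entity: dict) -> dict:
    """Return a copy of the entity, adding an 'offshore' topic if it qualifies.

    Assumes a dict with a 'schema' string and a 'properties' mapping of
    property names to lists of values. This layout is an assumption made
    for this sketch only.
    """
    props = entity.get("properties", {})
    jurisdictions = {code.lower() for code in props.get("jurisdiction", [])}
    topics = set(props.get("topics", []))

    # Only companies in one of the listed jurisdictions receive the tag.
    if entity.get("schema") == "Company" and jurisdictions & OFFSHORE_JURISDICTIONS:
        topics.add("offshore")

    return {**entity, "properties": {**props, "topics": sorted(topics)}}


bvi_company = {"schema": "Company", "properties": {"jurisdiction": ["VG"]}}
tagged = tag_offshore(bvi_company)
```

A rule like this only covers the uncontroversial cases; anything that needs judgement would still come from reporters assigning topics by hand.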
Let's close this now and add more topics as needed.
We've been discussing for a while adding topics as a controlled vocabulary to FtM. The idea is that while we have schema (Person, Company), these are very neutral and often don't capture the investigative semantics of an entity. In that sense, topics would be like adjectives: grammatical qualifiers on the knowledge graph.
This draft list is based on the categorisation systems from a few leaked due diligence lists, with personal flourishes: