alephdata / aleph

Search and browse documents and data; find the people and companies you look for.
http://docs.aleph.occrp.org
MIT License
2k stars 267 forks

BUG: Entities automatically generated from mentions in an investigation during cross reference #2994

Open tom-claessens opened 1 year ago

tom-claessens commented 1 year ago

Describe the bug When cross-referencing an investigation with a multitude of uploaded documents in Aleph, the mentions within the documents are automatically converted into entities.

To Reproduce Steps to reproduce the behavior:

  1. Create an investigation
  2. Upload a set of documents within the investigation, wait for them to be indexed and ingested
  3. Start cross-reference
  4. Mentions from the documents will now automatically be added to the entities within the investigation

Expected behavior Ideally, cross-referencing should happen without generating entities within the investigation, so that the cross-referencing consists of:

Aleph version Latest version. Problem is encountered within the Aleph instance of Follow the Money (NL).

Screenshots Example of unwanted generated entities: Screenshot from 2023-04-12 13-08-25

tillprochaska commented 1 year ago

Hi @tom-claessens, thanks for opening this issue!

  3. Start cross-reference
  4. Mentions from the documents will now automatically be added to the entities within the investigation

Just to clarify, when you say "cross-reference", did you only trigger the automatic cross-reference process by clicking the "compute" button in the cross-referencing section? Or did you also manually rate the cross-referencing results ("Same"/"Unsure"/"Different")?

tom-claessens-ftm commented 1 year ago

Hi @tillprochaska,

I think it happened in both situations. I'm not entirely sure, as it is something both my colleagues and I have encountered. I think most of us are not very tempted to rate all cross-reference results, as sometimes there are thousands of results to rate. Does this mean that Aleph is supposed to add new entities from the cross-reference results that were manually rated "Same"?

tillprochaska commented 1 year ago

I will need to reproduce the issue and get some more information from others, as I'm not super familiar with the feature. If this is only happening for xref matches that are rated manually, I could imagine that this is intended behavior. I'll get back to you when I have more information.

tillprochaska commented 1 year ago

I have been able to reproduce this issue:

  1. Uploaded a PDF document that contains names of companies
  2. Waited for Aleph to finish processing the document.
  3. Viewed the document and ensured Aleph had extracted the names of the companies as mentions.
  4. Navigated to the XREF section and manually triggered XREF.
  5. Waited for the XREF to complete.
  6. Searched for schema:Company within the investigation.
  7. The search results include the mentions extracted from the document.

When viewing these entities, you can actually see that they are still linked to the source document using the companiesMentioned/mentionedBy properties:

Screen Shot 2023-04-18 at 17 10 57

For further debugging, these logs may help to find the relevant parts of the source code that trigger this behavior. Note that "[Test] Entities generated from mentions" is the title of the investigation I created for testing.

Screen Shot 2023-04-18 at 17 15 32

tillprochaska commented 1 year ago

Additional context from @brrttwrks:

I too was able to recreate it, but only for investigations. I did not see the same behavior for datasets.

For the dataset, I did not get any results from the xref, nor were any extracted entities 'reified'.

Firstly, I think the behavior should be the same. At least that is my expectation. That it isn't happening for datasets means that xref for leaks and many of our bigger investigations is missing possible matches.

Also, if, in datasets, Aleph is automatically creating actual entities, then xref won't work from just one side, but in both directions. However, this means that dataset mentions currently aren't matched in either direction.

tillprochaska commented 1 year ago

I was able to confirm that the current behavior is indeed intended. It was implemented some time ago as an "experiment" with the expectation that there would be more iterations to refine the feature in the future, but that never happened.

The idea behind it was the following: When Aleph extracts mentions of names from a document and is then able to find similar Person/Company entities in other datasets (e.g. in a companies registry or census database), it is likely that that name is the name of a person or company, respectively.
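To illustrate the idea, here is a minimal toy sketch in Python. It is not Aleph's actual implementation: the `Entity` class, the `promote_mentions` function, the `difflib`-based similarity measure, and the 0.9 threshold are all hypothetical and chosen only to show the principle of promoting a mention to an entity once a similar entity exists in another dataset.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Entity:
    schema: str  # e.g. "Person" or "Company"
    name: str


def similarity(a: str, b: str) -> float:
    """Crude, case-insensitive name similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def promote_mentions(mentions, reference_entities, threshold=0.9):
    """Hypothetical helper: turn each mention into an Entity if it closely
    matches an entity from another dataset, borrowing that entity's schema."""
    promoted = []
    for mention in mentions:
        # Pick the most similar reference entity (None if there are none).
        best = max(
            reference_entities,
            key=lambda e: similarity(mention, e.name),
            default=None,
        )
        if best is not None and similarity(mention, best.name) >= threshold:
            promoted.append(Entity(schema=best.schema, name=mention))
    return promoted


# Made-up data: a tiny "companies registry" and two mentions from a document.
registry = [Entity("Company", "Acme Holdings Ltd"), Entity("Person", "Jane Doe")]
mentions = ["ACME Holdings Ltd", "Quarterly Report"]
promoted = promote_mentions(mentions, registry)
print(promoted)
```

In this sketch only the mention that matches a registry entry becomes an entity; the unmatched mention stays a plain mention.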

We do, however, understand that the current behavior is confusing and inconsistent and can lead to cluttered investigations, and we will consider adjusting or removing it.

tillprochaska commented 2 months ago

One additional small detail I just observed:

When cross-referencing a collection with mentions, entities are created as outlined in this thread. When I then delete the entity that was automatically created, the respective cross-referencing match is deleted as well (makes sense). When I re-run the cross-referencing, the mention is ignored, i.e., two cross-referencing runs with the same data lead to different results.