alephdata / aleph

Search and browse documents and data; find the people and companies you look for.
http://docs.aleph.occrp.org
MIT License
2.04k stars, 272 forks

Documents inexplicably orphaned #1581

Open mattcg opened 3 years ago

mattcg commented 3 years ago

This is a bit difficult to reproduce, and I have tried debugging it and gotten nowhere. Periodically, some documents that sit deep within a directory hierarchy show up directly at the root of the dataset as copies of the originals, orphaned from their parent directory. After I delete these orphan documents, some event (a re-index, re-ingest or upgrade) seems to trigger their reappearance.

In other instances, the orphans are not actual documents but empty 'Table' documents. Again, when deleted they reappear. If I were to guess, I'd imagine it's some race condition, such as an attempt to index the child before the parent document is indexed, but this is just an uneducated guess.
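To make the symptom concrete, here is a minimal sketch of how such orphans could be flagged in an exported list of document records. It does not use Aleph's API: the record shape (`id`, `parent`, `content_hash`) and the function name `find_suspect_orphans` are assumptions made purely for illustration.

```python
def find_suspect_orphans(docs):
    """Flag records that look orphaned.

    `docs` is an iterable of dicts with the hypothetical keys
    'id', 'parent' (None for root-level entries) and 'content_hash'.
    """
    docs = list(docs)
    known_ids = {d["id"] for d in docs}
    # Hashes of documents that live somewhere below the root.
    nested_hashes = {
        d["content_hash"]
        for d in docs
        if d.get("parent") is not None and d.get("content_hash")
    }
    suspects = []
    for d in docs:
        parent = d.get("parent")
        if parent is None:
            # Root-level entry whose content also exists deeper in the tree.
            if d.get("content_hash") in nested_hashes:
                suspects.append((d["id"], "root-level duplicate"))
        elif parent not in known_ids:
            # Entry pointing at a parent that is not in the export.
            suspects.append((d["id"], "missing parent"))
    return suspects


sample = [
    {"id": "dir", "parent": None, "content_hash": None},
    {"id": "a", "parent": "dir", "content_hash": "h1"},
    {"id": "a-copy", "parent": None, "content_hash": "h1"},
    {"id": "table", "parent": "gone", "content_hash": None},
]
suspects = find_suspect_orphans(sample)
```

Running this kind of check before and after a re-index might help narrow down which event brings the orphans back.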

pudo commented 3 years ago

Can you debug it and submit a patch, please?

mattcg commented 3 years ago

Yes, will do so!

mattcg commented 3 years ago

I was able to replicate this after re-indexing a very large dataset. It's certainly a bug; I just haven't discovered the cause yet.

pudo commented 3 years ago

If you have any lead on what the parent document of the stray fragment might be (and can ideally share it), that would help us debug it.

sunu commented 2 years ago

This might have been related to https://github.com/alephdata/aleph/issues/3923

pudo commented 2 years ago

What a debug find, @sunu. This one had been killing me for ages. That's a super logical explanation....

mattcg commented 2 years ago

Wow sunu, incredible find! Yes, this is definitely the reason. It also explains a problem we were constantly facing: Tables showing up in search results without a parent document that could be downloaded.