alephdata / aleph

Search and browse documents and data; find the people and companies you look for.
http://docs.aleph.occrp.org
MIT License

federated search and entity sharing between aleph instances #105

Closed jbothma closed 6 years ago

jbothma commented 8 years ago

It'd be nice if we didn't need to ingest the same documents in multiple aleph instances.

I think this platform and UX can handle a bit of asynchronous behaviour for collecting results from slow-responding servers that aren't being queried directly.

At a minimum, it'd be cool to get search result summaries and view links for documents in other aleph instances.
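To make the asynchronous collection concrete, here is a minimal sketch that assumes each remote instance can be queried independently. `query_instance` is a hypothetical stand-in for a real HTTP call, and the peer names, delays and timeout are made up for illustration. Result summaries are gathered in arrival order, and peers that miss the deadline are dropped.

```python
import asyncio


async def query_instance(name, delay, hits):
    # Hypothetical stand-in for a network request to a remote aleph
    # instance; `delay` simulates how long that peer takes to answer.
    await asyncio.sleep(delay)
    return {"instance": name, "hits": hits}


async def federated_search(peers, timeout):
    """Collect result summaries in arrival order until `timeout` seconds pass."""
    tasks = [asyncio.create_task(query_instance(*peer)) for peer in peers]
    arrived = []
    try:
        for next_done in asyncio.as_completed(tasks, timeout=timeout):
            arrived.append(await next_done)
    except asyncio.TimeoutError:
        # Give up on peers that are too slow; cancelling a finished
        # task is a no-op, so it is safe to cancel all of them.
        for task in tasks:
            task.cancel()
    return arrived


results = asyncio.run(federated_search(
    [("slow-c", 5.0, 9), ("fast-b", 0.05, 5), ("fast-a", 0.01, 3)],
    timeout=0.5,
))
for summary in results:
    print(summary["instance"], summary["hits"])
```

A real UI would probably render each summary as soon as it arrives rather than waiting for the whole list, but the fan-out and the deadline would look much the same.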

Further, it'd be cool if entities extracted on different servers could work well together.

jbothma commented 8 years ago

@sstrigler are there any tips you'd give to learn from XMPP federation so we don't totally re-invent the wheel?

In an aleph instance I might authenticate, and that identity could be granted access to private collections of documents. It'd be nice if I had access to these private documents when I search on another aleph instance that has federation set up with the one where I have extra privileges.

pudo commented 8 years ago

cc @mattcg regarding this --

sstrigler commented 8 years ago

In XMPP you'd use PubSub nodes for this, where you could grant fine-grained access to items published there. Each server handles a) authentication of the users registered and connecting there and b) authorization for its own PubSub nodes. Federation is based on trust that each member of the federation works correctly, i.e. handles authentication correctly. Trust between servers is established through a mix of trust in DNS and a dialback operation. That's the legacy way. There's also a newer approach leveraging DNSSEC, but it's not widely deployed yet due to lack of support from ISPs.

Dialback basically works like this: Server A establishes a TCP/IP connection to server B, over which it communicates its wish to federate. Server B then establishes a new connection back to the originating server, using a DNS lookup for the domain that was communicated. It's described in detail at http://xmpp.org/extensions/xep-0220.html
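As a rough illustration of the dialback idea, here is an in-memory sketch. It is not an XMPP implementation: the `Server` class, the `dns` dictionary standing in for a DNS lookup, and the domain names are all hypothetical, and the key derivation is only in the style of what XEP-0185 recommends. The point is that the receiving server never trusts the claimed domain directly; it asks whichever server DNS says is authoritative for that domain.

```python
import hashlib
import hmac


def dialback_key(secret, receiving, originating, stream_id):
    # HMAC over "receiving originating stream_id", keyed with a hash of
    # the server secret (modelled loosely on XEP-0185; an assumption).
    hashed_secret = hashlib.sha256(secret.encode()).hexdigest().encode()
    message = f"{receiving} {originating} {stream_id}".encode()
    return hmac.new(hashed_secret, message, hashlib.sha256).hexdigest()


class Server:
    """Hypothetical federation member identified by its domain."""

    def __init__(self, domain, secret):
        self.domain = domain
        self.secret = secret

    def make_key(self, receiving, stream_id):
        # Originating side: key sent along with the wish to federate.
        return dialback_key(self.secret, receiving, self.domain, stream_id)

    def verify_key(self, receiving, stream_id, key):
        # Authoritative side: recompute the key and compare.
        expected = dialback_key(self.secret, receiving, self.domain, stream_id)
        return hmac.compare_digest(expected, key)

    def accept(self, claimed_domain, stream_id, key, dns):
        # Receiving side: "dial back" to the server that DNS lists for
        # the claimed domain, not to whoever opened the connection.
        authoritative = dns.get(claimed_domain)
        if authoritative is None:
            return False
        return authoritative.verify_key(self.domain, stream_id, key)


a = Server("a.example", "secret-of-a")
b = Server("b.example", "secret-of-b")
impostor = Server("a.example", "not-the-real-secret")
dns = {"a.example": a, "b.example": b}  # stand-in for real DNS

good_key = a.make_key("b.example", "stream-1")
forged_key = impostor.make_key("b.example", "stream-1")
print(b.accept("a.example", "stream-1", good_key, dns))
print(b.accept("a.example", "stream-1", forged_key, dns))
```

The impostor fails because the verification request goes to the server that DNS resolves for `a.example`, which is exactly why the scheme is only as trustworthy as DNS itself.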

pudo commented 8 years ago

I'm fairly flexible on this, but I see two fundamental scenarios:

a) Federated search. There's an ongoing debate within the investigative journalism tech community over creating some sort of meta-search protocol, but it's come to a halt. At the moment, @ICIJ are building out their DataShare project, which (last I heard) was supposed to allow decentralised document search using some sort of DHT. Given these developments, I'd seriously consider adopting their protocol, rather than building our own.

In any case, federated search will be fairly complex if it is supposed to be seamless. For example, facet counts and complex filters (e.g. for entities) would need to either be hidden in the results for federated queries, or part of the protocol (and aggregated on the client). Using a protocol that's been primarily designed for investigative reporters will make this weirder still, because there's a lot of awkward security stuff that's likely to end up in there.
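To show what "aggregated on the client" could mean for facet counts, here is a small sketch. The response shape (a `facets` mapping of facet name to value counts) is an assumption made for illustration, not Aleph's actual API.

```python
from collections import Counter


def merge_facets(responses):
    """Sum per-value facet counts across responses from several instances."""
    merged = {}
    for response in responses:
        for facet, buckets in response.get("facets", {}).items():
            merged.setdefault(facet, Counter()).update(buckets)
    return {facet: dict(counts) for facet, counts in merged.items()}


# Hypothetical responses from two federated instances.
instance_a = {"facets": {"entities": {"ACME Ltd": 4, "J. Doe": 1}}}
instance_b = {"facets": {"entities": {"ACME Ltd": 2}, "languages": {"en": 7}}}

merged = merge_facets([instance_a, instance_b])
print(merged)
```

Note that plain summing double-counts any document that has been ingested into more than one instance, so merged counts would be upper bounds unless the protocol also exchanges some stable document identifier.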

b) Aleph-to-Aleph crawling. We've already got the crawler framework, and it would be cool for one Aleph instance to make it easy to be crawled by another. A possible approach, for example, might be to devise a ZIP archive format which includes a given source document, its metadata and the output of text extraction. This ZIP could be dynamically generated by the API and quickly ingested on the other end. Basically, a one-day hack. This could then be extended to (optionally) redirect users to the source Aleph instance when they open a search result.
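The ZIP idea could be sketched roughly as follows. The archive layout (`source/<file name>`, `metadata.json`, `text.txt`) and the function names are assumptions for illustration; no such format is defined in this thread.

```python
import io
import json
import zipfile


def build_bundle(file_name, content, metadata, text):
    """Pack a source document, its metadata and extracted text into ZIP bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("source/" + file_name, content)
        archive.writestr("metadata.json", json.dumps(metadata, sort_keys=True))
        archive.writestr("text.txt", text)
    return buffer.getvalue()


def read_bundle(data):
    """Unpack a bundle on the ingesting side."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        metadata = json.loads(archive.read("metadata.json"))
        text = archive.read("text.txt").decode("utf-8")
        source_names = [n for n in archive.namelist() if n.startswith("source/")]
        content = archive.read(source_names[0])
    return metadata, text, content


bundle = build_bundle(
    "report.pdf",
    b"%PDF-1.4 placeholder bytes",
    {"title": "Annual report", "source_url": "https://aleph.example/documents/1"},
    "Extracted text of the annual report.",
)
metadata, text, content = read_bundle(bundle)
print(metadata["title"], len(content), len(text))
```

Keeping a `source_url` in the metadata is one way the ingesting instance could later redirect users back to the source instance when they open a search result.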

pudo commented 6 years ago

Been open too long, closing until there is a clear and stated user need.