alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.
https://docs.alephdata.org/developers/memorious
MIT License

Indexing Atlassian Confluence #154

Open pudo opened 3 years ago

pudo commented 3 years ago

We have a recurring request from some editors to index project Confluence wikis into Aleph. The idea is to index all the reporters' notes from a given wiki space into an investigation casefile. What we'd need to figure out:

pudo commented 3 years ago

https://atlassian-python-api.readthedocs.io/confluence.html

Rosencrantz commented 3 years ago

Hi @pudo, ex-Confluence developer here. Great to hear that Connie is getting used by some editors. I might be able to provide a couple of random thoughts that may be useful in getting that data into Aleph...

The first thing to consider is whether the editor is using Confluence Cloud or Confluence Server. Although the products have the same name, the codebases are (now) pretty divergent, and the way you achieve things can be significantly different depending on which product you want to interact with. Fun times.

One aspect of Confluence that is common to both Cloud and Server is the export function. If the space is relatively static, meaning the editor has finished working on their notes and simply wants to import them into Aleph, then it might be easier to have the editor export the space using the Confluence export feature (there are numerous export options, HTML and XML for example). This export could then be ingested and transformed into something that Aleph/FtM can handle.
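As a rough illustration of the "ingest and transform" step, here is a minimal sketch that walks an HTML export archive and pulls out a title and plain text per page. It uses only the Python standard library; the function name `iter_export_pages` and the assumption that every `.html` member of the zip is one page are mine, not something Confluence or Aleph guarantees.

```python
import io
import zipfile
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the <title> and the visible text of one exported page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.chunks.append(data.strip())


def iter_export_pages(zip_file):
    """Yield (member_name, title, text) for each .html file in the archive.

    Assumption: every .html member is a page; attachments are skipped.
    """
    with zipfile.ZipFile(zip_file) as archive:
        for name in sorted(archive.namelist()):
            if not name.lower().endswith(".html"):
                continue
            parser = _TextExtractor()
            parser.feed(archive.read(name).decode("utf-8", errors="replace"))
            parser.close()
            yield name, parser.title.strip(), " ".join(parser.chunks)
```

Each yielded tuple could then be mapped onto whatever document entity the casefile expects; that mapping is left out here because it depends on the Aleph/FtM side.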

If that's not viable, then you'll either want to get the rendered content for each page using the Confluence API or find a way of scraping the pages with Memorious, which would leave you with the SSO/2FA challenge.
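For the API route, a minimal sketch of paging through a space might look like the following. It assumes the v1 REST content endpoint (`/rest/api/content` with `spaceKey` and `expand=body.view`); on Cloud the base URL usually ends in `/wiki`, and authentication headers are deliberately left out because they differ between Cloud and Server. The helper names and the injectable `fetch` parameter are illustrative choices, not part of any library.

```python
import json
import urllib.parse
import urllib.request


def content_url(base_url, space_key, start=0, limit=25):
    """Build the listing URL for the pages of one space, with rendered HTML bodies."""
    query = urllib.parse.urlencode({
        "spaceKey": space_key,
        "type": "page",
        "expand": "body.view",
        "start": start,
        "limit": limit,
    })
    return f"{base_url.rstrip('/')}/rest/api/content?{query}"


def fetch_json(url, headers=None):
    """Default fetcher; pass real auth headers here (assumption: JSON response)."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_space_pages(base_url, space_key, fetch=fetch_json, limit=25):
    """Yield (id, title, rendered_html) for every page, following pagination."""
    start = 0
    while True:
        payload = fetch(content_url(base_url, space_key, start, limit))
        results = payload.get("results", [])
        if not results:
            break
        for page in results:
            yield page["id"], page["title"], page["body"]["view"]["value"]
        # Advance by what was actually returned; the server may cap the limit.
        start += len(results)
```

Usage would be along the lines of `for page_id, title, html in iter_space_pages("https://wiki.example.org", "NOTES"): ...`, where the host and space key are placeholders.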

To work around challenges with SSO and 2FA you might be able to create a plugin that is installed on the Confluence instance. This plugin would have access to page content, comments, and attachments and could call back to an API to record that same information in Aleph.

Cloud plugins are effectively microservices and can be written in a bunch of different languages, whereas Server plugins are built in Java. So, that might be something else to consider.

Another entirely random thought here would be to switch things around and, rather than export data from Confluence into Aleph, build an integration from Aleph into Confluence.

Rosencrantz commented 3 years ago

Confluence-space-export-155300.html.zip

The attached is a basic Confluence space export in HTML format. It contains content and attachments but unfortunately no comments. Importing this directly into Aleph produces output similar to the following:

aleph-confluence

It also exports a page which holds the structure of the space (sub-pages etc.). What is somewhat annoying is that the links don't work, so you can't navigate the space easily once it has been uploaded into Aleph. With that said, it might be possible to extend the HTML ingestor to handle this?
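To make the link idea a bit more concrete, here is a minimal sketch of the first half of the problem: finding which links in an exported page point at other files in the same archive. It assumes the export uses relative `href` values between pages, which I have not checked against every export variant; `internal_links` is a hypothetical helper, and how an ingestor would then rewrite those links to point at Aleph documents is not covered.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links(page_path, html, members):
    """Map each relative href in `html` to the archive member it resolves to.

    `page_path` is the page's own path inside the archive and `members`
    is the set of all archive paths. External, fragment-only, and
    dangling links are ignored.
    """
    collector = _LinkCollector()
    collector.feed(html)
    base_dir = posixpath.dirname(page_path)
    resolved = {}
    for href in collector.hrefs:
        parts = urlsplit(href)
        if parts.scheme or parts.netloc or not parts.path:
            continue  # external link or same-page anchor
        target = posixpath.normpath(
            posixpath.join(base_dir, unquote(parts.path))
        )
        if target in members:
            resolved[href] = target
    return resolved
```

With a mapping like this, an ingestor could in principle swap each resolved `href` for the identifier of the corresponding uploaded document.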