danswer-ai / danswer

Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.
https://docs.danswer.dev/

Content extraction from JIRA using API version 3 is incomplete #1677

Open artmatsak opened 1 week ago

artmatsak commented 1 week ago

The function currently used for converting Atlassian Document Format (ADF) to plain text, extract_text_from_content() (source), is very simplistic and has the following issues:

  1. Paragraphs are not separated by newlines
  2. Lists are skipped entirely
  3. Possibly others.

This means that when using JIRA API version 3, Danswer will extract incomplete information from the JIRA issues.
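To make the two named gaps concrete, here is a minimal sketch of a recursive ADF-to-text walk that separates top-level blocks with blank lines and renders bullet and ordered lists. It is not the existing extract_text_from_content() and not a proposed patch; the function name adf_to_text and the output conventions ("-" markers, two-space indentation) are assumptions made for illustration. It relies only on the ADF node fields "type", "content" and "text", and covers a small subset of node types.

```python
from typing import Any


def adf_to_text(node: dict[str, Any], depth: int = 0) -> str:
    """Convert a (subset of an) ADF node tree to plain text.

    Hypothetical helper for illustration; tables, mentions, media and other
    node types are not handled specially.
    """
    node_type = node.get("type")
    children = node.get("content") or []

    if node_type == "text":
        return node.get("text", "")
    if node_type == "hardBreak":
        return "\n"

    if node_type in ("bulletList", "orderedList"):
        indent = "  " * depth
        lines = []
        for index, item in enumerate(children, start=1):
            marker = f"{index}." if node_type == "orderedList" else "-"
            lines.append(f"{indent}{marker} {adf_to_text(item, depth + 1)}")
        return "\n".join(lines)

    if node_type == "listItem":
        # A list item may hold a paragraph followed by a nested list.
        return "\n".join(adf_to_text(child, depth) for child in children)

    if node_type == "doc":
        # Separate top-level blocks (paragraphs, lists, headings) with a blank line.
        parts = [adf_to_text(child, depth) for child in children]
        return "\n\n".join(part for part in parts if part)

    # paragraph, heading and any unrecognised node: concatenate the children.
    return "".join(adf_to_text(child, depth) for child in children)


example_doc = {
    "type": "doc",
    "content": [
        {"type": "paragraph", "content": [{"type": "text", "text": "First paragraph."}]},
        {"type": "paragraph", "content": [{"type": "text", "text": "Second paragraph."}]},
        {
            "type": "bulletList",
            "content": [
                {"type": "listItem", "content": [
                    {"type": "paragraph", "content": [{"type": "text", "text": "one"}]},
                ]},
                {"type": "listItem", "content": [
                    {"type": "paragraph", "content": [{"type": "text", "text": "two"}]},
                ]},
            ],
        },
    ],
}

print(adf_to_text(example_doc))
```

Block containers other than the document root (for example block quotes or table cells) fall through to the last branch, so their paragraphs would run together; a fuller converter would need a case for each of them.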

artmatsak commented 1 week ago

Here is some additional research on this.

First, there doesn't appear to be a readily available Python library for converting ADF into HTML, Markdown or plain text.

Second, we could obtain the issue description as rendered HTML (which is easy to convert to Markdown) by passing expand="renderedFields" in the JIRA issue search request. However, there doesn't seem to be a way to have comment bodies rendered by that same request. Instead, we would have to issue a separate API request for each issue to fetch its comments, specifying expand="renderedBody". I'm not sure that an additional API request per issue is a good idea.
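The request pattern described above can be sketched as follows, mainly to show where the extra per-issue call comes from. The fetch callable is a hypothetical transport (path and query parameters in, decoded JSON out) standing in for whatever HTTP client is used. The endpoint paths and the response keys ("issues", "renderedFields", "comments", "renderedBody") are assumptions based on the JIRA Cloud REST API and should be checked against the API version in use. Pagination is left out.

```python
from typing import Any, Callable

# Hypothetical transport signature: (path, query params) -> decoded JSON body.
Fetch = Callable[[str, dict[str, str]], dict[str, Any]]


def fetch_rendered_issues(fetch: Fetch, jql: str) -> list[dict[str, Any]]:
    """Collect rendered HTML for issue descriptions and comments.

    Costs 1 search request plus 1 comment request per issue returned.
    """
    # One request returns every issue with its description rendered as HTML.
    search = fetch("/rest/api/3/search", {"jql": jql, "expand": "renderedFields"})

    results = []
    for issue in search.get("issues", []):
        key = issue["key"]
        description_html = (issue.get("renderedFields") or {}).get("description") or ""

        # Comments need their own request per issue to get rendered bodies.
        comments = fetch(f"/rest/api/3/issue/{key}/comment", {"expand": "renderedBody"})
        comments_html = [c.get("renderedBody", "") for c in comments.get("comments", [])]

        results.append(
            {"key": key, "description_html": description_html, "comments_html": comments_html}
        )
    return results
```

For N issues in the search result this makes N + 1 requests, which is the cost in question.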