danswer-ai / danswer

Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.
https://docs.danswer.dev/

Add Support for text files with other extensions .org (org mode) or .md (markdown) #1415

Open nausher opened 5 months ago

nausher commented 5 months ago

I have quite a few notes that were created in Emacs Org-mode or Obsidian. These are plain-text Org-mode or Markdown files that have a .org or .md extension instead of .txt.

I uploaded these files to Danswer and they were 'indexed', but none of my search queries pull up any information from them.

Can support be added for text files with a non-'.txt' extension?

nausher commented 5 months ago

I believe the change for this could be as simple as adding ".org" to this line in backend/danswer/connectors/file/utils.py:

_VALID_FILE_EXTENSIONS = [".txt", ".zip", ".pdf", ".md", ".mdx"]

changed to:

_VALID_FILE_EXTENSIONS = [".txt", ".zip", ".pdf", ".md", ".mdx", ".org"]

https://github.com/danswer-ai/danswer/blob/143b50c519d916c81e072d8ca406bf0d87750761/backend/danswer/connectors/file/utils.py#L11
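
For illustration, here is a minimal sketch of how an extension allow-list like this one might gate uploaded files. Only `_VALID_FILE_EXTENSIONS` and its values come from the line quoted above. The helper names `get_file_ext` and `is_accepted` are hypothetical and are not claimed to match what Danswer actually does with the list.

```python
import os

# Allow-list as quoted above, with ".org" appended (the proposed change).
_VALID_FILE_EXTENSIONS = [".txt", ".zip", ".pdf", ".md", ".mdx", ".org"]


def get_file_ext(file_name: str) -> str:
    """Hypothetical helper: return the lowercased extension, including the dot."""
    return os.path.splitext(file_name)[1].lower()


def is_accepted(file_name: str) -> bool:
    """Hypothetical helper: True if the file's extension is in the allow-list."""
    return get_file_ext(file_name) in _VALID_FILE_EXTENSIONS


if __name__ == "__main__":
    # Print whether each sample file name would pass the filter.
    for name in ["notes.org", "README.md", "todo.txt", "photo.png"]:
        print(name, is_accepted(name))
```

If the real filter works along these lines, a file whose extension is missing from the list would be skipped even though the upload itself succeeds, which would be consistent with the behaviour reported for .org files.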

zarlor commented 5 months ago

Hmm... I have ingested .md files without issue. I think it might not read them as formatted files, mind you, but it does seem to accept them and they are searchable for me as text files, at least.

nausher commented 5 months ago

@zarlor - the issue now seems to be limited to ".org" files. The code has a filter that accepts files with the extensions ".md" and ".mdx".