danswer-ai / danswer

Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.
https://docs.danswer.dev/

Search results not clickable when using file connector and specifying metadata link: #1195

Open dbro opened 4 months ago

dbro commented 4 months ago

When using the file connector with the metadata line described at https://docs.danswer.dev/connectors/file, the links in the search results do not work.

In the page HTML, the value of the href attribute is an empty string rather than the link specified. By comparison, the value specified for file_display_name works as expected.

Here is the test file I uploaded to Danswer:

#DANSWER_METADATA={"link": "https://docs.danswer.dev/connectors/file", "file_display_name": "Danswer can index files, too", "primary_owners": ["abc@123.xyz", "def@456.xyz"], "test_tag":"test_value"}
# File Connector

## Access knowledge from Local Files

How it works

The File Connector indexes user uploaded files.
- Currently supports .txt, .pdf, .md and .mdx files
- Can also upload a .zip containing these files
- If there are other file types in the zip, the other file types are ignored
- Optional metadata line that supports links, document owners, and time updated as metadata for Danswer’s retrieval and AI Answer

Adding Metadata

The metadata line should be placed at the very top of the file and can take one of two formats:

These screenshots show what appears in the search results, and in the page html. Notice the blue highlighted line has href="" instead of the expected href="https://docs.danswer.dev/connectors/file"

[Two screenshots: Screenshot 2024-03-07 165657, Screenshot 2024-03-07 165739]
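
To rule out a malformed metadata line as the cause, the first line of the test file can be checked outside Danswer. The sketch below assumes the `#DANSWER_METADATA=` prefix followed by a single-line JSON object, as in the test file above. `parse_metadata_line` is a hypothetical helper written for this check, not a Danswer function, and it only handles the `#` form shown in this issue.

```python
import json

PREFIX = "#DANSWER_METADATA="

def parse_metadata_line(first_line: str) -> dict:
    """Return the JSON object after the prefix, or {} if the prefix is absent."""
    line = first_line.strip()
    if not line.startswith(PREFIX):
        return {}
    # json.loads raises json.JSONDecodeError if the line is not valid JSON.
    return json.loads(line[len(PREFIX):])

first_line = (
    '#DANSWER_METADATA={"link": "https://docs.danswer.dev/connectors/file", '
    '"file_display_name": "Danswer can index files, too", '
    '"primary_owners": ["abc@123.xyz", "def@456.xyz"], "test_tag":"test_value"}'
)
meta = parse_metadata_line(first_line)
print(meta["link"])
print(meta["file_display_name"])
```

If this parses cleanly, the metadata line itself is well-formed and the empty href is more likely to come from how the parsed values are used.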
dbro commented 4 months ago

There might be an issue with an incorrect variable name, introduced with commit https://github.com/danswer-ai/danswer/commit/a4d5ac816e37973fd7d6ec143d5ea4cb6c68a1d5

This line works and refers to the variable called file_metadata : https://github.com/danswer-ai/danswer/blob/3f1cd1ad129683090d85a15f6208bfd5a9428100/backend/danswer/connectors/file/connector.py#L72

This line might not work, and it refers to the variable called metadata : https://github.com/danswer-ai/danswer/blob/3f1cd1ad129683090d85a15f6208bfd5a9428100/backend/danswer/connectors/file/connector.py#L101

So perhaps changing line 101 to use file_metadata.get() instead of metadata.get() would fix it?

Note there are other references to metadata.get() that might need to be updated:
https://github.com/danswer-ai/danswer/blob/3f1cd1ad129683090d85a15f6208bfd5a9428100/backend/danswer/connectors/file/connector.py#L78
https://github.com/danswer-ai/danswer/blob/3f1cd1ad129683090d85a15f6208bfd5a9428100/backend/danswer/connectors/file/connector.py#L106
https://github.com/danswer-ai/danswer/blob/3f1cd1ad129683090d85a15f6208bfd5a9428100/backend/danswer/connectors/file/connector.py#L107

This line is probably OK (?): https://github.com/danswer-ai/danswer/blob/3f1cd1ad129683090d85a15f6208bfd5a9428100/backend/danswer/connectors/file/connector.py#L135
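
The suspected mix-up can be reduced to a few lines. This is a hypothetical reduction, not the actual connector code: only the variable names `file_metadata` and `metadata` come from the linked lines, while the function names and the contents of both dictionaries are assumptions made for illustration.

```python
# Hypothetical reduction of the suspected bug; not the actual connector code.
# `file_metadata` stands for the values parsed from the file's first line,
# `metadata` for some other dictionary that happens to be in scope.

def build_link_suspected(file_metadata: dict, metadata: dict) -> str:
    # Reads from the other dictionary, so a missing key silently yields "".
    return metadata.get("link", "")

def build_link_proposed(file_metadata: dict, metadata: dict) -> str:
    # Reads from the per-file values, as proposed for line 101.
    return file_metadata.get("link", "")

file_metadata = {"link": "https://docs.danswer.dev/connectors/file"}
metadata = {}  # assumption: does not carry the per-file link

print(repr(build_link_suspected(file_metadata, metadata)))
print(repr(build_link_proposed(file_metadata, metadata)))
```

Because `.get()` with a default never raises, a lookup on the wrong dictionary would fail silently, which would be consistent with an empty href rather than an error.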

wbste commented 4 months ago

Just built and seems to work for links now, but I don't see the primary or secondary owner info anywhere. Should it be in the Filters panel?

#DANSWER_METADATA={"link": "https://github.com/danswer-ai/danswer/blob/main/CONTRIBUTING.md", "primary_owners": ["yuhong@danswer.ai", "chris@danswer.ai"], "secondary_owners": ["founders@danswer.ai"], "doc_updated_at": "2024-03-09T13:06:08.589616-08:00", "file_display_name": "Sup Dog!", "type": "banana", "source": "other"}
How to set up captcha
Follow the example below to set up a captcha
like you saw when you visited this page!
By including a captcha, this page is able to
prevent web scrapers from reading it.
eojthebrave commented 2 months ago

It looks like this was broken again by the recent refactoring of the file utility functions in https://github.com/danswer-ai/danswer/pull/1449. That PR introduced the read_text_file function, and in backend/connectors/file/connector.py it needs to be called with ignore_danswer_metadata=False, like this:

file_content_raw, file_metadata = read_text_file(file, encoding=encoding, ignore_danswer_metadata=False)

Otherwise it skips reading the #DANSWER_METADATA content from the file entirely.
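
The effect of the flag can be sketched with a simplified stand-in. `read_text_file_sketch` below is not Danswer's `read_text_file`; the default of `True`, the `#DANSWER_METADATA=` prefix handling, and the choice to leave the first line in the text when the flag is set are all assumptions made for illustration.

```python
import io
import json
from typing import IO, Dict, Tuple

PREFIX = "#DANSWER_METADATA="

def read_text_file_sketch(
    file: IO[bytes],
    encoding: str = "utf-8",
    ignore_danswer_metadata: bool = True,  # assumed default
) -> Tuple[str, Dict]:
    """Return (text, metadata); metadata stays empty when the flag is set."""
    text = file.read().decode(encoding)
    first_line, _, rest = text.partition("\n")
    if not ignore_danswer_metadata and first_line.startswith(PREFIX):
        # Parse the metadata line and keep only the remaining text as content.
        return rest, json.loads(first_line[len(PREFIX):])
    # Flag set (or no metadata line): the first line is treated as plain text.
    return text, {}

raw = b'#DANSWER_METADATA={"link": "https://docs.danswer.dev/connectors/file"}\n# File Connector\n'

_, skipped = read_text_file_sketch(io.BytesIO(raw))
_, parsed = read_text_file_sketch(io.BytesIO(raw), ignore_danswer_metadata=False)
print(skipped)
print(parsed)
```

Under these assumptions, a caller that omits the flag gets an empty metadata dictionary, so any later lookup of the link falls back to its default.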

mcandio commented 2 months ago

Hey @eojthebrave, is it possible to reference the file internally, or does the link have to be external? I want to open the file that I recently uploaded, but I keep getting a 404. Do I need to set a specific path?

eojthebrave commented 2 months ago

@mcandio I'm not sure. I'm very new to this project. What do you mean by reference it internally?

mcandio commented 2 months ago

I mean, for example, if I create the .danswer_metadata.json like this:

[
    {
        "file_display_name": "filename",
        "filename": "filename.pdf",
        "link": "./filename.pdf"
    }
]

What should the link be for the internal path where the file is stored? Does the background deployment or the Postgres database create internal links to host these files?
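
One part of this can be checked independently of Danswer: how a browser resolves a relative link. If the `link` value is emitted unchanged as an href (an assumption; nothing in this thread confirms how Danswer treats relative links), a value such as `./filename.pdf` resolves against the URL of the page showing the result, not against wherever the uploaded file is stored. The sketch below uses Python's standard `urllib.parse`; `page_url` is a hypothetical address for a local deployment and `is_absolute_url` is a helper written for this example.

```python
from urllib.parse import urljoin, urlparse

def is_absolute_url(link: str) -> bool:
    """True if the link carries both a scheme and a host."""
    parts = urlparse(link)
    return bool(parts.scheme and parts.netloc)

page_url = "http://localhost:3000/search"  # hypothetical page showing the result
link = "./filename.pdf"                    # value from the example above

print(is_absolute_url(link))
# Standard relative-reference resolution, as a browser would apply to an href.
print(urljoin(page_url, link))
```

Unless something is served at the resolved address, the request would return a 404, which would be consistent with the behavior described.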

SingTeng commented 1 month ago

I am also seeing search results that are not clickable when using the file connector. Has this issue been resolved, or is there a temporary fix?