danswer-ai / danswer

Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.
https://docs.danswer.dev/

Documents created via the Web connector cannot be updated via the Ingestion API #1613

Open eojthebrave opened 3 weeks ago

eojthebrave commented 3 weeks ago

It is not currently possible to update documents created by the Web connector via the Ingestion API. The Web connector uses the URL of the indexed page as the document ID. However, when you call the Ingestion API and set the ID to the URL of the page, the URL is escaped and therefore no longer matches the existing ID. I think this is because the Ingestion API calls Document::from_base, which in turn uses make_url_compatible. That escapes the URL and results in two non-matching document IDs: one with the standard URL and one with the encoded URL. See https://github.com/danswer-ai/danswer/blob/main/backend/danswer/connectors/models.py#L145

The Web connector, however, creates the Document and assigns the ID without encoding the URL. See https://github.com/danswer-ai/danswer/blob/main/backend/danswer/connectors/web/connector.py#L254
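
To illustrate the mismatch, here is a minimal sketch. It assumes make_url_compatible behaves roughly like percent-encoding the whole string with Python's urllib.parse.quote; I have not confirmed that this is exactly what it does. The function name make_url_compatible_sketch and the example URL are hypothetical stand-ins, not danswer code.

```python
from urllib.parse import quote, unquote


def make_url_compatible_sketch(s: str) -> str:
    # Hypothetical stand-in for danswer's make_url_compatible.
    # Assumption: it percent-encodes every reserved character, including ":" and "/".
    return quote(s, safe="")


# Example page URL (made up for illustration).
page_url = "https://example.com/docs/getting-started"

# The Web connector stores the raw URL as the document ID.
web_connector_id = page_url

# The Ingestion API path (as described above) escapes the ID first.
ingestion_api_id = make_url_compatible_sketch(page_url)

print(web_connector_id)   # https://example.com/docs/getting-started
print(ingestion_api_id)   # https%3A%2F%2Fexample.com%2Fdocs%2Fgetting-started

# The two IDs differ, so an update would not match the existing document.
print(web_connector_id == ingestion_api_id)  # False

# Decoding the escaped ID recovers the original URL.
print(unquote(ingestion_api_id) == web_connector_id)  # True
```

If the real function works this way, the two IDs can never be equal for any URL that contains ":" or "/", which is every URL.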

It's not clear why make_url_compatible is used, or whether it is safe to remove. I think answering that question is the first step toward resolving this issue.