Support for pdf file context provider

continuedev / continue

⏩ Continue is the leading open-source AI code assistant. You can connect any models and any context to build custom autocomplete and chat experiences inside VS Code and JetBrains

https://docs.continue.dev/

Apache License 2.0


Support for pdf file context provider #483

Open rpg2807 opened 11 months ago

rpg2807 commented 11 months ago

First of all, thanks for the extension. It seems like a great tool. I tried supplying a link to an online PDF file using @url, but it seemed to read the encoded PDF file as plain text. Please show me how to add PDF file content as context, or add the feature. Btw, I see an embeddings context provider in the plug-in directory. Not sure how to use it though.

sestinj commented 11 months ago

@rpg2807 We have documentation here on how to add a context provider: https://continue.dev/docs/customization/context-providers#building-your-own-context-provider

Once you have a new context provider (the embeddings provider included), you can add it to your ~/.continue/config.py like this:

from continuedev.src.continuedev.plugins.context_providers.github import GitHubIssuesContextProvider

...
config=ContinueConfig(
  ...
  context_providers=[
    GitHubIssuesContextProvider(
      repo_name="continuedev/continue",  # change to whichever repo you want to use
      auth_token="<my_github_auth_token>",
    )
  ]
)

The embeddings context provider might work, so you could give it a try, but we will be working on it later this week and I can share when we have a production-ready version : )

sestinj commented 11 months ago

It might make sense (and be easier) to just add .pdf support to the URLContextProvider: basically an if statement that checks whether the URL points to a .pdf and, if so, decodes the response to text in the way that format needs. That could be implemented in this function without rewriting any of the context provider logic.
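A minimal sketch of the "if statement" idea above, not the actual URLContextProvider code. The function names (is_pdf_url, pdf_bytes_to_text, body_to_text) are hypothetical, and the PDF step assumes PyPDF2 3.x, where the reader class is PdfReader (the older PdfFileReader name was removed in 3.0). The download itself is left out; the sketch only shows how already-fetched bytes could be routed.

```python
from io import BytesIO
from typing import Callable
from urllib.parse import urlparse


def is_pdf_url(url: str) -> bool:
    """Return True when the URL path ends in .pdf (query string ignored)."""
    return urlparse(url).path.lower().endswith(".pdf")


def pdf_bytes_to_text(data: bytes) -> str:
    """Extract text from raw PDF bytes. Assumes PyPDF2 >= 3.0 is installed."""
    # Imported lazily so that non-PDF URLs keep working without PyPDF2.
    from PyPDF2 import PdfReader

    reader = PdfReader(BytesIO(data))
    # extract_text() can return an empty result for image-only pages.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def body_to_text(
    url: str,
    body: bytes,
    pdf_decoder: Callable[[bytes], str] = pdf_bytes_to_text,
) -> str:
    """Route a downloaded response body to the right decoder for its URL."""
    if is_pdf_url(url):
        return pdf_decoder(body)
    # Fall back to treating the body as text, as the provider did before.
    return body.decode("utf-8", errors="replace")
```

Checking only the file extension misses PDFs served from URLs without a .pdf suffix; inspecting the Content-Type header or the leading %PDF- bytes would be a stricter alternative.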

rpg2807 commented 11 months ago

Thanks for the suggestion. That does look like an easier way around. When I tried it, I had trouble correctly installing/importing the PyPDF2 module. I tried adding PyPDF2 to requirements.txt and explicitly calling 'pip install PyPDF2' in build.sh, but neither helped. When I load the extension, this is what I get:

File "/tmp/_MEI9a8UlQ/continuedev/src/continuedev/plugins/context_providers/url.py", line 7, in <module>
    from PyPDF2 import PdfFileReader

ModuleNotFoundError: No module named 'PyPDF2'

Could you please explain how to add a Python module for use in a context provider?

sestinj commented 11 months ago

@rpg2807 This is a limitation of running the server as a "frozen" binary. Any packages not included in the binary cannot be imported afterward, so you'll have to bundle it into the binary.
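As a rough diagnostic for the situation described above, the following sketch reports whether the interpreter is running from a PyInstaller bundle and which top-level modules cannot be found. The helper names (running_frozen, missing_modules) are hypothetical; the sys.frozen and sys._MEIPASS attributes are the ones PyInstaller sets at runtime, which matches the /tmp/_MEI... path in the traceback.

```python
import importlib.util
import sys
from typing import Iterable, List


def running_frozen() -> bool:
    """True when executing inside a PyInstaller bundle."""
    # PyInstaller sets sys.frozen and, for its bootloader, sys._MEIPASS.
    return bool(getattr(sys, "frozen", False)) and hasattr(sys, "_MEIPASS")


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be located."""
    # find_spec returns None for an unknown top-level name; dotted names
    # with a missing parent package would raise instead, so pass top-level
    # names only.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("frozen:", running_frozen())
    print("missing:", missing_modules(["PyPDF2"]))
```

Running this from inside the built server, rather than from the development virtualenv, would show whether PyPDF2 actually made it into the binary.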

You did the right thing by adding it to requirements.txt, but sometimes PyInstaller misses a few imports when building the binary, and you'll have to list those in the hiddenimports field in continue/run.spec. So you could try adding 'PyPDF2' to this list and then building again.
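For illustration, this is roughly what the relevant part of a PyInstaller spec file looks like. The contents of the real continue/run.spec are not shown in this thread, so the entry script name and the surrounding arguments here are assumptions; only the hiddenimports argument of Analysis is the point. Note that PyInstaller spells the argument hiddenimports, without an underscore.

```python
# Fragment of a PyInstaller .spec file (a sketch, not the project's actual file).
# Analysis is provided by PyInstaller when it executes the spec, so this
# fragment is not meant to be run on its own.
a = Analysis(
    ["run.py"],  # hypothetical entry script
    hiddenimports=[
        # ...any entries already present in the list stay as they are...
        "PyPDF2",  # force-bundle a module the import scan did not pick up
    ],
)
```

The command-line equivalent, when building without a spec file, is the --hidden-import option, for example: pyinstaller --hidden-import PyPDF2 run.py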

oldluke92 commented 4 months ago

@rpg2807 Did you make progress here? If so, are you willing to share what you did?