Closed AlmogDavid closed 3 months ago
I'd also like more detail about how OpenResearcher works. As best I can tell from the readme, it's using HTML versions of papers rather than the PDFs, which arXiv admits is "experimental" and advises the conversion process can "sometimes display errors." Does an open-source model that an OpenResearcher user chooses need to be trained on how the conversion process might introduce errors? I imagine using HTML instead of PDF reduces the HD space required, but it shouldn't come at the cost of accuracy. Does OpenResearcher download/store papers based on what the user asks, or does it just pre-download everything?
What is the minimal GPU memory size (can it be run without a GPU)? What is the minimal RAM? What is the minimal HD space needed?
This is awesome work, thank you!
Thank you for your acknowledgment! Here are the hardware requirements and considerations for running OpenResearcher:
GPU Requirements:
- For running the system: 1 GPU is needed
- For our experiments: we used an A800 80GB GPU
Vector Embedding Process:
- Requires at least one GPU
- Memory usage can be adjusted via two settings (in `save_qdrant_indexing.sh`):
  - `embed_batch_size`
  - `insert_batch_size`
- Example: setting these to 2 and 4 respectively should allow running on GPUs with less than 24GB of memory
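To illustrate why those two settings bound peak memory, here is a minimal Python sketch of batched indexing. This is my own illustration, not OpenResearcher's actual code; the function and callback names are hypothetical.

```python
# Hypothetical sketch of batched indexing; not OpenResearcher's actual code.
# embed_batch_size bounds how many chunks hit the GPU per forward pass;
# insert_batch_size bounds how many vectors go to the vector store per upsert.
def index_documents(chunks, embed_fn, insert_fn,
                    embed_batch_size=2, insert_batch_size=4):
    vectors = []
    for i in range(0, len(chunks), embed_batch_size):
        # Only embed_batch_size chunks are resident on the GPU at once.
        vectors.extend(embed_fn(chunks[i:i + embed_batch_size]))
    for j in range(0, len(vectors), insert_batch_size):
        insert_fn(vectors[j:j + insert_batch_size])
    return len(vectors)
```

Smaller batch sizes trade indexing speed for a lower peak memory footprint, which is why 2/4 fits under 24GB.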
Deployment (after vectorization):
- Requires 1 GPU
- Minimal memory usage:
  - Without the local reranker: less than 8GB of GPU memory
  - With the local reranker: may use up to the GPU's full rated memory
RAM and HD space:
- Requirements depend on the size of the Elasticsearch and Qdrant databases, which is directly related to the amount of document content you want to vectorize
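For a rough sense of the disk needed for the vector store, you can estimate raw vector bytes from chunk count and embedding dimension (gte-large-en-v1.5 produces 1024-dimensional vectors). The helper below is only an illustrative back-of-envelope, ignoring Qdrant payloads, index overhead, and the Elasticsearch text index.

```python
# Illustrative back-of-envelope only: raw float32 vector storage, ignoring
# Qdrant payloads/index overhead and the Elasticsearch text index.
def approx_vector_bytes(num_chunks: int, dim: int = 1024,
                        bytes_per_float: int = 4) -> int:
    return num_chunks * dim * bytes_per_float

# e.g. 1M chunks -> ~4 GB of raw vectors before overhead
```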
You can adjust the configuration to fit your available hardware resources.
Thank you for your question. Given the inherent complexity of PDF formats, we opted against direct text extraction from PDFs. In our initial experiments, we employed SmartPDFLoader for PDF text processing; while it performed admirably in most cases, it occasionally ran into issues such as incorrect paragraph segmentation.

During the document vectorization process, we did not use any open-source LLMs for document processing. We only utilized an open-source embedding model (specifically, gte-large-en-v1.5 in this project), so there is no need to train a model.

If arXiv is unable to convert certain papers into HTML format, those papers can be processed separately. For example, you could use SmartPDFLoader to convert the PDFs into the required format, or employ other alternative methods.

OpenResearcher offers users the flexibility to customize document handling. By modifying the code logic in `connector/html_parsing.py`, users can adapt the system to their specific needs. The key requirement is to convert documents (whether PDFs or other formats) into either a list of strings (`list[str]`) or a list of `llama_index.core.Document` objects. Once in this format, the content can be processed and added to the vector database. The vector database also supports adding new documents incrementally, so users can insert documents in stages rather than processing all text content in a single operation.
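As a concrete illustration of that contract, here is a minimal hypothetical loader; the function name and cleanup logic are my own, not from `connector/html_parsing.py`. It returns a plain `list[str]`, which could equally be wrapped into `llama_index.core.Document` objects carrying per-paper metadata.

```python
# Hypothetical custom loader sketch; any extraction backend works as long as
# the result is list[str] (or list[llama_index.core.Document]).
def load_fallback_sections(raw_sections: list[str]) -> list[str]:
    # Drop empty sections and collapse whitespace so chunks embed cleanly.
    return [" ".join(s.split()) for s in raw_sections if s.strip()]
```

Because insertion is incremental, a loader like this can be run per paper (e.g. only for papers that failed HTML conversion) and its output appended to the existing index.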