Closed mailong25 closed 1 year ago
(1) From your examples it doesn't look like extracting content from HTML is required; however, I'm sure there are tools available online that will extract appropriately.

(2 + 3) Indeed, selecting the appropriate knowledge sentence is an open research problem. BB3 was trained to select the most appropriate knowledge sentence from a set of relevant returned search results, and this is indeed the knowledge response module. We truncate each document to ~500 characters before providing it to the agent; these documents are newline-delimited. Also note that the full context + documents must fit into a truncation length of 1024 tokens for BB3 3B (2048 for 30B/175B).

(4) SeeKeR mapped Bing URLs to Common Crawl webpages, whereas BB3 uses snippets from Mojeek; determining which is better is an open research question as well.
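To make the truncation behavior described above concrete, here is a minimal sketch (not ParlAI's actual code; the function name and constant are hypothetical) of capping each retrieved document at ~500 characters and joining the results newline-delimited before they reach the agent:

```python
# Hypothetical sketch of the per-document truncation described above.
# ParlAI's real implementation may differ; only the ~500-character cap
# and newline delimiting come from the maintainer's answer.

DOC_CHAR_LIMIT = 500  # per-document character budget mentioned above


def pack_documents(docs, char_limit=DOC_CHAR_LIMIT):
    """Truncate each document to char_limit chars and newline-delimit them."""
    truncated = [doc[:char_limit] for doc in docs]
    return "\n".join(truncated)


docs = ["A" * 1200, "short doc", "B" * 600]
packed = pack_documents(docs)
print([len(d) for d in packed.split("\n")])  # -> [500, 9, 500]
```

Note that this character-level cap is separate from the token-level truncation (1024 tokens for BB3 3B); the full context plus the packed documents must still fit within that token budget after tokenization.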
This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.
I have chatted with the BB3 model using an Internet search server from JulesGM/ParlAI_SearchEngine. However, the results are not good, mostly because the retrieved documents are very noisy. Here is an example:
Search Queries: ['dota 2']
Search URLs:
Examples of search_knowledge_doc_content (retrieved documents): Doc_1
Doc_2
Doc_3
As you can see, there is a lot of noise in the retrieved documents, so I wonder what the detailed implementation is for parsing the results returned by the search server (i.e., how to go from search URLs to the final retrieved documents). I believe there are several problems, such as:

(1) How do I extract text from the HTML content of a page?

(2) The text extracted in (1) might be very long and contain a lot of noisy/irrelevant information. How do I select only the relevant part? Does BB3 use a trained model for this kind of selection?

(3) I saw that the paper uses a "knowledge response model" to generate a sequence referred to as the knowledge response, given the full input context and a set of retrieved documents. Are these documents the full text of the pages retrieved in (1), or are they truncated?

(4) BB3 used Mojeek as a search server while SeeKeR used the Microsoft Bing API. I wonder which one gives better results?

I have looked into the technical papers, but these problems are not addressed there.
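For point (1), here is a hedged sketch of one way to strip markup from raw HTML using only the Python standard library. This is not the extractor the search server actually uses (tools like trafilatura or BeautifulSoup would be more robust); it just illustrates the idea of dropping script/style content and keeping visible text:

```python
# Hypothetical HTML-to-text sketch for question (1). Not ParlAI's code;
# it only demonstrates skipping non-visible tags and collecting text nodes.
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    SKIP_TAGS = {"script", "style", "noscript"}  # tags whose text is noise

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # >0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML page, whitespace-joined."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)


page = ("<html><head><script>var x = 1;</script></head>"
        "<body><h1>Dota 2</h1><p>A game.</p></body></html>")
print(html_to_text(page))  # -> "Dota 2 A game."
```

Even with extraction like this, the noise problem in (2) remains: boilerplate navigation and footer text survive, which is presumably why BB3 relies on a trained knowledge module downstream rather than on extraction alone.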