facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License
10.47k stars 2.09k forks

BB3 Internet search: From search URLs to final retrieved documents #4859

Closed mailong25 closed 1 year ago

mailong25 commented 1 year ago

I have chatted with the BB3 model using the Internet search server from JulesGM/ParlAI_SearchEngine. However, the results are poor, mostly because the retrieved documents are very noisy. Here is an example:

Search Queries: ['dota 2']

Search URLs:

https://www.pcgamingwiki.com/wiki/Dota_2
https://en.wikipedia.org/wiki/Dota_2
https://www.oneesports.gg/dota2/valve-announces-new-dota-2-hero-muerta/
https://dotesports.com/dota-2/news/muerta-revealed-as-next-dota-2-hero-at-ti11
https://dotesports.com/dota-2/news/quincy-crew-curse-soniqs-are-out-of-dota-2-less-than-three-months-after-signing-team

Examples of search_knowledge_doc_content (retrieved documents):

Doc_1

* Explore\n* Lists\n* Games\n* Categories\n* Random page\n* Recent changes\n* Troubleshooting guide\n* Editing\n* Editing guide\n* Sample article\n* Projects\n* Taxonomy\n* Wiki policy\n* Maintenance\n* Changelog\n* Community\n* Assignments\n* Discord\n* Files\n* Files policy\n* Forums\n* PCGW Account\n* Other communities\n* About\n* About\n* Conduct\n* FAQ\n* Staff\n* Donate\n* Tools\n* What links here\n* Related changes\n* Special pages\n* Printable version\n* Permanent link\n* Page information\n* Page values\n* Talk\n* Contributions\n*

Doc_2

'# Dota 2\nFrom Wikipedia, the free encyclopedia\nJump to navigation Jump to search\n2013 video game\n2013 video game\nDota 2\nDeveloper(s)Valve\nPublisher(s)Valve\nDesigner(s)IceFrog\nWriter(s)\n* Marc Laidlaw\n* Ted Kosmatka\n* Kris Katz\nComposer(s)\n* Jason Hayes\n* Tim Larkin\nSeriesDota\nEngineSource 2[a]\nPlatform(s)\n* Windows\n* Linux\n* OS X\nRelease\n* Windows\n* July 9, 2013\n* Linux, OS X\n* July 18, 2013\nGenre(s)MOBA\nMode(s)Multiplayer\nDota 2 is a 2013 multiplayer online battle arena (MOBA) video game developed\nand'

Doc_3

"*   *   *   *   *\nAbout Press T&C Contact Us\n* Mobile Legends\n* LEAGUE OF LEGENDS\n* Valorant\n* Dota 2\n* Pick'em\n* Genshin Impact\n* Anime\n* More\n* Cosplay\n* Culture\n* Call of Duty\n* Wild Rift\n* Free Fire\n* PUBG\n* Tekken\n* Street Fighter\n* Fortnite\n* Gaming\n* Events\n* About us\n* Work with us\n* Partner with us\n* Press\n* PRIVACY\n* Contact Us\nShop\n* en\n* English\n* Bahasa Indonesia\n* Filipino\n* Tiếng Việt\n* ไทย\nLogin\nLoading...\n* Mobile Legends\n* LEAGUE OF LEGENDS\n* Valorant\n* Dota 2\n* Pick'em\n* Genshin"

As you can see, the retrieved documents contain a lot of noise, so I would like to know in detail how the results returned by the search server are parsed (that is, how to go from search URLs to the final retrieved documents). I believe several problems arise:

(1) How do I extract text from the HTML content of the page?

(2) The text extracted in (1) might be very long and contain a lot of noisy or irrelevant information. How do I select only the relevant part? Does BB3 use a trained model for this kind of selection?

(3) I saw that the paper uses a "knowledge response model" to generate a sequence, referred to as the knowledge response, given the full input context and a set of retrieved documents. Are these documents the full text of the pages retrieved in (1), or are they truncated?

(4) BB3 used Mojeek as its search server, while SeeKeR used the Microsoft Bing API. Which one gives better results?

I have looked into the technical papers, but these problems are not addressed there.

klshuster commented 1 year ago

(1) From your examples, it doesn't look like extracting content from HTML is required; however, I'm sure there are tools available online that will extract the text appropriately.

(2 + 3) Indeed, selecting the appropriate knowledge sentence is an open research problem. BB3 was trained to select the most appropriate knowledge sentence from a set of relevant returned search results; this is the knowledge response module. We truncate each document to ~500 characters before providing it to the agent, and the documents are newline-delimited. Also note that the full context plus the documents must fit into a truncation length of 1024 tokens for BB3 3B (2048 for 30B/175B).

(4) SeeKeR mapped Bing URLs to Common Crawl webpages, whereas BB3 uses snippets from Mojeek; determining which is better is an open research question as well.
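For readers who want to try the preprocessing described in this reply on their own search server, here is a minimal sketch, not the actual BB3 or ParlAI code. It uses only the Python standard library. The names `TextExtractor`, `html_to_text`, `build_knowledge_input`, and `SKIP_TAGS` are hypothetical, the list of tags to skip is a guess at what usually holds navigation boilerplate, and collapsing the whitespace inside each document (so that a newline only ever separates documents) is an assumption. Only the ~500-character truncation and the newline delimiter come from the reply.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is usually boilerplate, not page content.
SKIP_TAGS = {"script", "style", "nav", "header", "footer"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside at least one skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text of an HTML page with skipped regions removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_knowledge_input(docs, max_chars=500):
    """Truncate each document to max_chars and join the results with newlines."""
    truncated = []
    for doc in docs:
        # Assumption: flatten internal whitespace so each document is one line.
        flat = " ".join(doc.split())
        truncated.append(flat[:max_chars])
    return "\n".join(truncated)


if __name__ == "__main__":
    page = (
        "<html><body><nav><a>Home</a></nav>"
        "<p>Dota 2 is a MOBA.</p><script>track()</script></body></html>"
    )
    print(build_knowledge_input([html_to_text(page), "x" * 600]))
```

The sketch does not enforce the 1024-token (or 2048-token) limit mentioned above, because that depends on the model's tokenizer; the combined context and documents would still need to be checked against that limit separately.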

github-actions[bot] commented 1 year ago

This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.