facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License
10.45k stars 2.1k forks

blenderbot2.0 search engine related question #4585

Closed daje0601 closed 2 years ago

daje0601 commented 2 years ago

When I read the paper and saw the model architecture figure, I understood that DPR and internet search could be used together.

But while looking through the code, a question came to me: "Is it possible to use DPR and an internet search engine at the same time?" I ask because I traced the code and found the snippet below in ParlAI/parlai/agents/rag/retrievers.py:

    # only build retriever when not converting a BART model
    retriever = RetrieverType(opt['rag_retriever_type'])
    if retriever is RetrieverType.DPR:
        return DPRRetriever(opt, dictionary, shared=shared)
    elif retriever is RetrieverType.TFIDF:
        return TFIDFRetriever(opt, dictionary, shared=shared)
    elif retriever is RetrieverType.DPR_THEN_POLY:
        return DPRThenPolyRetriever(opt, dictionary, shared=shared)
    elif retriever is RetrieverType.POLY_FAISS:
        return PolyFaissRetriever(opt, dictionary, shared=shared)
    elif retriever is RetrieverType.SEARCH_ENGINE:
        return SearchQuerySearchEngineRetriever(opt, dictionary, shared=shared)
    elif retriever is RetrieverType.SEARCH_TERM_FAISS:
        return SearchQueryFAISSIndexRetriever(opt, dictionary, shared=shared)
    elif retriever is RetrieverType.OBSERVATION_ECHO_RETRIEVER:
        return ObservationEchoRetriever(opt, dictionary, shared=shared)

Please understand if I have misunderstood something and asked the wrong question. Thanks!

mojtaba-komeili commented 2 years ago

Are you referring to the "Internet-Augmented Dialogue Generation" paper? Because in there we had either DPR or the Internet search. And what you see here for the code is correct: you can't have them both in its current setup.
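A minimal sketch of what that single choice looks like in code (assuming a working ParlAI install): the RetrieverType enum maps the one --rag-retriever-type value to exactly one branch of the factory quoted above, so dpr and search_engine are alternatives rather than a combination.

    # Minimal sketch (assumes ParlAI is installed): the factory instantiates exactly one
    # retriever, selected by --rag-retriever-type, so DPR and internet search are
    # mutually exclusive choices in this setup.
    from parlai.agents.rag.retrievers import RetrieverType

    print(RetrieverType('dpr'))            # RetrieverType.DPR -> DPRRetriever
    print(RetrieverType('search_engine'))  # RetrieverType.SEARCH_ENGINE -> SearchQuerySearchEngineRetriever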

daje0601 commented 2 years ago

> Are you referring to the "Internet-Augmented Dialogue Generation" paper? Because in there we had either DPR or the Internet search. And what you see here for the code is correct: you can't have them both in its current setup.

Thank you so much!

daje0601 commented 2 years ago

> Are you referring to the "Internet-Augmented Dialogue Generation" paper? Because in there we had either DPR or the Internet search. And what you see here for the code is correct: you can't have them both in its current setup.

Hello @mojtaba-komeili, I have reread the "Internet-Augmented Dialogue Generation" paper three times, but I cannot find the statement "we had either DPR or the Internet search." Could you let me know which section says this?

mojtaba-komeili commented 2 years ago

In section 3 of the paper we discuss the two alternative methods that we use for retrieval: 3.1) FAISS-based methods, which are the DPR ones, and 3.2) SEA, which is our internet search using an external search engine. Also, if you check tables 4 and 5, under "Knowledge Access Method" we specify whether the external search engine or DPR was used. Note that there is an intermediate technique, "Search Query+FAISS": for that one we generated a search query, but instead of sending it to the search engine we used DPR retrieval with the generated query. Hope this clarifies things, but let us know if there are further questions.
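A rough mapping (my own reading of the paper and code, not something stated in the repo) from these retrieval methods to the rag_retriever_type values in the factory quoted earlier:

    # Rough mapping (my own reading, not an official reference) from the paper's
    # retrieval methods to --rag-retriever-type values in parlai/agents/rag/retrievers.py.
    paper_method_to_retriever_type = {
        'FAISS-based / DPR (Sec 3.1)': 'dpr',
        'Search engine / SEA (Sec 3.2)': 'search_engine',
        'Search Query + FAISS': 'search_term_faiss',  # generated query goes to FAISS/DPR instead of the search engine
    }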

mojtaba-komeili commented 2 years ago

Re-opening this, since the question seems to be still standing.

daje0601 commented 2 years ago

> Re-opening this, since the question seems to be still standing.

I will read it again and then ask my questions. Thank you so much for your kind guidance. :)

daje0601 commented 2 years ago

I think I understand your methods a little better now.

I have some new questions.

Q1 (section 3.2): Do training and inference not actually perform a live internet search, only looking up values from the Common Crawl table?
Q2 (section 3.2): If values are looked up from the Common Crawl table, is it possible to use documents that are updated in real time?
Q3 (section 4.1): How is the 84.81% calculated?
Q4 (table 2): The total number of unique URLs selected is written as 29,500, but train, valid, and test sum to 30,252. What could I have misunderstood?
Q5 (table 1): The total number of unique domains selected is written as 11,963, but train, valid, and test sum to 13,407. What could I have misunderstood?
Q6 (table 2): In table 2, T5 performed the best, so was speed the reason you chose BART? I am guessing.


My goal is to understand the BlenderBot 2.0 pipeline (workflow) and training method, and then to build a customized Korean version. This was written using a translator, so there may be some awkwardness. Please bear with me.

mojtaba-komeili commented 2 years ago

Hi, sure, glad to answer:

Q1: Yes, you are right. We retrieve the documents from the Common Crawl table, but the URLs come from a web search engine (Bing). So we receive the URLs from Bing, look up the ones that are available in CC, and return those. We restricted our results to CC for the sake of this research.
Q2: If you want to do that, you may use the web pages directly. We avoided that for research purposes, but one may try it (provided the licensing agreement permits that usage).
Q3: That is the percentage of responses for which the Wizard searched for something in order to respond. It means that for ~15% of the Wizard's responses there was no need to search, and the Wizard responded directly.
Q4 and Q5: The reason is that some of the URLs are shared between the splits. We don't count them twice in the mixed dataset.
Q6: Good question. We had most of our setup ready for BART, and the T5 you mention was a bigger model. But the general conclusions can possibly be extended to T5 too.
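A toy illustration (made-up URLs, not the actual dataset) of the Q4/Q5 point: per-split unique counts can sum to more than the unique count of the combined dataset, because URLs shared across splits are counted in each split but only once in the union.

    # Toy example with made-up URLs: overlapping items inflate the per-split sum
    # but are counted once in the combined (mixed) dataset.
    train = {'a.com/1', 'b.com/2', 'c.com/3'}
    valid = {'b.com/2', 'd.com/4'}
    test = {'c.com/3', 'e.com/5'}

    print(len(train) + len(valid) + len(test))  # 7: sum of per-split unique counts
    print(len(train | valid | test))            # 5: unique URLs overall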

daje0601 commented 2 years ago

Thank you so much for your kind reply.

I have another question. I want to train on the msc task with BlenderBot 2.0 on a server, so I entered the command below and ran it. It was written by referring to the ParlAI issues pages.

Q1) Did I write the command correctly? (My goal is to use the internet-augmented method.) Q2) If it's wrong, what should I fix?

    !parlai train_model -dp data \
    --model projects.blenderbot2.agents.blenderbot2:BlenderBot2FidAgent \
    --task msc --num_epochs 10 --include_last_session True \
    --memory-decoder-model-file "" --memory-key personas \
    --search-query-generator-model-file zoo:blenderbot2/query_generator/model --search-query-generator-beam-min-length 2 \
    --save-every-n-secs 600 --validation_every_n_secs 600 --log_every_n_secs 60 \
    --init-model zoo:blenderbot2/blenderbot2_400M/model --dict-file zoo:blenderbot2/blenderbot2_400M/model.dict \
    --datatype train:stream \
    --embeddings-scale True --variant prelayernorm --split-lines True --learn-positional-embeddings True \
    --n-layers 12 --embedding-size 1024 --ffn-size 4096 --n-heads 16 --n-decoder-layers 12 \
    --dict-tokenizer gpt2 --generation-model bart \
    --query-model bert_from_parlai_rag \
    --rag-model-type token --rag-retriever-type search_engine --search_server None \
    --dpr-model-file zoo:hallucination/bart_rag_token/model \
    --gold-document-titles-key select-docs-titles --insert-gold-docs True \
    --beam-min-length 20 --beam-context-block-ngram 3 --beam-block-ngram 3 --beam-block-full-context False --beam-size 10 \
    --inference beam --optimizer mem_eff_adam --learningrate 1e-05 --lr-scheduler-patience 1 --model-parallel True \
    --knowledge-access-method memory_only --batchsize 1 \
    --truncate 512 --text-truncate 512 --label-truncate 128 \
    --dropout 0.0 --attention-dropout 0.0 \
    --min-doc-token-length 64 --max-doc-token-length 256 \
    --fp16 True --fp16-impl mem_efficient --force-fp16-tokens True

It runs for about 1~2 hours, and then there is no change in the output shown below for several hours. I can't figure out what I'm doing wrong, since I'm not getting a specific error. Should I make corrections, or just wait?

[screenshot of the stalled training output]
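Not a verified recipe, just a hedged reading of the options: the values below are the ones that appear to decide whether internet search actually happens, and in the command above --knowledge-access-method memory_only together with --search_server None would mean no internet search is performed. The server address shown is a hypothetical placeholder for a running ParlAI search server.

    # Hedged sketch of the search-related settings, expressed as an opt-style dict;
    # flag spellings follow the command above, values are assumptions to be verified.
    search_overrides = {
        'rag_retriever_type': 'search_engine',     # retriever that queries an external search server
        'knowledge_access_method': 'search_only',  # or 'classify' / 'all'; 'memory_only' skips search
        'search_server': '127.0.0.1:8080',         # hypothetical host:port of a running search server
    }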

daje0601 commented 2 years ago

> Are you referring to the "Internet-Augmented Dialogue Generation" paper? Because in there we had either DPR or the Internet search. And what you see here for the code is correct: you can't have them both in its current setup.

> you can't have them both in its current setup.

When I test inference, my bot uses both RAG (DPR) and internet search, although the search engine throws an error.

[screenshot of the inference output]

I am confused between your explanation, "you can't have them both in its current setup," and my bot's answers. Please bear with me.

klshuster commented 2 years ago

The bot is still able to answer without search results

daje0601 commented 2 years ago

> The bot is still able to answer without search results

Thank you!