alishiba14 opened this issue 5 years ago
Hi,
Are your labels binary (relevant / non-relevant)?
If so, use a baseline ranker, e.g. BM25, to retrieve the top 100 documents for each query. A training instance is then (query, a relevant doc, a non-relevant doc from the top 100).
You may want to randomly sample the non-relevant documents from the top 100 instead of using all of them.
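That pairing-and-sampling step can be sketched in a few lines of Python. The `top100` and `qrels` structures here are hypothetical stand-ins, not part of the K-NRM code:

```python
import random

def make_pairs(top100, qrels, n_neg=10, seed=13):
    """Build pairwise instances (query, relevant_doc, non_relevant_doc).

    top100 -- dict: query -> list of doc ids ranked by the baseline (e.g. BM25)
    qrels  -- dict: query -> set of doc ids labeled relevant
    n_neg  -- negatives to sample per relevant doc, instead of using all
    """
    rng = random.Random(seed)
    pairs = []
    for query, ranked in top100.items():
        rel = qrels.get(query, set())
        positives = [d for d in ranked if d in rel]
        negatives = [d for d in ranked if d not in rel]
        for pos in positives:
            # sample without replacement so one relevant doc doesn't
            # dominate the training set
            for neg in rng.sample(negatives, min(n_neg, len(negatives))):
                pairs.append((query, pos, neg))
    return pairs
```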
On Mon, May 13, 2019 at 2:15 AM Alishiba Dsouza wrote:
Hi, I have a dataset of 16,000 docs and some queries. For each query there can be more than one relevant document. Can you tell me how I can prepare my data, and also how to do the evaluation?
Let's say for a query 'Apple' the relevant docs are 100, 120, 400, and the rest are all non-relevant. Then a training instance would be 'Apple' \t 100,120,400 \t remaining docs \t score. Is that the correct representation?
@alishiba14 I think it's right, but how can we represent the remaining docs? And the score, I think we can get it from BM25, right?
Let's say the query is 'Apple', and its relevant documents are:
Very relevant doc (score=2): 'iPhone X - apple.com'
Somewhat relevant doc (score=1): 'apple inc - wikipedia'
and 10 other non-relevant docs retrieved by BM25:
Non-rel doc 1 (score=0): 'apple juice is healthy'
Non-rel doc 2 (score=0): 'apple is red'
...
The training instances are:
Apple \t iPhone X apple com \t apple juice is healthy \t 2
Apple \t iPhone X apple com \t apple is red \t 2
Apple \t apple inc - wikipedia \t apple juice is healthy \t 1
Apple \t apple inc - wikipedia \t apple is red \t 1
Then we map the words to word ids.
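A minimal sketch of that mapping, plus a helper that writes one tab-separated training line. The id conventions (e.g. reserving 0 for unseen words) and the comma-joined layout are assumptions for illustration; check the repo's README for the format the code actually expects:

```python
def build_vocab(texts):
    """Assign integer ids to words; id 0 is reserved for unseen words (an assumption)."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def to_ids(text, vocab, oov=0):
    """Map a whitespace-tokenized text to a list of word ids."""
    return [vocab.get(w, oov) for w in text.lower().split()]

def instance_line(query, pos_doc, neg_doc, score, vocab):
    """One tab-separated line: query ids \t pos doc ids \t neg doc ids \t score."""
    ids = lambda t: ",".join(str(i) for i in to_ids(t, vocab))
    return "\t".join([ids(query), ids(pos_doc), ids(neg_doc), str(score)])
```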
-- Zhuyun Dai Language Technologies Institute School of Computer Science 5000 Forbes Avenue Pittsburgh, PA 15213
Could you please tell me where I can get the training data? Should I pull from an available dataset, or do I need to run a traditional IR system (BM25) first and then take its results as my training data? Thank you.
Which dataset are you using? Is it your own dataset? I guess you need to run a traditional IR system over it and take the results.
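If you don't have an IR system at hand, a plain Okapi BM25 scorer is small enough to write yourself. A generic sketch, not code from this repo:

```python
import math
from collections import Counter

def bm25_topk(query, docs, k=100, k1=1.2, b=0.75):
    """Rank tokenized docs against a tokenized query with Okapi BM25.

    Returns the indices of the top-k docs, best first.
    """
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    df = Counter()                      # document frequency per term
    for d in docs:
        df.update(set(d))
    scored = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scored.append((score, i))
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```

For 16,000 docs this brute-force loop is fast enough; for larger collections you'd want an inverted index or an off-the-shelf engine.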
I guess it is clear now. I will try it and get back to you. Thanks a lot for your reply. :)
: )
It's also what I thought, thanks for the reply.
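On the evaluation side, which the thread never circles back to: with graded labels like the score=2/1/0 example above, NDCG is a natural metric. A minimal sketch of the standard definition, not code from this repo:

```python
import math

def ndcg_at_k(ranked_labels, k=10):
    """NDCG@k for one query, given the relevance grades of the docs in ranked order."""
    def dcg(labels):
        return sum((2 ** rel - 1) / math.log2(rank + 2)
                   for rank, rel in enumerate(labels[:k]))
    ideal = dcg(sorted(ranked_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal > 0 else 0.0
```

Average this over all test queries; tools like trec_eval compute it (and MAP, which suits binary labels) from standard run and qrel files.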