AdeDZY / K-NRM

K-NRM: End-to-End Neural Ad-hoc Ranking with Kernel Pooling
BSD 3-Clause "New" or "Revised" License

Dataset Preparation #15

Open alishiba14 opened 5 years ago

alishiba14 commented 5 years ago

Hi, I have a dataset of 16,000 docs and some queries; for each query there can be more than one relevant document. Can you tell me how to prepare my data, and also how to evaluate?

AdeDZY commented 5 years ago

Hi,

Are your labels binary (relevant / non-relevant)?

If so, use a baseline ranker, e.g. BM25, to retrieve the top 100 documents for each query. A training instance is then (query, a relevant doc, a non-relevant doc from the top 100).

You may want to randomly sample the non-relevant documents from the top 100 instead of using all of them.
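The retrieve-then-sample-negatives step can be sketched in plain Python. This is a toy BM25 with whitespace tokenization for illustration only (a real pipeline would use a proper engine such as Lucene, Indri, or Anserini); the corpus, query, and `relevant_ids` below are made-up examples:

```python
import math
import random
from collections import Counter

def bm25_rank(query, docs, k1=1.2, b=0.75, top_k=100):
    """Rank docs against a query with plain BM25; return (doc_index, score)
    pairs, best first. Toy whitespace tokenization, lowercased."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    df = Counter(w for d in tokenized for w in set(d))  # document frequency
    scores = []
    for i, d in enumerate(tokenized):
        tf = Counter(d)
        s = 0.0
        for w in query.lower().split():
            if tf[w] == 0:
                continue
            idf = math.log(1 + (n - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append((i, s))
    return sorted(scores, key=lambda x: -x[1])[:top_k]

# Toy corpus; relevant_ids marks which docs are labeled relevant to the query.
docs = ["apple pie recipe", "banana bread", "apple iphone release news"]
relevant_ids = {2}

top = bm25_rank("apple iphone", docs, top_k=100)
# Negatives: randomly sample from the retrieved-but-not-relevant docs.
candidates = [i for i, s in top if i not in relevant_ids and s > 0]
negatives = random.sample(candidates, k=min(2, len(candidates)))
```

Each sampled negative is then paired with each labeled relevant doc to form a training instance.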


alishiba14 commented 5 years ago

Let's say for a query 'Apple' the relevant docs are 100, 120, 400 and all the rest are non-relevant. Then is 'Apple' \t 100,120,400 \t remaining docs \t score the correct representation of a training instance?

giangnguyen2412 commented 5 years ago

@alishiba14 I think it's correct, but how can we represent the remaining docs? And the score, I think we can get it from BM25, right?

AdeDZY commented 5 years ago

Let's say the query is 'Apple', and its relevant documents are:

Very relevant doc (score=2): 'iPhone X - apple.com'
Somewhat relevant doc (score=1): 'apple inc - wikipedia'

and there are 10 other non-relevant docs retrieved by BM25:

Non-rel doc1 (score=0): 'apple juice is healthy'
Non-rel doc2 (score=0): 'apple is red'
...

The training instances are:

Apple \t iPhone X apple com \t apple juice is healthy \t 2
Apple \t iPhone X apple com \t apple is red \t 2
Apple \t apple inc - wikipedia \t apple juice is healthy \t 1
Apple \t apple inc - wikipedia \t apple is red \t 1

Then we map the words to word ids.
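The pairing and word-to-id mapping above can be sketched in plain Python. The `to_ids` helper and the on-the-fly vocabulary are illustrative assumptions, not the repo's actual preprocessing (which uses a fixed vocabulary file):

```python
import itertools

# Example data from the thread: relevance-labeled docs plus BM25 negatives.
query = "Apple"
relevant = [("iPhone X apple com", 2), ("apple inc - wikipedia", 1)]
negatives = ["apple juice is healthy", "apple is red"]

# Toy vocabulary built on the fly; id 0 is reserved for padding/OOV.
vocab = {}
def to_ids(text):
    """Map each whitespace token to an integer id, assigning new ids as seen."""
    return " ".join(str(vocab.setdefault(w, len(vocab) + 1)) for w in text.lower().split())

# One instance per (relevant doc, non-relevant doc) pair, labeled with the
# relevant doc's graded score, in the tab-separated layout shown above.
lines = []
for (pos_doc, score), neg_doc in itertools.product(relevant, negatives):
    lines.append("\t".join([to_ids(query), to_ids(pos_doc), to_ids(neg_doc), str(score)]))

print("\n".join(lines))
```

This yields four instances: two with score 2 (from the very relevant doc) and two with score 1.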


-- Zhuyun Dai Language Technologies Institute School of Computer Science 5000 Forbes Avenue Pittsburgh, PA 15213

giangnguyen2412 commented 5 years ago

Could you please tell me where I can get the training data? Should I pull from an existing dataset, or do I need to run my corpus through a traditional IR system (BM25) and take its results as my training data? Thank you.

AdeDZY commented 5 years ago

Which dataset are you using? Is it your own dataset? I guess you need to run it through a traditional IR system and take the results.



alishiba14 commented 5 years ago

I guess it is clear now. I will try it and get back to you. Thanks a lot for your reply. :)

AdeDZY commented 5 years ago

: )


giangnguyen2412 commented 5 years ago

It's also what I thought, thanks for the reply.