@yogeswarl
Sure, I will be in the lab at 7 PM to discuss this further. I will also look into the file structure and preprocessing.
Hi, Dr. @hosseinfani. Here are my findings about the Yandex dataset. I tried the following:
The problem we have with Yandex is that it is a learning-to-rank dataset, not a question answering or query reformulation dataset.
This is the paper I read this morning to verify what I was doing wrong after two days.
{"Day": 16,
"Query": [
{"Clicks": [{"SERPID": 0, "TimePassed": 23, "URLID": 68077432}, {"SERPID": 0, "TimePassed": 120, "URLID": 55655021}],
"ListOfTerms": [3770075,3927504],
"QueryID": 14073881,
"SERPID": 0,
"TimePassed": 0,
"URL_DOMAIN": ["(38151716,3496197)","(55655021,4442030)","(70941346,5266706)","(68077432,5102360)","(32928488,3217249)","(62017857,4772167)","(5932479,777046)","(26162477,2597528)","(65876131,4989065)","(57789119,4560494)"]
},
{"Clicks": [],
"ListOfTerms": [3770075,3927504],
"QueryID": 14073881,
"SERPID": 1,
"TimePassed": 153,
"URL_DOMAIN": ["(38151716,3496197)","(55655021,4442030)","(70941346,5266706)","(68077432,5102360)","(32928488,3217249)","(62017857,4772167)","(5932479,777046)","(26162477,2597528)","(65876131,4989065)","(57789119,4560494)"]
},
{"Clicks": [{"SERPID": 2, "TimePassed": 165, "URLID": 32928488}],
"ListOfTerms": [3770075,3927504,3613096],
"QueryID": 14073884,
"SERPID": 2,
"TimePassed": 161,
"URL_DOMAIN": ["(32928488,3217249)","(38151716,3496197)","(70941346,5266706)","(5932479,777046)","(57130447,4519378)","(25741949,2572806)","(15235241,1570979)","(60465030,4687519)","(44999561,3896318)","(18154580,1906283)"]
},
{"Clicks": [],
"ListOfTerms": [3770075,3927504,3190172,3613096],
"QueryID": 14073882,
"SERPID": 3,
"TimePassed": 233,
"URL_DOMAIN": ["(32928488,3217249)","(25741949,2572806)","(38151716,3496197)","(11149686,1107220)","(15235241,1570979)","(52514257,4295885)","(38164640,3497272)","(58421819,4589126)","(19709191,2026752)","(70631565,5225770)"]
},
{"Clicks": [{"SERPID": 4, "TimePassed": 248, "URLID": 25741949},{"SERPID": 4, "TimePassed": 285, "URLID": 38151777}],
"ListOfTerms": [3770075,3927504,3209993,3613096],
"QueryID": 14073883,
"SERPID": 4,
"TimePassed": 237,
"URL_DOMAIN": ["(32928488,3217249)","(25741949,2572806)","(38151777,3496197)","(38151716,3496197)","(57789119,4560494)","(70636358,5226202)","(24747577,2473653)","(24961047,2498735)","(27756945,2709046)","(35034598,3300399)"]
}
],
"SessionID": 3766,
"USERID": 639
}
If you look at this example, this is how we would need to structure our queries and qrels files:
qid queries
14073881 "[3770075,3927504]"
Here the queries come from the ListOfTerms found in each entry of the Query list.
qid uid did rel
14073881 639 25741949 1
Here the did is a clicked document from URL_DOMAIN, obtained by matching the Clicks dictionary against URL_DOMAIN. URL_DOMAIN is the ranked list of documents already provided by Yandex themselves (see the extraction sketch below).
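For concreteness, here is a minimal sketch of that extraction, assuming one session JSON object per line shaped like the sample above; the file names and the exact output layout are my assumptions:

import json

# 'yandex_sessions.jsonl' is a hypothetical file holding one session object per line.
with open('yandex_sessions.jsonl') as sessions, \
     open('queries.tsv', 'w') as queries, \
     open('qrels.tsv', 'w') as qrels:
    for line in sessions:
        session = json.loads(line)
        uid = session['USERID']
        for q in session['Query']:
            qid = q['QueryID']
            # queries file: qid -> stringified ListOfTerms
            queries.write(f"{qid}\t{q['ListOfTerms']}\n")
            # qrels file: a clicked URLID that also appears in URL_DOMAIN gets rel=1
            shown = {int(pair.strip('()').split(',')[0]) for pair in q['URL_DOMAIN']}
            for click in q['Clicks']:
                if click['URLID'] in shown:
                    qrels.write(f"{qid}\t{uid}\t{click['URLID']}\t1\n")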
From the paper I read, the requirement of the challenge is to reorder URL_DOMAIN, using pointwise or feature-based methods (e.g., adding weights per user id), so that for each test query the returned ranked list contains the clicked URLs within the top 10 items.
This is what the Yandex Personalized Web Search Challenge is all about.
You can also look at one of the submissions I found on GitHub; they solve this challenge with plain Python.
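For reference, the kind of baseline those submissions implement is roughly the following; this is a hedged sketch, not any actual submission, and the per-user click counts are assumed to be gathered from the training days:

from collections import Counter

def rerank(url_domain, user_click_counts):
    """Reorder Yandex's URL_DOMAIN list so URLs this user clicked before rise to the top.
    Ties keep Yandex's original order because sorted() is stable (pointwise, per-user feature)."""
    return sorted(url_domain, key=lambda pair: user_click_counts[pair[0]], reverse=True)

# Hypothetical click history: user 639 clicked 25741949 three times in training sessions.
history = Counter({25741949: 3, 32928488: 1})
print(rerank([(32928488, 3217249), (25741949, 2572806), (38151716, 3496197)], history))
# -> [(25741949, 2572806), (32928488, 3217249), (38151716, 3496197)]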
For our RePair pipeline, this is what we need: a collection (msmarco), or we preprocess a given text collection together with its relevance judgements and queries to create one (aol), and then build Lucene and faiss indexes. Now, in the case of Yandex, I will map out how the different stages of our pipeline would suffer.
{id: 25741949, contents: [25741949, 2572806]}
The contents would be the URL_DOMAIN pair that matches the clicked URL.
docs query
"[25741949, 2572806]" "[3770075,3927504]"
Here is a sample screenshot of the setup: I indexed the collection and tested various queries in various orders. The output image shows that ranking with BM25 returns zero results, since a document's unique ID is in no way related to the query's term IDs. Note: all documents and queries were converted into strings, as discussed.
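Since the screenshot is not attached here, this is roughly what that test looked like (a sketch against the index above; the point is that the query term ids share no tokens with the document ids, so BM25 matches nothing):

from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher('indexes/yandex')      # placeholder index path
hits = searcher.search('3770075 3927504', k=10)  # query = stringified ListOfTerms
print(len(hits))                                 # 0: no token overlap with '25741949 2572806'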
After thorough testing on Yandex, I have come to the conclusion that it is not possible for us to run our model and our pipeline on this dataset.
If you have any suggestions to tackle this problem, please let me know.
Hi @yogeswarl, my understanding of your report is that there is no overlap between the query tokens and the domain-url tokens, and since we have no access to the actual webpages, an IR method based on token matching falls short of bringing up any relevant domain-url. Is that right?
If so, I agree with you. We cannot use Yandex then.
How about this dataset: http://www.ifs.tuwien.ac.at/~clef-ip/download/2011/index.shtml
Sure, I will have a look into this and update you with my findings by tomorrow. Just to make sure, this is the correct dataset we want to download, right? https://researchdata.tuwien.at/records/a2svx-p1y38
@yogeswarl
Yes. First of all, check whether it has user info. I don't think it does.
@hosseinfani From what I see, there is no user information. It is all a collection of patents.
@yogeswarl For now, please focus on your ecir24 paper then. Create a new gdoc and share it with me, copy from our sigir-ap draft, try to add more experiments, fix the issues, etc.
Later, we'll think about a new dataset.
@yogeswarl I got an idea. Replace the domain-url ids with the token ids of the query, ONLY for the clicked ones. This way, the word mismatch won't happen and BM25 should return non-zero results!
Try it and let me know.
Okay, Dr. @hosseinfani. I will try that.
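Roughly, I understand the suggestion as follows (a sketch using the field names from the sample session; the output format is assumed to be the same {id, contents} JSONL as before):

def docs_for(query):
    """Build documents for one Query entry: clicked entries get the query's term ids
    as contents (so BM25 has token overlap), unclicked ones keep their own ids."""
    terms = ' '.join(str(t) for t in query['ListOfTerms'])
    clicked = {c['URLID'] for c in query['Clicks']}
    docs = []
    for pair in query['URL_DOMAIN']:
        urlid, domainid = (int(x) for x in pair.strip('()').split(','))
        contents = terms if urlid in clicked else f'{urlid} {domainid}'
        docs.append({'id': str(urlid), 'contents': contents})
    return docs

With this, the earlier BM25 test should return the clicked document for its own query.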
Hello @hosseinfani, since we cannot add this dataset anymore, I am closing this issue. I will open one for the Yahoo dataset and give you an update on the parser by the end of this week.
Hello Dr. Fani, I have been doing an exploratory analysis on the Yandex dataset to figure out a way to convert the numeric IDs into something that can be fed into T5. Unfortunately, what I have done over the last two days suggests otherwise. As we know, T5 is a text-based model that requires queries and documents in text format. From the sample document I posted, there is no way to convert them into queries (text) or documents (text). We also need relevance judgements, which are not available. This dataset seems to be too old to support our work in any way (at least from what I have seen after extracting all the gzips). Do you happen to have any file structure that contains the queries and documents as text?
Thank you.