fani-lab / RePair

Extensible and Configurable Toolkit for Query Refinement Gold Standard Generation Using Transformers

Yandex Dataset issues #36

Closed yogeswarl closed 1 year ago

yogeswarl commented 1 year ago

Hello Dr. Fani, I have been doing an exploratory analysis on the Yandex dataset to figure out a way to convert the numeric ids into something that can be fed into T5. Unfortunately, what I have done over the last 2 days suggests otherwise. As we know, T5 is a text-based model that requires queries and documents in text format. From the sample document I posted, there is no way to convert them into queries (text) or documents (text). We also need relevance judgements, which are not available. This dataset seems too old to support our work in any way (at least from what I have seen after extracting all the gzips). Do you have any sort of file structure that contains the queries and documents as text?

Thank you.

hosseinfani commented 1 year ago

@yogeswarl

yogeswarl commented 1 year ago

Sure, I will be in the lab at 7PM to discuss this further. I will also look into the file structure and preprocessing.

yogeswarl commented 1 year ago

Hi Dr. @hosseinfani. Here are my findings about the Yandex dataset. I tried the following:

  1. Looked up the structure of the data and analyzed whether it can be indexed and searched.
  2. Sampled a set of data to index.
  3. Searched using our BM25 (resulting in all 0s).

The problem with Yandex is that it is a learning-to-rank dataset, not a question answering or query reformulation dataset.

This is the paper I read this morning to verify what I was doing wrong after 2 days.

An example illustration:

This was one session from a user:

{"Day": 16,
 "Query": [
    {"Clicks": [{"SERPID": 0, "TimePassed": 23, "URLID": 68077432}, {"SERPID": 0, "TimePassed": 120, "URLID": 55655021}],
    "ListOfTerms": [3770075,3927504],
    "QueryID": 14073881,
    "SERPID": 0,
    "TimePassed": 0,
    "URL_DOMAIN": ["(38151716,3496197)","(55655021,4442030)","(70941346,5266706)","(68077432,5102360)","(32928488,3217249)","(62017857,4772167)","(5932479,777046)","(26162477,2597528)","(65876131,4989065)","(57789119,4560494)"]
    },
    {"Clicks": [],
    "ListOfTerms": [3770075,3927504],
    "QueryID": 14073881,
    "SERPID": 1,
    "TimePassed": 153,
    "URL_DOMAIN": ["(38151716,3496197)","(55655021,4442030)","(70941346,5266706)","(68077432,5102360)","(32928488,3217249)","(62017857,4772167)","(5932479,777046)","(26162477,2597528)","(65876131,4989065)","(57789119,4560494)"]
    },
    {"Clicks": [{"SERPID": 2, "TimePassed": 165, "URLID": 32928488}],
    "ListOfTerms": [3770075,3927504,3613096],
    "QueryID": 14073884,
    "SERPID": 2,
    "TimePassed": 161,
    "URL_DOMAIN": ["(32928488,3217249)","(38151716,3496197)","(70941346,5266706)","(5932479,777046)","(57130447,4519378)","(25741949,2572806)","(15235241,1570979)","(60465030,4687519)","(44999561,3896318)","(18154580,1906283)"]
    },
    {"Clicks": [],
    "ListOfTerms": [3770075,3927504,3190172,3613096],
    "QueryID": 14073882,
    "SERPID": 3,
    "TimePassed": 233,
    "URL_DOMAIN": ["(32928488,3217249)","(25741949,2572806)","(38151716,3496197)","(11149686,1107220)","(15235241,1570979)","(52514257,4295885)","(38164640,3497272)","(58421819,4589126)","(19709191,2026752)","(70631565,5225770)"]
    },
    {"Clicks": [{"SERPID": 4, "TimePassed": 248, "URLID": 25741949},{"SERPID": 4, "TimePassed": 285, "URLID": 38151777}],
    "ListOfTerms": [3770075,3927504,3209993,3613096],
    "QueryID": 14073883,
    "SERPID": 4,
    "TimePassed": 237,
    "URL_DOMAIN": ["(32928488,3217249)","(25741949,2572806)","(38151777,3496197)","(38151716,3496197)","(57789119,4560494)","(70636358,5226202)","(24747577,2473653)","(24961047,2498735)","(27756945,2709046)","(35034598,3300399)"]
    }
  ],
 "SessionID": 3766,
 "USERID": 639
}

Looking at this example, this is how we would need to structure our queries and qrels files:

example queries.tsv:

qid        queries
14073881        "[3770075,3927504]"

Here the queries are the ListOfTerms found in each entry of the Query list.

example qrels.tsv:

qid        uid        did        rel
14073881        639        25741949        1

Here a did is recorded only when there is a clicked document in URL_DOMAIN, i.e., we match the Clicks list against URL_DOMAIN.

URL_DOMAIN is the ranked list of documents already provided by Yandex.
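The mapping above can be sketched in a few lines of Python. This is an illustrative helper (the name `session_to_rows` is mine, not part of the repo) that turns one session, as in the example, into queries.tsv and qrels rows, marking clicked URLIDs as rel=1:

```python
import json

# Hypothetical helper: convert one Yandex session dict into
# queries.tsv rows (qid, query) and qrels rows (qid, uid, did, rel).
def session_to_rows(session):
    uid = session["USERID"]
    query_rows, qrel_rows = [], []
    for q in session["Query"]:
        qid = q["QueryID"]
        query_rows.append((qid, json.dumps(q["ListOfTerms"])))
        # URLIDs shown on this SERP: first element of each "(URLID,DOMAINID)" pair
        shown = {int(pair.strip("()").split(",")[0]) for pair in q["URL_DOMAIN"]}
        for click in q["Clicks"]:
            if click["URLID"] in shown:  # match Clicks against URL_DOMAIN
                qrel_rows.append((qid, uid, click["URLID"], 1))
    return query_rows, qrel_rows
```

The click/URL_DOMAIN matching mirrors the "Clicks dictionary with URL_DOMAIN" join described above; the output columns match the two example files.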

From the paper I read, the challenge requires reordering URL_DOMAIN via pointwise or feature-based methods (e.g., adding per-user weights) so that, on the test set, the returned ranked list contains the clicked URLs among the top 10 items.

This is what the Yandex Personalized web challenge is all about.

You can also look at one of the submissions I found on GitHub; they use simple Python to solve this challenge.

Our scenario

For our RePair pipeline, this is what we need:

  1. Index a collection (msmarco), or preprocess any given text collection with its respective relevance judgements and queries to create one (aol), using Lucene and faiss indexes.
  2. Create a docs.query pairing where a collection of documents is paired to a single (user) query.
  3. Feed this to T5 for training, then infer new queries for the collection.
  4. Use the index to search the collection for ranked items.
  5. Use trec_eval to evaluate the predicted query against the original query.
  6. Aggregate all predicted queries that meet a criterion against the original query.
  7. Create datasets from these aggregated queries.
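The join at the heart of step 2 can be sketched in plain Python (the helper name `docs_query_pairs` is hypothetical; the real pipeline does this over Lucene/faiss-backed collections):

```python
# Hypothetical sketch of step 2: build the docs.query pairing by an
# inner join of qrels rows (qid, did, rel) with the queries on qid.
def docs_query_pairs(queries, qrels):
    """queries: {qid: query text}; qrels: iterable of (qid, did, rel)."""
    pairs = []
    for qid, did, rel in qrels:
        if rel > 0 and qid in queries:  # keep only judged-relevant docs
            pairs.append((did, queries[qid]))
    return pairs
```

Each resulting (did, query) pair is one training example for T5 in step 3.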

Now, in the instance of Yandex, let me map out how multiple stages of our pipeline would suffer:

  1. This stage can be created with the example posted above. Our index would look like
    {id: 25741949, contents: [25741949, 2572806]}

    The contents would be the URL_DOMAIN pair whose URLID matches a clicked URL.

  2. The docs.query pairs can also be made from this collection, where every qid and did are treated as strings and inner-joined.
    docs        query 
    "[25741949, 2572806]"         "[3770075,3927504]"
  3. Feeding this to T5 for training and inference would work the same.
  4. Now, from our created index, we use a searcher, and here comes the broken part: BM25 checks for query terms among the document terms. For a query like "will it rain today?" and documents such as "today market price", "today weather", "today morning weather", and "today rain forecast", there are shared terms to score. In our scenario there is no such match: queries and documents are disjoint sets of numeric ids, so BM25 scores everything zero, and T5 would be trained to generate queries from purely numeric documents.
  5. When using trec_eval, the ranking needs to match at least one document, which in this case it cannot.
  6. Steps 6 and 7 will fail, since steps 4 and 5 are the pillars on which they are built.
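The mismatch in step 4 can be reproduced with a toy term-overlap scorer. This is not our actual Lucene/BM25 setup, only an illustration of the signal BM25 depends on:

```python
# Toy lexical scorer: count query terms that also occur in the document.
# BM25 relies on exactly this term-overlap signal (plus weighting), so
# zero overlap means a zero score no matter how terms are weighted.
def overlap(query_terms, doc_terms):
    return len(set(query_terms) & set(doc_terms))

# Text collections: "will it rain today" shares terms with a document.
text_hits = overlap("will it rain today".split(),
                    "today rain forecast".split())

# Yandex: query term ids and document url ids live in disjoint id
# spaces, so no stringified query term ever appears inside a document.
yandex_hits = overlap(["3770075", "3927504"],
                      ["25741949", "2572806"])
```

`text_hits` is positive (shared terms "today" and "rain") while `yandex_hits` is zero, which is exactly why every BM25 score in the screenshots below comes back as zero.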

Here is a sample setup of the code: I indexed the collection and tested various queries in various orders.

[Screenshot 2023-08-17 at 10 17 44 PM]

Here is the output, which returns zero when ranking with BM25, since a document's unique id is in no way related to the query ids. Note: all documents and queries were converted to strings as discussed.

[Screenshot 2023-08-17 at 10 17 31 PM]

After thorough testing on Yandex, I have come to the conclusion that it is not possible to run our model and pipeline on this dataset.

If you have any suggestions to tackle this problem, please let me know.

hosseinfani commented 1 year ago

Hi @yogeswarl. My understanding of your report is that there is no overlap between the query tokens and the domain-url tokens, and since we have no access to the actual webpages, an IR method based on token matching falls short of bringing up any relevant domain-url. Is that right?

If so, I agree with you. We cannot use Yandex then.

How about this dataset: http://www.ifs.tuwien.ac.at/~clef-ip/download/2011/index.shtml

yogeswarl commented 1 year ago

Sure, I will have a look into this and update you with my findings by tomorrow. Just to make sure, this is the correct dataset we want to download, right? https://researchdata.tuwien.at/records/a2svx-p1y38

hosseinfani commented 1 year ago

@yogeswarl

Yes. Foremost, check whether it has user info. I don't think it does.

yogeswarl commented 1 year ago

@hosseinfani From what I see, there is no user information. It is all a collection of patents.

hosseinfani commented 1 year ago

@yogeswarl For now, please focus on your ecir24 paper. Create a new gdoc and share it with me; copy from our sigir-ap draft. Try to add more experiments, fix the issues, etc.

Later, we'll think about a new dataset.

hosseinfani commented 1 year ago

@yogeswarl I got an idea. Replace the domain-url ids with the token ids of the query, ONLY for the clicked ones. This way, the word mismatch won't happen and BM25 should return non-zero!
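A minimal sketch of this idea, assuming the session layout from the example earlier in the thread (the helper name `build_click_docs` is hypothetical): every clicked URLID is indexed with the clicking query's term ids as its contents, so the query and its clicked document now share tokens and BM25 can score a match.

```python
# Sketch of the proposed fix: replace a clicked document's "contents"
# with the token ids of the query that led to the click. The query and
# its clicked document then share tokens, so BM25 can return non-zero.
def build_click_docs(session):
    docs = {}
    for q in session["Query"]:
        terms = " ".join(str(t) for t in q["ListOfTerms"])
        for click in q["Clicks"]:
            docs[str(click["URLID"])] = terms  # only clicked urls indexed
    return docs
```

Unclicked results are simply left out of the index, which matches the "ONLY for clicked ones" restriction above.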

Try it and let me know

yogeswarl commented 1 year ago

Okay, Dr. @hosseinfani, I will try that.

yogeswarl commented 1 year ago

Hello @hosseinfani, since we cannot pursue this dataset anymore, I am closing this issue. I will open one for the Yahoo dataset and give you an update on the parser by the end of this week.