fani-lab / RePair

Extensible and Configurable Toolkit for Query Refinement Gold Standard Generation Using Transformers

Issues with AOL #11

Closed yogeswarl closed 1 year ago

yogeswarl commented 1 year ago

Dear @hosseinfani, while creating the context-free files for training, I came across an issue that I believe requires your attention.

There are passage IDs in the qrels that have no valid document in our indexed collection. I searched for those passage IDs throughout the document collection: they were never indexed, so they fail to be retrieved when we try to create the doccols (I am using the same code as for MSMARCO), and the lookup throws a 'NoneType' error. As a workaround, I currently use an if statement to return an empty string for those IDs. I would like your suggestion on what we should do here.
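For reference, a minimal sketch of the workaround described above (the `searcher`/`fetch_doc_text` names and the `doc()`/`raw()` calls are hypothetical stand-ins for whatever the indexing backend exposes):

```python
# Hypothetical sketch: guard the document lookup so qrels passage ids
# that were never indexed yield an empty string instead of raising on None.
def fetch_doc_text(searcher, pid):
    doc = searcher.doc(pid)   # may return None for an unindexed passage id
    if doc is None:
        return ""             # workaround: empty text for missing documents
    return doc.raw()
```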

I will push my code by morning, so it would be great if you could have a look. I will be training with this dataset over the weekend.

Thanks

Yogeswar

yogeswarl commented 1 year ago

Error: `numpy.core._exceptions.MemoryError: Unable to allocate 78.9 TiB for an array with shape (21691710094096,) and data type int32`. Currently, I am trying titles only. The text is a big issue, and I see that the paper I read uses titles and URLs.
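As a sanity check on the error message, the requested size follows directly from the array shape (int32 is 4 bytes per element):

```python
# Verify that the reported shape really implies ~78.9 TiB for int32.
shape = 21_691_710_094_096
bytes_needed = shape * 4          # int32 = 4 bytes per element
tib = bytes_needed / 1024**4      # TiB = 2**40 bytes
print(f"{tib:.1f} TiB")           # -> 78.9 TiB
```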

I will post an update once I find a fix for this error.

yogeswarl commented 1 year ago


Update on this issue: I set this up on my laptop with only 16 GB of RAM, and it ran without any problem and produced the output. That makes me wonder what went wrong on the workstation. Reports on this issue are inconsistent: some say it runs on their Mac but not on their Linux machine, and some say the opposite, so it likely depends on the pandas version.

I will compare the pandas versions on my laptop and on our workstation this evening to resolve it.

In the meantime, I will also start the training for these documents.

yogeswarl commented 1 year ago

Update on the issue:

[Screenshot: Screen Shot 2023-01-23 at 8.50.01 PM]

This happens only with the text files. I have been unable to find a solution so far, so I am going to try a different approach. I will update you if it works.

yogeswarl commented 1 year ago

The solution to this is interesting: we have to pass `observed=True` to pandas' `groupby`. The problem is that when grouping on categorical columns, pandas materializes every combination of categories, including the empty (unobserved) ones, and tries to allocate massive arrays for them. There are Stack Overflow and GitHub threads on this issue and its solution; you may look there to explore it further.
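A small sketch of the effect on toy data (the blow-up is tiny here, but on large categorical columns the cartesian product of categories is what exhausts memory):

```python
import pandas as pd

# Two categorical columns where only 2 of the 3*3 category pairs occur.
df = pd.DataFrame({
    "a": pd.Categorical(["x", "y"], categories=["x", "y", "z"]),
    "b": pd.Categorical(["p", "q"], categories=["p", "q", "r"]),
    "v": [1, 2],
})

# observed=False materializes all 9 category combinations;
# observed=True keeps only the 2 pairs actually present in the data.
all_combos = df.groupby(["a", "b"], observed=False).size()
observed_only = df.groupby(["a", "b"], observed=True).size()

print(len(all_combos), len(observed_only))  # -> 9 2
```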

Thanks! I will keep this issue open to report further errors from this dataset.

yogeswarl commented 1 year ago

@hosseinfani, I believe including the text documents in our training is time-consuming and not worth the effort. We will work with titles and URLs instead. I will continue running it for titles and URLs and update you with the results.

yogeswarl commented 1 year ago

@hosseinfani, I have trained on titles, but found that we run out of tensor space because of the huge number of lines in the file. So I will go ahead with the method at code lines 2, 3, and 4.

yogeswarl commented 1 year ago

@hosseinfani, trec_eval is unable to compute the metrics. Please help me with this. For your information, the BM25 run file is 18.5 GB; could its size be the cause?


yogeswarl commented 1 year ago

@hosseinfani, I have updated the code to split the run into multiple BM25 files. I will update you if this solves the issue.

yogeswarl commented 1 year ago

@hosseinfani,

Update: trec_eval has a limit on how many lines it can read; I cannot give you an exact number yet. I will explore this further once I find the root cause and update here. In the meantime, when I split the file into chunks of 10,000,000 lines, trec_eval read and handled them fine. So we will have to do a lot of splitting and merging.
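A minimal sketch of the splitting step (assuming the standard TREC run format `qid Q0 docid rank score tag`; the function name and chunk-naming scheme are hypothetical). The one subtlety is never splitting a single query's results across two chunks, so each chunk can be fed to trec_eval independently:

```python
# Split a huge TREC run file into chunks of at most max_lines lines,
# keeping all result lines of the same query in the same chunk.
def split_run(in_path, out_prefix, max_lines=10_000_000):
    chunk, count, idx = [], 0, 0
    last_qid = None
    out_paths = []

    def flush():
        nonlocal chunk, count, idx
        if chunk:
            path = f"{out_prefix}.{idx}"
            with open(path, "w") as f:
                f.writelines(chunk)
            out_paths.append(path)
            chunk, count = [], 0
            idx += 1

    with open(in_path) as f:
        for line in f:
            qid = line.split(maxsplit=1)[0]
            # Only start a new chunk at a query boundary.
            if count >= max_lines and qid != last_qid:
                flush()
            chunk.append(line)
            count += 1
            last_qid = qid
    flush()
    return out_paths
```

Each chunk can then be evaluated separately and the per-query results merged afterwards.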

yogeswarl commented 1 year ago

A minor update on this issue: in my recent use of trec_eval on Compute Canada, it had no problem loading the results and qrels files as long as there was enough RAM per CPU. I think this issue can be closed, with the conclusion that given enough memory, trec_eval can compute metrics over a results file of any size.

Thanks