facebookresearch / tart

Code and model release for the paper "Task-aware Retrieval with Instructions" by Asai et al.

tart-dual training data #8

Open jordane95 opened 1 year ago

jordane95 commented 1 year ago

Hi, very interesting work on bringing instructions into retrieval!

I want to use the tart-dual data for replication. After downloading and unzipping it, I found only one file, named minilm_denoised_T0_32_datasets_fixed_instruction_unfollowing_fixed_train.jsonl.

Is this the exact data used to fine-tune the tart-dual model in your paper? I ask because this file contains only 1M lines, yet the paper states that BERRI has 5M instances from 37 datasets.
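
A quick way to double-check the record count is to parse the file line by line rather than only counting newlines, since a line that fails to parse can indicate a record cut off by an incomplete extraction. The sketch below is a minimal example that assumes only that the file is JSON Lines (one JSON object per line); the function name is made up for illustration.

```python
import json


def count_jsonl_records(path):
    """Return (valid, invalid) line counts for a JSON Lines file."""
    valid = 0
    invalid = 0
    # errors="replace" keeps a byte sequence cut mid-character from aborting the scan.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                json.loads(line)
                valid += 1
            except json.JSONDecodeError:
                invalid += 1  # e.g. a truncated final record
    return valid, invalid


# Usage (path is whatever the archive extracted to):
# valid, invalid = count_jsonl_records("<extracted .jsonl file>")
```

If the invalid count is non-zero, the 1M figure may simply reflect a partially extracted file rather than the true size of the data.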

jordane95 commented 1 year ago

By the way, I see some errors in the log while unzipping:

Archive:  tart_dual_train_data.zip
warning [tart_dual_train_data.zip]:  12884901888 extra bytes at beginning or within zipfile
  (attempting to process anyway)
file #1:  bad zipfile offset (local header sig):  12884901888
  (attempting to re-compensate)
  inflating: minilm_denoised_T0_32_datasets_fixed_instruction_unfollowing_fixed_train.jsonl  

  error:  invalid compressed data to inflate

So I don't know whether I downloaded it correctly from Google Drive.
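
One way to tell a damaged download apart from an extractor problem is to run an independent integrity check. The offset in the log, 12884901888, is exactly 3 × 2^32, which may mean the unzip build is mishandling a large (ZIP64) archive rather than the download being corrupt; that is only a guess. The sketch below uses Python's standard zipfile module, which reads ZIP64 archives, to verify each member's CRC. The function name is hypothetical.

```python
import zipfile
import zlib


def first_zip_problem(path):
    """Return None if every member passes its CRC check, else a short description."""
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() reads every member and returns the first name whose check fails.
            bad = zf.testzip()
    except (zipfile.BadZipFile, zlib.error, EOFError) as exc:
        # Raised when the archive structure or the compressed stream itself is unreadable.
        return f"unreadable archive: {exc}"
    return None if bad is None else f"corrupt member: {bad}"


# Usage:
# print(first_zip_problem("tart_dual_train_data.zip"))
```

A result of None suggests the downloaded file is intact and the failure lies with the extraction tool; any other result suggests downloading again.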

jordane95 commented 1 year ago

Another problem: I wanted to regroup the data by source dataset by extracting the instruction from the question field and using this file (berri_instructions.tsv) for the reverse mapping. However, some instructions are not included in the file.

For example, one instruction I encountered was:

Given questions asked in StackExchange, a community-powered Q&A sites, retrieve a duplicated question body asking the same as this question.

But there is no such instruction in berri_instructions.tsv.
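
The reverse mapping described above can be sketched as follows, with unmatched instructions collected so the gaps are easy to list. This rests on assumptions not confirmed here: that each record's question field has the form "instruction [SEP] query", and that an instruction-to-source dictionary has already been built from berri_instructions.tsv (its column layout is not assumed). All function and variable names are hypothetical.

```python
from collections import defaultdict

SEPARATOR = " [SEP] "  # assumption: instruction and query are joined by this token


def split_instruction(question, separator=SEPARATOR):
    """Return (instruction, query); instruction is None if no separator is found."""
    instruction, sep, query = question.partition(separator)
    if not sep:
        return None, question
    return instruction.strip(), query


def group_by_source(records, instruction_to_source):
    """Group records by source dataset and count instructions with no known source."""
    grouped = defaultdict(list)
    unmatched = defaultdict(int)
    for record in records:
        instruction, _ = split_instruction(record["question"])
        source = instruction_to_source.get(instruction)
        if source is None:
            # Covers both unknown instructions and records without a separator (key None).
            unmatched[instruction] += 1
        else:
            grouped[source].append(record)
    return dict(grouped), dict(unmatched)
```

Printing the keys of the unmatched dictionary gives the full list of instructions, such as the StackExchange one quoted above, that would need to be added to the mapping or reported.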

ImmortalCi commented 9 months ago

Hello! I ran into the same unzip error. Have you solved it?

ImmortalCi commented 9 months ago

@AkariAsai Could you help us solve this problem? Thanks a lot!~

jordane95 commented 9 months ago

Yes. I think it was a network issue during the download. I re-downloaded the file and everything works well.

ImmortalCi commented 9 months ago

Thanks! I will try again.

ImmortalCi commented 9 months ago

Hi! I downloaded tart_dual_train_data.zip again, but the same error occurred while unzipping it. Sorry to bother you, but could you send me the ZIP file that you unzipped successfully when you have some spare time?

I would really appreciate it.

Thanks a lot!

omarkhaled99 commented 6 months ago

Hello @ImmortalCi, I came across this issue and managed to unzip the file correctly using the Java jar command, specifically:

jar -xvf <zip file>
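
For readers without a JDK installed, Python's standard zipfile module is another extractor that does not depend on the system unzip binary. The sketch below is a minimal example; whether it succeeds on this particular archive has not been verified, and the function name is hypothetical.

```python
import zipfile


def extract_all(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()


# Usage:
# print(extract_all("tart_dual_train_data.zip", "tart_dual_train_data"))
```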

ImmortalCi commented 6 months ago

Thanks for your reply! I will give it a try.
