AdeDZY / DeepCT

DeepCT and HDCT use BERT to generate novel, context-aware bag-of-words term weights for documents and queries.
BSD 3-Clause "New" or "Revised" License
312 stars, 46 forks

comments/questions #14

Open cmacdonald opened 3 years ago

cmacdonald commented 3 years ago
AdeDZY commented 3 years ago

Hi Craig,

"What's the difference between collection_pred_1 and collection_pred_2? Is this MSMARCO passage vs document corpora?" -> Sorry for the confusion, they each contain half of the MSMARCO passage collection.

"Providing a version of each corpus with the bert_term_sample_to_json.py already applied would be easier to use." -> Thanks for the suggestion. I didn't include those because they are huge, since terms are repeated. I'll try to find those files and add them to the data folder.

On Fri, Apr 16, 2021 at 11:19 PM Craig Macdonald @.***> wrote:

  • What's the difference between collection_pred_1 and collection_pred_2? Is this MSMARCO passage vs document corpora?
  • Providing a version of each corpus with the bert_term_sample_to_json.py already applied would be easier to use.
  • Keeping files as .tsv.zip isn't as helpful as, for instance, keeping them as .tsv.gz, which can be opened directly as a stream.


cmacdonald commented 3 years ago

I try to work with gzip files, as they can be read and written as streams (indeed, I patched the bert_term_sample_to_json.py script to write gzip files automatically). Generated using m=100, the output deepctcollection.gz is much smaller than test_results.tsv.zip or test_results.tsv.gz:

$ ls -lh
total 8.3G
-rw-r--r-- 1 craigm csstaff 446M Apr 20 11:53 deepctcollection.gz
-rw-r--r-- 1 craigm csstaff 4.0G Apr 16 22:24 test_results.tsv.gz
-rw-r--r-- 1 craigm csstaff 4.0G Nov 26  2019 test_results.tsv.zip
(pyterrier) [craigm@trhead collection_pred_1]$ less deepctcollection.gz
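
The point about gzip files being readable and writable as streams can be sketched in Python with the standard-library gzip module. This is only an illustration, not code from the repository: the helper names write_tsv_gz and iter_tsv_gz and the file name example.tsv.gz are hypothetical, and the two-column docid/text layout is assumed from the snippet in this thread.

```python
import gzip
import os
import tempfile


def write_tsv_gz(path, rows):
    """Write (docid, text) rows to a gzip-compressed TSV, one row at a time."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for did, text in rows:
            out.write(f"{did}\t{text}\n")


def iter_tsv_gz(path):
    """Yield (docid, text) pairs lazily; the file is decompressed as it is read."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            # Strip only the newline so a single-space placeholder text survives.
            did, _, text = line.rstrip("\n").partition("\t")
            yield did, text


# Round trip on a throwaway file (hypothetical name).
tmp_path = os.path.join(tempfile.mkdtemp(), "example.tsv.gz")
write_tsv_gz(tmp_path, [("0", "term term other"), ("1", " ")])
rows = list(iter_tsv_gz(tmp_path))
```

Because both helpers work line by line, neither needs the whole uncompressed collection in memory or on disk, which is the practical advantage over a .zip archive that has to be extracted first.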

I also had to align the docids to account for empty documents, by changing bert_term_sample_to_json.py as follows:

            if not selected_tokens:
                output_file.write(did + '\t' + ' \n') # added by craig
                e += 1
                continue
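
The effect of that change can be sketched in a self-contained form. This is a rough illustration under stated assumptions, not the actual bert_term_sample_to_json.py: the function write_docs, its (docid, {term: weight}) input shape, and the rule that a term is repeated round(weight * m) times are all hypothetical. Only the placeholder line for empty documents mirrors the snippet above.

```python
import io


def write_docs(docs, output_file, m=100):
    """Write one line per document so that line positions stay aligned with docids.

    docs: iterable of (did, {term: weight}) pairs (hypothetical input shape).
    Returns the number of documents that ended up with no tokens.
    """
    n_empty = 0
    for did, term_weights in docs:
        selected_tokens = []
        for term, weight in term_weights.items():
            # Assumed scaling: repeat each term round(weight * m) times.
            tf = int(round(weight * m))
            selected_tokens.extend([term] * tf)
        if not selected_tokens:
            # Emit a placeholder line instead of skipping the document.
            output_file.write(did + "\t" + " \n")
            n_empty += 1
            continue
        output_file.write(did + "\t" + " ".join(selected_tokens) + "\n")
    return n_empty


buf = io.StringIO()
docs = [("0", {"bert": 0.02}), ("1", {"bert": 0.001}), ("2", {})]
empty = write_docs(docs, buf)
lines = buf.getvalue().splitlines()
```

Without the placeholder write, documents whose terms all round down to zero would be missing from the output, and any later step that maps line position to docid would be off by the number of skipped documents.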

AdeDZY commented 3 years ago

Thanks for providing the numbers! I have updated the data folder with test_results.tsv.gz files.

In addition, I uploaded the bert_term_sample_to_json.py output for MS MARCO to weighted_documents/.