Open cmacdonald opened 3 years ago
Hi Craig,
"What's the difference between collection_pred_1 and collection_pred_2? Is this MSMARCO passage vs document corpora?" -> Sorry for the confusion. Each of them contains half of the MSMARCO passage collection.
"Providing a version of each corpus with the bert_term_sample_to_json.py already applied would be easier to use." -> Thanks for the suggestion. I didn't include those because they are huge, since terms are repeated. I'll try to find those files and add them to the data folder.
On Fri, Apr 16, 2021 at 11:19 PM Craig Macdonald @.***> wrote:
- What's the difference between collection_pred_1 and collection_pred_2? Is this MSMARCO passage vs document corpora?
- Providing a version of each corpus with the bert_term_sample_to_json.py already applied would be easier to use.
- Keeping files as .tsv.zip isn't as helpful as, for instance, keeping them as .tsv.gz, which can be opened directly as a stream.
View it on GitHub: https://github.com/AdeDZY/DeepCT/issues/14
I try to work with gzip files, as they can be read and written as streams (indeed, I patched the bert_term_sample_to_json.py script to write gzip files automatically). Generated using m=100, the output deepctcollection.gz is much smaller than test_results.tsv.zip/gz:
$ ls -lh
total 8.3G
-rw-r--r-- 1 craigm csstaff 446M Apr 20 11:53 deepctcollection.gz
-rw-r--r-- 1 craigm csstaff 4.0G Apr 16 22:24 test_results.tsv.gz
-rw-r--r-- 1 craigm csstaff 4.0G Nov 26 2019 test_results.tsv.zip
(pyterrier) [craigm@trhead collection_pred_1]$ less deepctcollection.gz
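For anyone wondering what "read and written as streams" means in practice, here is a rough sketch using Python's standard gzip module. The function names, the file name, and the two-column docid/text layout are only assumptions for illustration, not the actual format of the files in this repository.

```python
import gzip
import os
import tempfile


def write_tsv_gz(path, rows):
    """Stream (docid, text) rows into a gzip-compressed TSV, one line at a time."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for docid, text in rows:
            out.write(f"{docid}\t{text}\n")


def read_tsv_gz(path):
    """Yield (docid, text) pairs lazily; the file is decompressed as it is read."""
    with gzip.open(path, "rt", encoding="utf-8") as inp:
        for line in inp:
            docid, text = line.rstrip("\n").split("\t", 1)
            yield docid, text


if __name__ == "__main__":
    # Hypothetical file name, written to a temporary directory.
    path = os.path.join(tempfile.mkdtemp(), "example.tsv.gz")
    write_tsv_gz(path, [("0", "hello world"), ("1", "deep ct")])
    for docid, text in read_tsv_gz(path):
        print(docid, text)
```

Neither function needs the whole file in memory or an unpacked copy on disk, which is the main convenience over a .zip archive that usually has to be opened as an archive first.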
I also had to align the docids to account for empty documents, by changing bert_term_sample_to_json.py as follows:
if not selected_tokens:
    output_file.write(did + '\t' + ' \n')  # added by craig
    e += 1
    continue
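To show why that extra write matters, here is a self-contained sketch of the idea. It is not the real bert_term_sample_to_json.py: the function name, the (did, selected_tokens) input shape, and the tab-separated output are assumptions made for illustration. The point is only that a document with no selected tokens still gets a placeholder line, so line N of the output keeps corresponding to docid N.

```python
import io


def write_weighted_docs(docs, output_file):
    """Write one line per document, including documents with no selected tokens.

    docs: iterable of (did, selected_tokens) pairs (hypothetical input shape).
    Returns the number of empty documents encountered.
    """
    n_empty = 0
    for did, selected_tokens in docs:
        if not selected_tokens:
            # Placeholder line keeps the docids aligned with line positions.
            output_file.write(did + "\t" + " \n")
            n_empty += 1
            continue
        output_file.write(did + "\t" + " ".join(selected_tokens) + "\n")
    return n_empty


if __name__ == "__main__":
    buf = io.StringIO()
    docs = [("0", ["term", "term"]), ("1", []), ("2", ["other"])]
    empty = write_weighted_docs(docs, buf)
    print(empty)
    print(buf.getvalue(), end="")
```

Without the placeholder branch, docid 1 would be skipped and every later document would shift up by one line.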
Thanks for providing the numbers! I have updated the data folder with test_results.tsv.gz files.
In addition, I uploaded the bert_term_sample_to_json.py output for MS MARCO to weighted_documents/.