comments/questions - GitHub Issues

cmacdonald commented 3 years ago

What's the difference between collection_pred_1 and collection_pred_2? Is this MSMARCO passage vs. document corpora?
Providing a version of each corpus with bert_term_sample_to_json.py already applied would be easier to use.
Keeping files as .tsv.zip isn't as helpful as, for instance, keeping them as .tsv.gz, which can be opened directly as a stream.
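
To illustrate the streaming point, here is a minimal sketch of reading a gzip-compressed TSV line by line with Python's standard gzip module. The helper name iter_tsv_gz and the demo file are assumptions made up for this example; they are not part of the DeepCT repository.

```python
import gzip
import os
import tempfile


def iter_tsv_gz(path):
    """Yield rows (lists of fields) from a gzip-compressed TSV, one line at a time."""
    # "rt" opens the compressed file as a text stream; nothing is unpacked to disk.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n").split("\t")


# Build a tiny made-up file so the sketch runs on its own.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.tsv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write("0\tfirst passage\n1\tsecond passage\n")

rows = list(iter_tsv_gz(demo_path))
```

Because the rows are produced lazily, memory use should stay roughly constant regardless of file size.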

AdeDZY commented 3 years ago

Hi Craig,

"What's the difference between collection_pred_1 and collection_pred_2? Is this MSMARCO passage vs document corpora?" -> Sorry for the confusion. Each of them contains half of the MS MARCO passage collection.

"Providing a version of each corpus with the bert_term_sample_to_json.py already applied would be easier to use." -> Thanks for the suggestion. I didn't include those because they are huge, since terms are repeated. I'll try to find those files and add them to the data folder.

cmacdonald commented 3 years ago

I try to work with gzip files, as they can be read and written in streams (indeed, I patched the bert_term_sample_to_json.py script to write gzip files automatically). Generated using m=100, the output deepctcollection.gz is much smaller than test_results.tsv.zip/gz

$ ls -lh
total 8.3G
-rw-r--r-- 1 craigm csstaff 446M Apr 20 11:53 deepctcollection.gz
-rw-r--r-- 1 craigm csstaff 4.0G Apr 16 22:24 test_results.tsv.gz
-rw-r--r-- 1 craigm csstaff 4.0G Nov 26  2019 test_results.tsv.zip
(pyterrier) [craigm@trhead collection_pred_1]$ less deepctcollection.gz
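
For readers who want a similar setup, one way to write gzip output as a stream might look like the sketch below. It is not the actual patch: the helper open_text_output, the file name, and the sample rows are all hypothetical, and the only library call relied on is gzip.open in text mode.

```python
import gzip
import os
import tempfile


def open_text_output(path):
    """Hypothetical helper: compress on the fly when the path ends with .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "wt", encoding="utf-8")
    return open(path, "w", encoding="utf-8")


out_dir = tempfile.mkdtemp()
gz_path = os.path.join(out_dir, "demo_collection.gz")

# Write made-up "docid<TAB>text" lines; each line is compressed as it is written.
with open_text_output(gz_path) as out:
    for docid, text in [("0", "term term other"), ("1", "word")]:
        out.write(docid + "\t" + text + "\n")

# Read the file back to confirm the round trip.
with gzip.open(gz_path, "rt", encoding="utf-8") as f:
    written = f.read()
```

Choosing the writer by file suffix keeps the rest of a script unchanged, which is why it is a common way to retrofit gzip output.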

I also had to align the docids to account for empty documents, by changing bert_term_sample_to_json.py as follows:

            if not selected_tokens:
                output_file.write(did + '\t' + ' \n') # added by craig
                e += 1
                continue
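
To show why the placeholder line keeps docids aligned, here is a self-contained sketch of the idea. It does not reproduce bert_term_sample_to_json.py: the function write_weighted_docs, its input format, and the rule of repeating each token round(weight * m) times are assumptions chosen for illustration only.

```python
import io


def write_weighted_docs(docs, output_file, m=100):
    """Write one "did<TAB>text" line per document, including empty ones.

    docs: iterable of (did, [(token, weight), ...]) pairs (hypothetical format).
    Returns the number of documents that had no selected tokens.
    """
    empty = 0
    for did, token_weights in docs:
        selected_tokens = []
        for token, weight in token_weights:
            tf = int(round(weight * m))  # assumed scaling rule, for illustration
            if tf > 0:
                selected_tokens.extend([token] * tf)
        if not selected_tokens:
            # Emit a placeholder so the output keeps one line per input docid.
            output_file.write(did + "\t" + " \n")
            empty += 1
            continue
        output_file.write(did + "\t" + " ".join(selected_tokens) + "\n")
    return empty


buf = io.StringIO()
n_empty = write_weighted_docs(
    [("0", [("deep", 0.02), ("ct", 0.01)]), ("1", []), ("2", [("term", 0.03)])],
    buf,
    m=100,
)
out_ids = [line.split("\t")[0] for line in buf.getvalue().splitlines()]
```

Without the placeholder write, document "1" would be skipped and every later line would be off by one relative to the original collection.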

AdeDZY commented 3 years ago

Thanks for providing the numbers! I have updated the data folder with test_results.tsv.gz files.

In addition, I uploaded the bert_term_sample_to_json.py output for MS MARCO to weighted_documents/.

AdeDZY / DeepCT

comments/questions #14