allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

cw12 and cw12/b13 verification and improved instructions #120

Open seanmacavaney opened 2 years ago

seanmacavaney commented 2 years ago

As reported by @searchivarius

Describe the bug

Right now, a user can end up with a faulty b13 subset if they only have the b13 disk and follow the instructions as presented.

Affected dataset(s)

clueweb12 and clueweb12/b13

To Reproduce

Steps to reproduce the behavior:

  1. Request clueweb12/b13 -- you are instructed to link the corpus to ~/.ir_datasets/clueweb12/corpus/
  2. Link the B13 corpus to ~/.ir_datasets/clueweb12/corpus/
  3. Request clueweb12/b13 -- you are instructed to run CMU's CreateB13 jar to extract the B13 subset or link the b13 subset to ~/.ir_datasets/clueweb12/corpus-b13/
  4. Run CreateB13
  5. You end up with a subset of the B13 subset. E.g., 0000tw-00, which is supposed to have 1760 records (of 24644), ends up with only 125 (of the 1760 it sees).

Expected behavior

You should not be able to easily end up with the faulty subset. If you link B13 to ~/.ir_datasets/clueweb12/corpus/, you should get an error. If you request clueweb12/b13 (without the full dataset available), you should be instructed that your two options are to link the full corpus and run the CreateB13 software, or (if you have the B13 subset already), simply link B13 to the proper location. B13 should be validated as well.
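
A minimal sketch of the kind of check this could use, not the actual ir_datasets implementation. It counts records in ClueWeb12_00/0000tw/0000tw-00.warc.gz and compares against the counts reported by CreateB13 in this issue (24644 for the full corpus, 1760 for B13). The function names are hypothetical, and it assumes that counting `WARC-Type: response` header lines matches how the jar counts records.

```python
import gzip

# Record counts for ClueWeb12_00/0000tw/0000tw-00.warc.gz, taken from the
# CreateB13 output quoted in this issue.
FULL_COUNT = 24644
B13_COUNT = 1760


def count_warc_records(path):
    """Count response records in a gzipped WARC file.

    Assumption: each document record has a header line starting with
    'WARC-Type: response', and this matches the jar's record count.
    """
    count = 0
    with gzip.open(path, "rb") as f:
        for line in f:
            if line.startswith(b"WARC-Type: response"):
                count += 1
    return count


def classify_corpus(record_count):
    """Map a record count for 0000tw-00 to 'full', 'b13', or 'unknown'."""
    if record_count == FULL_COUNT:
        return "full"
    if record_count == B13_COUNT:
        return "b13"
    # Anything else (e.g. the 125 records of the faulty subset) is rejected.
    return "unknown"
```

With something like this, linking B13 to ~/.ir_datasets/clueweb12/corpus/ would classify as "b13" rather than "full" and could raise an error, and a faulty subset in ~/.ir_datasets/clueweb12/corpus-b13/ would classify as "unknown".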

Additional context

Output of running CreateB13 on the full dataset vs on the B13 subset

java -jar CreateClueWeb12B13Dataset.jar /mnt/hdd1/ClueWeb12/ /mnt/hdd1/tmp/
creating directory: /mnt/hdd1/tmp/ClueWeb12_00
creating directory: /mnt/hdd1/tmp/ClueWeb12_00/0000tw
Page processing on: /mnt/hdd1/ClueWeb12/ClueWeb12_00/0000tw/0000tw-00.warc.gzand saving to: /mnt/hdd1/tmp/ClueWeb12_00/0000tw/0000tw-00.warc.gz
ClueWeb12_B13 Records Created: 1760 from 24644 records.
...

java -jar CreateClueWeb12B13Dataset.jar /mnt/hdd1/ClueWeb12b13/ /mnt/hdd1/tmp2/
creating directory: /mnt/hdd1/tmp2/ClueWeb12_00
creating directory: /mnt/hdd1/tmp2/ClueWeb12_00/0000tw
Page processing on: /mnt/hdd1/ClueWeb12b13/ClueWeb12_00/0000tw/0000tw-00.warc.gzand saving to: /mnt/hdd1/tmp2/ClueWeb12_00/0000tw/0000tw-00.warc.gz
ClueWeb12_B13 Records Created: 125 from 1760 records.
...

seanmacavaney commented 2 years ago

Related: #134