Anchor Text for msmarco-document and msmarco-document-v2

mam10eks commented 2 years ago

Dataset Information:

We have extracted anchor text pointing to documents in MS MARCO (version 1 and version 2) from several Common Crawl snapshots. The anchor text can be used as additional retrieval features or for training models (e.g., in a distant supervision style like DeepCT).

Links to Resources:

Dataset ID(s) & supported entities:

Dataset ID: msmarco-document/anchor-text and msmarco-document-v2/anchor-text

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

[x] Dataset definition (in ir_datasets/datasets/[topid].py)
[x] Tests (in tests/integration/[topid].py)
[x] Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
[x] Documentation (in ir_datasets/etc/[topid].yaml)
- [x] Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
[x] Downloadable content (in ir_datasets/etc/downloads.json)
- [x] Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
- [x] Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

I would be happy to help integrate the anchor texts into ir_datasets. I am not sure what a good Dataset ID would be: it could make sense to integrate it as a subset of the existing msmarco-document and msmarco-document-v2 IDs, but it might also make sense to give it independent IDs.

mam10eks commented 2 years ago

It looks like I have the basic integration working. In the current version, the anchor texts pointing to a document are concatenated to produce the text of a GenericDoc. I think it might also be helpful to build a second representation where a document has a list of anchors (i.e., not concatenating them) because this might be helpful for training models like DeepCT.

seanmacavaney commented 2 years ago

Thanks for contributing @mam10eks!

There's no strong rule about where datasets belong in the hierarchy. But I think I lean towards putting them under msmarco-document/anchor-text and msmarco-document-v2/anchor-text because it feels natural there. There's precedent for something similar to this, too, e.g., cord19 includes titles and abstracts for documents, while cord19/fulltext provides the article full text content (which is auxiliary and comes from other files).

I agree that both formats would be useful. This comes down to two central use cases identified in #72 -- cases where the user just wants the text as unstructured as possible (e.g., easy for indexing, re-ranking, etc.) and those where they want all possible information the dataset exposes (e.g., for your case about a particular way to train DeepCT). We have a plan to address this, but in the meantime, the general approach we've been taking is to provide both. So the doc object could provide: doc_id, text (as str, the concat'd version of the anchors), and anchors (as a List[str], individually), + any other fields your dataset provides.
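
As a rough illustration of the doc object described above, here is a minimal sketch. The class name `AnchorTextDoc`, the helper `build_doc`, and the single-space separator are assumptions made for this example and are not taken from the ir_datasets code; only the three field names come from the discussion.

```python
from typing import List, NamedTuple


class AnchorTextDoc(NamedTuple):
    """Hypothetical doc type with the three fields named in the thread."""
    doc_id: str
    text: str           # all anchors concatenated into one string
    anchors: List[str]  # the individual anchor texts, kept separate


def build_doc(doc_id: str, anchors: List[str]) -> AnchorTextDoc:
    """Build a doc that exposes both the concatenated and the list form."""
    # Joining with a single space is an assumption; the thread does not
    # say which separator the real implementation uses.
    return AnchorTextDoc(doc_id=doc_id, text=" ".join(anchors), anchors=list(anchors))


# Example values are made up for illustration.
doc = build_doc("D1", ["ms marco", "passage ranking dataset"])
```

With this shape, indexing or re-ranking code can read `doc.text` directly, while training code that needs each anchor separately can iterate over `doc.anchors`.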

Let me know if you have any other questions or need help adding it.

mam10eks commented 2 years ago

Thanks @seanmacavaney for the feedback!

I have changed the implementation accordingly so that the anchor-text-documents now provide the doc_id, text, and anchors.

The main parts are done, but two things are still missing:

- I have not generated the Metadata because it looks like this requires downloading all other datasets as well.
- I was able to generate the Documentation, but I had to make a small change to the associated script, since it otherwise failed for some Optional datatypes from other datasets. I would rather not push that change since it is only a workaround. But I checked that the generated documentation for the anchor-text looks as expected.

Would it be ok if we merge the current state and you help with the generation of the metadata and documentation?

seanmacavaney commented 2 years ago

Awesome, thanks 🤘! This looks great. I opened a PR for this, and I'll take care of the metadata and documentation.

mam10eks commented 2 years ago

Nice, thanks! Please let me know if I can help further (I already saw that the automated checks failed, but this seems to be caused by the missing metadata).

seanmacavaney commented 2 years ago

@mam10eks -- can you accept the PR here with the metadata when you get a chance? https://github.com/mam10eks/ir_datasets/pull/1

seanmacavaney commented 2 years ago

Excellent, thanks again @mam10eks!

allenai / ir_datasets

Anchor Text for msmarco-document and msmarco-document-v2 #154