allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

Anchor Text for msmarco-document and msmarco-document-v2 #154

Closed mam10eks closed 2 years ago

mam10eks commented 2 years ago

Dataset Information:

We have extracted anchor text pointing to documents in MS MARCO (version 1 and version 2) from several Common Crawl snapshots that can be used as additional retrieval features or for the training of models (e.g., in a distant supervision style like DeepCT).

Links to Resources:

Dataset ID(s) & supported entities:

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

Additional comments/concerns/ideas/etc.

I would be happy to help integrate the anchor texts into ir_datasets. I am not sure what a good Dataset ID would be. It could make sense to integrate it as a subset of the existing msmarco-document and msmarco-document-v2 IDs, but it might also make sense to give it independent IDs.

mam10eks commented 2 years ago

It looks like I have the basic integration working. In the current version, the anchor texts pointing to a document are concatenated to produce the text of a GenericDoc. I think it might also be helpful to build a second representation where a document has a list of anchors (i.e., not concatenating them) because this might be helpful for training models like DeepCT.

seanmacavaney commented 2 years ago

Thanks for contributing @mam10eks!

There's no strong rule about where datasets belong in the hierarchy. But I think I lean towards putting them under msmarco-document/anchor-text and msmarco-document-v2/anchor-text because it feels natural there. There's precedent for something similar to this, too, e.g., cord19 includes titles and abstracts for documents, while cord19/fulltext provides the article full text content (which is auxiliary and comes from other files).

I agree that both formats would be useful. This comes down to two central use cases identified in #72 -- cases where the user just wants the text as unstructured as possible (e.g., easy for indexing, re-ranking, etc.) and those where they want all possible information the dataset exposes (e.g., for your case about a particular way to train DeepCT). We have a plan to address this, but in the meantime, the general approach we've been taking is to provide both. So the doc object could provide: doc_id, text (as str, concat'd version of the anchors), and anchors (as a List[str] individually), + any other fields your dataset provides.
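For illustration only (this is not the code that was merged), a doc record along those lines could be sketched as below. It assumes a NamedTuple-style record; the names `AnchorTextDoc` and `make_doc`, the space separator, and the sample doc ID and anchors are all hypothetical.

```python
from typing import List, NamedTuple


class AnchorTextDoc(NamedTuple):
    """Hypothetical record; the field names follow the comment above."""
    doc_id: str
    text: str            # all anchors concatenated into one string
    anchors: List[str]   # each anchor kept as a separate entry


def make_doc(doc_id: str, anchors: List[str], sep: str = " ") -> AnchorTextDoc:
    """Build both representations from one list of anchor strings.

    The separator is an assumption; the thread does not say how the
    anchors are joined.
    """
    anchors = list(anchors)  # copy so the record does not alias the input
    return AnchorTextDoc(doc_id=doc_id, text=sep.join(anchors), anchors=anchors)


# Made-up doc ID and anchors, purely for demonstration.
doc = make_doc("D1", ["ms marco", "passage ranking data"])
print(doc.text)     # unstructured view, e.g. for indexing or re-ranking
print(doc.anchors)  # structured view, e.g. for per-anchor training signals
```

Keeping both fields on one record lets indexing code read `text` directly, while training code that needs individual anchors reads `anchors` without re-splitting the string.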

Let me know if you have any other questions or need help adding it.

mam10eks commented 2 years ago

Thanks @seanmacavaney for the feedback!

I have changed the implementation accordingly so that the anchor-text-documents now provide the doc_id, text, and anchors.

The main parts are done, but two things are still missing:

Would it be OK if we merged the current state and you helped with generating the metadata and documentation?

seanmacavaney commented 2 years ago

Awesome, thanks 🤘! This looks great. I opened a PR for this, and I'll take care of the metadata and documentation.

mam10eks commented 2 years ago

Nice, thanks! Please let me know if I can help further (I already saw that the automated checks failed, but this seems to be caused by the missing metadata).

seanmacavaney commented 2 years ago

@mam10eks -- can you accept the PR here with the metadata when you get a chance? https://github.com/mam10eks/ir_datasets/pull/1

seanmacavaney commented 2 years ago

Excellent, thanks again @mam10eks!