"every doc needs a text field/property" policy

seanmacavaney commented 3 years ago

Is your feature request related to a problem? Please describe.

@andrewyates points out these two use cases when working with docs:

1. do something sane for the situation where you want to swap out the datasets without specializing your model to one (e.g., treat cord section headers as part of the paragraph)
2. tailor your approach to collection-specific markup, like cord section headers, extra WaPo fields, maybe dates in robust04, etc

Right now the design favours (2) -- remaining as true to the original dataset as possible. But use case (1) is still pretty valuable, and right now it can be tricky -- especially when some fields contain lists of sub-elements. (Without such list-valued fields, it's not so hard to always have the user specify a list of fields to concatenate.)

Describe the solution you'd like

Enforce a new policy that all document types need a text field that concatenates all "valuable" text (judgment call here), without markup. To avoid excessive redundancy/memory/etc., this could be in the form of a property of the named tuple, like so:

>>> from typing import NamedTuple
>>> class Doc(NamedTuple):
...  doc_id: str
...  title: str
...  body: str
...  @property
...  def text(self):
...   return f'{self.title} {self.body}'
... 
>>> Doc('1', 'title', 'body').text
'title body'

Describe alternatives you've considered

As mentioned above, the original design pushed the concatenation work onto the user. But this gets much more challenging for list-based fields. User-side concatenation can still be used, even with this new policy in place.
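
To make the list-based-field problem concrete, here is a minimal sketch. The `Section` and `FullTextDoc` types and both helper functions are hypothetical and not taken from ir_datasets; they only illustrate why a generic "join these field names" helper stops working once a field holds sub-elements.

```python
from typing import List, NamedTuple


class Section(NamedTuple):  # hypothetical sub-element type
    heading: str
    text: str


class FullTextDoc(NamedTuple):  # hypothetical document type
    doc_id: str
    title: str
    sections: List[Section]


def concat_fields(doc, fields):
    """User-side concatenation; only works when every named field is a str."""
    return ' '.join(getattr(doc, f) for f in fields)


def flatten_sections(doc):
    """Dataset-specific flattening that knows how to walk the sub-elements."""
    parts = [doc.title]
    for section in doc.sections:
        parts.extend([section.heading, section.text])
    # Skip empty strings so missing headings don't leave double spaces.
    return ' '.join(p for p in parts if p)


doc = FullTextDoc('1', 'A title', [Section('Intro', 'first'), Section('', 'second')])

plain = concat_fields(doc, ['title'])  # fine: 'title' is a plain string
try:
    concat_fields(doc, ['title', 'sections'])  # fails: 'sections' is a list
    generic_failed = False
except TypeError:
    generic_failed = True

flattened = flatten_sections(doc)
```

The generic helper raises a `TypeError` on the list field, while the flattening logic needs knowledge of the document type. That is the work the proposed policy would move from the user into the library.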

Additional context

We already have a policy where all documents need a doc_id.

Do we need a similar policy for queries? Probably not, as the query fields tend to be distinct representations of the information need---which probably shouldn't be combined, as that would not really reflect a real-life situation.

Since properties are not represented in the NamedTuple fields, how is this incorporated into the documentation?

Does any code that relies on _fields or __annotations__ need to be updated to reflect the new text property?
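
As a quick check of what introspection sees, the sketch below reuses the `Doc` example from the proposal. A property defined on a `NamedTuple` subclass is not a field, so it does not show up in `_fields`, `__annotations__`, `_asdict()`, or the tuple contents. Whether any code in ir_datasets needs to change because of that is still the open question.

```python
from typing import NamedTuple


class Doc(NamedTuple):
    doc_id: str
    title: str
    body: str

    @property
    def text(self) -> str:
        return f'{self.title} {self.body}'


doc = Doc('1', 'title', 'body')

fields = Doc._fields                        # only the declared fields
in_annotations = 'text' in Doc.__annotations__
as_dict = doc._asdict()                     # no 'text' key
stored = tuple(doc)                         # the stored values are unchanged
via_property = doc.text                     # still reachable as an attribute
```

Anything that iterates over `_fields` (documentation generators, exporters, docstore builders) would therefore skip `text` unless it is taught about it explicitly.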

Does this have an effect on pickled content (i.e., in built docstores)? Will we need to include migration logic for a ton of datasets (uggh) or will the new property flow through automatically in new versions?

Does this complicate the common approach of splitting the document body into passages and prepending the title to every passage (e.g., in ColBERT and others, I think)? If the concatenated text already starts with the title, the first passage would end up with the title repeated, for instance.
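
The title-repetition concern can be sketched as follows. The `passages_with_title` function and its fixed word-window splitting are hypothetical simplifications, not how ColBERT or any particular system does passaging.

```python
from typing import List


def passages_with_title(title: str, body: str, size: int = 3) -> List[str]:
    """Split `body` into windows of `size` words and prepend `title` to each."""
    words = body.split()
    return [
        title + ' ' + ' '.join(words[i:i + size])
        for i in range(0, len(words), size)
    ]


title, body = 'Title', 'a b c d e f'

# Passaging the raw body: the title appears once per passage.
from_body = passages_with_title(title, body)

# Passaging a concatenated text that already starts with the title:
# the first passage now contains the title twice.
from_text = passages_with_title(title, f'{title} {body}')
```

Here `from_body` is `['Title a b c', 'Title d e f']`, while `from_text` is `['Title Title a b', 'Title c d e', 'Title f']`.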

How much is too much work to be done in the property? Could HTML parsing be done there? Should it be cached once processed?
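
One possible shape for "parse in the property, but cache it" is sketched below, under assumptions: `HtmlDoc`, `html_to_text`, and the cache size are hypothetical, and the tag stripping is deliberately naive (it would also keep script and style contents). Note that `functools.cached_property` is not an option here, because `NamedTuple` instances have no `__dict__` to store the cached value in; the sketch caches in a module-level function instead.

```python
from functools import lru_cache
from html.parser import HTMLParser
from typing import List, NamedTuple


class _TextExtractor(HTMLParser):
    """Collects the character data between tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


@lru_cache(maxsize=1024)  # arbitrary size; keyed on the raw HTML string
def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text chunks and collapse runs of whitespace.
    return ' '.join(' '.join(parser.parts).split())


class HtmlDoc(NamedTuple):  # hypothetical document type
    doc_id: str
    html: str

    @property
    def text(self) -> str:
        return html_to_text(self.html)


doc = HtmlDoc('1', '<h1>Title</h1><p>Body   text</p>')
first = doc.text    # parses the HTML
second = doc.text   # served from the cache
```

A bounded cache keeps memory in check when iterating over a large collection, at the cost of re-parsing documents that have been evicted.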

seanmacavaney commented 3 years ago

Re-visit pyterrier examples (#73) when this change is made. Should it always use text instead of specifying individual fields? Should this be the new default setting in pyterrier itself?

seanmacavaney commented 2 years ago

Chatting with @eugene-yang -- having this as a method called .default_text() may be better.

- It shows the user that it may involve some compute (a method rather than a property).
- It indicates that it's just the "default" configuration -- not necessarily all the text.
- It doesn't conflict with existing text fields.
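
A minimal sketch of what this could look like for a document type that already has a `text` field. The `ExampleDoc` type and the choice of which fields to join are assumptions for illustration, not the implementation that was merged.

```python
from typing import NamedTuple


class ExampleDoc(NamedTuple):  # hypothetical document type
    doc_id: str
    title: str
    url: str
    text: str  # existing field; a property named `text` would clash with it

    def default_text(self) -> str:
        """Title and text joined; a judgment call about the 'valuable' fields."""
        return f'{self.title} {self.text}'


doc = ExampleDoc('1', 'A title', 'http://example.com', 'The body.')

raw = doc.text                 # the original field is untouched
combined = doc.default_text()  # the dataset-agnostic view
```

Because `default_text` is a method, the existing `text` field keeps its original meaning, and callers that want a dataset-agnostic string opt in explicitly.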

seanmacavaney commented 1 year ago

Starting on this. Here's a list of all NamedTuples for queries and docs:

[x] ir_datasets/datasets/aol_ia.py: AolIaDoc
[x] ir_datasets/datasets/beir.py: BeirDoc
[x] ir_datasets/datasets/beir.py: BeirTitleDoc
[x] ir_datasets/datasets/beir.py: BeirTitleUrlDoc
[ ] ir_datasets/datasets/beir.py: BeirSciDoc
[x] ir_datasets/datasets/beir.py: BeirCordDoc
[ ] ir_datasets/datasets/beir.py: BeirToucheDoc
[x] ir_datasets/datasets/beir.py: BeirCqaDoc
[ ] ir_datasets/datasets/beir.py: BeirUrlQuery
[x] ir_datasets/datasets/beir.py: BeirSciQuery
[ ] ir_datasets/datasets/beir.py: BeirToucheQuery
[x] ir_datasets/datasets/beir.py: BeirCovidQuery
[x] ir_datasets/datasets/beir.py: BeirCqaQuery
[x] ir_datasets/datasets/c4.py: C4Doc
[x] ir_datasets/datasets/c4.py: MisinfoQuery
[x] ir_datasets/datasets/car.py: CarQuery
[x] ir_datasets/datasets/clinicaltrials.py: ClinicalTrialsDoc
[x] ir_datasets/datasets/clueweb09.py: TrecWebTrackQuery
[x] ir_datasets/datasets/clueweb12.py: TrecWebTrackQuery
[x] ir_datasets/datasets/clueweb12.py: NtcirQuery
[x] ir_datasets/datasets/clueweb12.py: MisinfoQuery
[x] ir_datasets/datasets/codec.py: CodecDoc
[x] ir_datasets/datasets/codec.py: CodecQuery
[ ] ir_datasets/datasets/codesearchnet.py: CodeSearchNetDoc
[x] ir_datasets/datasets/cord19.py: Cord19Doc
[x] ir_datasets/datasets/cord19.py: Cord19FullTextDoc
[x] ir_datasets/datasets/cranfield.py: CranfieldDoc
[x] ir_datasets/datasets/dpr_w100.py: DprW100Doc
[x] ir_datasets/datasets/dpr_w100.py: DprW100Query
[x] ir_datasets/datasets/gov.py: GovWeb02Query
[x] ir_datasets/datasets/gov.py: GovDoc
[x] ir_datasets/datasets/gov2.py: Gov2Doc
[x] ir_datasets/datasets/highwire.py: HighwireDoc
[x] ir_datasets/datasets/istella22.py: Istella22Doc
[x] ir_datasets/datasets/kilt.py: KiltDoc
[x] ir_datasets/datasets/medline.py: MedlineDoc
[x] ir_datasets/datasets/medline.py: TrecGenomicsQuery
[x] ir_datasets/datasets/medline.py: TrecPm2017Query
[x] ir_datasets/datasets/medline.py: TrecPmQuery
[x] ir_datasets/datasets/msmarco_document.py: MsMarcoDocument
[x] ir_datasets/datasets/msmarco_document.py: MsMarcoAnchorTextDocument
[x] ir_datasets/datasets/msmarco_document_v2.py: MsMarcoV2Document
[x] ir_datasets/datasets/msmarco_document_v2.py: MsMarcoV2AnchorTextDocument
[x] ir_datasets/datasets/msmarco_passage_v2.py: MsMarcoV2Passage
[x] ir_datasets/datasets/msmarco_qna.py: MsMarcoQnAQuery
[x] ir_datasets/datasets/msmarco_qna.py: MsMarcoQnAEvalQuery
[x] ir_datasets/datasets/msmarco_qna.py: MsMarcoQnADoc
[x] ir_datasets/datasets/natural_questions.py: NqPassageDoc
[x] ir_datasets/datasets/nfcorpus.py: NfCorpusDoc
[x] ir_datasets/datasets/nfcorpus.py: NfCorpusQuery
[ ] ir_datasets/datasets/nfcorpus.py: NfCorpusVideoQuery
[x] ir_datasets/datasets/nyt.py: NytDoc
[x] ir_datasets/datasets/pmc.py: PmcDoc
[x] ir_datasets/datasets/pmc.py: TrecCdsQuery
[x] ir_datasets/datasets/pmc.py: TrecCds2016Query
[x] ir_datasets/datasets/trec_cast.py: Cast2019Query
[x] ir_datasets/datasets/trec_cast.py: Cast2020Query
[x] ir_datasets/datasets/trec_fair.py: FairTrecDoc
[x] ir_datasets/datasets/trec_fair.py: FairTrec2022Doc
[x] ir_datasets/datasets/trec_fair.py: FairTrecQuery
[x] ir_datasets/datasets/trec_fair.py: FairTrec2022TrainQuery
[x] ir_datasets/datasets/trec_fair.py: FairTrecEvalQuery
[x] ir_datasets/datasets/trec_mandarin.py: TrecMandarinQuery
[x] ir_datasets/datasets/trec_spanish.py: TrecDescOnlyQuery
[x] ir_datasets/datasets/trec_spanish.py: TrecSpanish3Query
[x] ir_datasets/datasets/trec_spanish.py: TrecSpanish4Query
[x] ir_datasets/datasets/tripclick.py: TripClickPartialDoc
[x] ir_datasets/datasets/tweets2013_ia.py: TweetDoc
[x] ir_datasets/datasets/tweets2013_ia.py: TrecMb13Query
[x] ir_datasets/datasets/tweets2013_ia.py: TrecMb14Query
[x] ir_datasets/datasets/wapo.py: WapoDoc
[ ] ir_datasets/datasets/wapo.py: TrecBackgroundLinkingQuery
[x] ir_datasets/datasets/wikiclir.py: WikiClirQuery
[x] ir_datasets/datasets/wikiclir.py: WikiClirDoc
[ ] ir_datasets/formats/argsme.py: ArgsMeDoc
[ ] ir_datasets/formats/argsme.py: ArgsMeProcessedDoc
[x] ir_datasets/formats/base.py: GenericDoc
[x] ir_datasets/formats/base.py: GenericQuery
[x] ir_datasets/formats/extracted_cc.py: ExctractedCCDoc
[x] ir_datasets/formats/extracted_cc.py: ExctractedCCQuery
[x] ir_datasets/formats/touche.py: ToucheQuery
[x] ir_datasets/formats/touche.py: ToucheTitleQuery
[x] ir_datasets/formats/touche.py: ToucheComparativeQuery
[x] ir_datasets/formats/touche.py: TouchePassageDoc
[ ] ir_datasets/formats/touche_image.py: ToucheImageDoc
[x] ir_datasets/formats/trec.py: TrecDoc
[x] ir_datasets/formats/trec.py: TitleUrlTextDoc
[x] ir_datasets/formats/trec.py: TrecParsedDoc
[x] ir_datasets/formats/trec.py: TrecQuery
[ ] ir_datasets/formats/webarc.py: WarcDoc

allenai / ir_datasets

"every doc needs a text field/property" policy #72