Open seanmacavaney opened 3 years ago
Re-visit pyterrier examples (#73) when this change is made. Should it always use `text` instead of specifying individual fields? Should this be the new default setting in pyterrier itself?
Chatting with @eugene-yang -- having this as a method called `.default_text()` may be better.
Starting on this. Here's a list of all `NamedTuple`s for queries and docs:
[x] ir_datasets/datasets/aol_ia.py: AolIaDoc
[x] ir_datasets/datasets/beir.py: BeirDoc
[x] ir_datasets/datasets/beir.py: BeirTitleDoc
[x] ir_datasets/datasets/beir.py: BeirTitleUrlDoc
[ ] ir_datasets/datasets/beir.py: BeirSciDoc
[x] ir_datasets/datasets/beir.py: BeirCordDoc
[ ] ir_datasets/datasets/beir.py: BeirToucheDoc
[x] ir_datasets/datasets/beir.py: BeirCqaDoc
[ ] ir_datasets/datasets/beir.py: BeirUrlQuery
[x] ir_datasets/datasets/beir.py: BeirSciQuery
[ ] ir_datasets/datasets/beir.py: BeirToucheQuery
[x] ir_datasets/datasets/beir.py: BeirCovidQuery
[x] ir_datasets/datasets/beir.py: BeirCqaQuery
[x] ir_datasets/datasets/c4.py: C4Doc
[x] ir_datasets/datasets/c4.py: MisinfoQuery
[x] ir_datasets/datasets/car.py: CarQuery
[x] ir_datasets/datasets/clinicaltrials.py: ClinicalTrialsDoc
[x] ir_datasets/datasets/clueweb09.py: TrecWebTrackQuery
[x] ir_datasets/datasets/clueweb12.py: TrecWebTrackQuery
[x] ir_datasets/datasets/clueweb12.py: NtcirQuery
[x] ir_datasets/datasets/clueweb12.py: MisinfoQuery
[x] ir_datasets/datasets/codec.py: CodecDoc
[x] ir_datasets/datasets/codec.py: CodecQuery
[ ] ir_datasets/datasets/codesearchnet.py: CodeSearchNetDoc
[x] ir_datasets/datasets/cord19.py: Cord19Doc
[x] ir_datasets/datasets/cord19.py: Cord19FullTextDoc
[x] ir_datasets/datasets/cranfield.py: CranfieldDoc
[x] ir_datasets/datasets/dpr_w100.py: DprW100Doc
[x] ir_datasets/datasets/dpr_w100.py: DprW100Query
[x] ir_datasets/datasets/gov.py: GovWeb02Query
[x] ir_datasets/datasets/gov.py: GovDoc
[x] ir_datasets/datasets/gov2.py: Gov2Doc
[x] ir_datasets/datasets/highwire.py: HighwireDoc
[x] ir_datasets/datasets/istella22.py: Istella22Doc
[x] ir_datasets/datasets/kilt.py: KiltDoc
[x] ir_datasets/datasets/medline.py: MedlineDoc
[x] ir_datasets/datasets/medline.py: TrecGenomicsQuery
[x] ir_datasets/datasets/medline.py: TrecPm2017Query
[x] ir_datasets/datasets/medline.py: TrecPmQuery
[x] ir_datasets/datasets/msmarco_document.py: MsMarcoDocument
[x] ir_datasets/datasets/msmarco_document.py: MsMarcoAnchorTextDocument
[x] ir_datasets/datasets/msmarco_document_v2.py: MsMarcoV2Document
[x] ir_datasets/datasets/msmarco_document_v2.py: MsMarcoV2AnchorTextDocument
[x] ir_datasets/datasets/msmarco_passage_v2.py: MsMarcoV2Passage
[x] ir_datasets/datasets/msmarco_qna.py: MsMarcoQnAQuery
[x] ir_datasets/datasets/msmarco_qna.py: MsMarcoQnAEvalQuery
[x] ir_datasets/datasets/msmarco_qna.py: MsMarcoQnADoc
[x] ir_datasets/datasets/natural_questions.py: NqPassageDoc
[x] ir_datasets/datasets/nfcorpus.py: NfCorpusDoc
[x] ir_datasets/datasets/nfcorpus.py: NfCorpusQuery
[ ] ir_datasets/datasets/nfcorpus.py: NfCorpusVideoQuery
[x] ir_datasets/datasets/nyt.py: NytDoc
[x] ir_datasets/datasets/pmc.py: PmcDoc
[x] ir_datasets/datasets/pmc.py: TrecCdsQuery
[x] ir_datasets/datasets/pmc.py: TrecCds2016Query
[x] ir_datasets/datasets/trec_cast.py: Cast2019Query
[x] ir_datasets/datasets/trec_cast.py: Cast2020Query
[x] ir_datasets/datasets/trec_fair.py: FairTrecDoc
[x] ir_datasets/datasets/trec_fair.py: FairTrec2022Doc
[x] ir_datasets/datasets/trec_fair.py: FairTrecQuery
[x] ir_datasets/datasets/trec_fair.py: FairTrec2022TrainQuery
[x] ir_datasets/datasets/trec_fair.py: FairTrecEvalQuery
[x] ir_datasets/datasets/trec_mandarin.py: TrecMandarinQuery
[x] ir_datasets/datasets/trec_spanish.py: TrecDescOnlyQuery
[x] ir_datasets/datasets/trec_spanish.py: TrecSpanish3Query
[x] ir_datasets/datasets/trec_spanish.py: TrecSpanish4Query
[x] ir_datasets/datasets/tripclick.py: TripClickPartialDoc
[x] ir_datasets/datasets/tweets2013_ia.py: TweetDoc
[x] ir_datasets/datasets/tweets2013_ia.py: TrecMb13Query
[x] ir_datasets/datasets/tweets2013_ia.py: TrecMb14Query
[x] ir_datasets/datasets/wapo.py: WapoDoc
[ ] ir_datasets/datasets/wapo.py: TrecBackgroundLinkingQuery
[x] ir_datasets/datasets/wikiclir.py: WikiClirQuery
[x] ir_datasets/datasets/wikiclir.py: WikiClirDoc
[ ] ir_datasets/formats/argsme.py: ArgsMeDoc
[ ] ir_datasets/formats/argsme.py: ArgsMeProcessedDoc
[x] ir_datasets/formats/base.py: GenericDoc
[x] ir_datasets/formats/base.py: GenericQuery
[x] ir_datasets/formats/extracted_cc.py: ExctractedCCDoc
[x] ir_datasets/formats/extracted_cc.py: ExctractedCCQuery
[x] ir_datasets/formats/touche.py: ToucheQuery
[x] ir_datasets/formats/touche.py: ToucheTitleQuery
[x] ir_datasets/formats/touche.py: ToucheComparativeQuery
[x] ir_datasets/formats/touche.py: TouchePassageDoc
[ ] ir_datasets/formats/touche_image.py: ToucheImageDoc
[x] ir_datasets/formats/trec.py: TrecDoc
[x] ir_datasets/formats/trec.py: TitleUrlTextDoc
[x] ir_datasets/formats/trec.py: TrecParsedDoc
[x] ir_datasets/formats/trec.py: TrecQuery
[ ] ir_datasets/formats/webarc.py: WarcDoc
Is your feature request related to a problem? Please describe.
@andrewyates points out these two use cases when working with docs:
Right now the design favours (2) -- trying to remain as true to the original dataset as possible. But the (1) use case is still pretty valuable, and right now it can be tricky -- especially when some fields contain lists of sub-elements. (Without this, it's not so hard to always have the user specify a list of fields to concatenate.)
Describe the solution you'd like
Enforce a new policy that all document types need a `text` field that concatenates all "valuable" text (judgment call here), without markup. To avoid excessive redundancy/memory/etc., this could be in the form of a property of the named tuple.
Describe alternatives you've considered
As mentioned above, the original design was pushing the concatenation work on to the user. But this gets much more challenging for list-based fields. This technique can still be used, even if there's this new policy.
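To make the trade-off concrete, here is a minimal sketch of both approaches: a computed `text` property on the named tuple, and user-side concatenation by field name. `ExampleDoc` and `concat_fields` are hypothetical names made up for illustration; they are not part of ir_datasets, and the choice of which fields count as "valuable" is an assumption.

```python
from typing import List, NamedTuple


class ExampleDoc(NamedTuple):
    # Hypothetical document type; not one of the ir_datasets classes.
    doc_id: str
    title: str
    sections: List[str]  # list-based field, awkward to concatenate by name

    @property
    def text(self) -> str:
        # Computed on access, so nothing extra is stored on the tuple.
        return "\n".join([self.title, *self.sections])


def concat_fields(doc, fields):
    # User-side alternative (hypothetical helper): join the named fields.
    # Fine for plain strings, but a list field ends up as its repr.
    return "\n".join(str(getattr(doc, f)) for f in fields)


doc = ExampleDoc("d1", "A title", ["First section.", "Second section."])
print(doc.text)
print(concat_fields(doc, ["title", "sections"]))
```

The property can flatten `sections` because it knows the structure of its own type, whereas the generic helper would need per-dataset special cases to do the same.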
Additional context
We already have a policy where all documents need a `doc_id`.
Do we need a similar policy for queries? Probably not, as the query fields tend to be distinct representations of the information need---which probably shouldn't be combined, as that would not really reflect a real-life situation.
Since properties are not represented in the NamedTuple fields, how is this incorporated into the documentation?
Does any code that relies on `_fields` or `__annotations__` need to be updated to reflect the new `text` property?
Does this have an effect on pickled content (i.e., in built docstores)? Will we need to include migration logic for a ton of datasets (uggh) or will the new property flow through automatically in new versions?
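On the `_fields` / `__annotations__` question, a quick check with plain `typing.NamedTuple` suggests that a property stays invisible to both, so code that iterates over fields would not see it unless it is handled separately. `ExampleDoc` below is a hypothetical type for illustration, and this says nothing about how ir_datasets itself generates documentation or docstores.

```python
from typing import List, NamedTuple


class ExampleDoc(NamedTuple):
    # Hypothetical document type, for illustration only.
    doc_id: str
    title: str
    sections: List[str]

    @property
    def text(self) -> str:
        return "\n".join([self.title, *self.sections])


doc = ExampleDoc("d1", "A title", ["Body."])

print(ExampleDoc._fields)                    # declared field names only
print("text" in ExampleDoc.__annotations__)  # the property is not annotated
print("text" in doc._asdict())               # and is not part of the dict view
print(doc.text)                              # but is still available on access
```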
Is there a complication with the trend for approaches to split the document body into passages and prepend the title to every passage (e.g., in ColBERT and others, I think)? If `text` already includes the title, the first passage may have the title repeated, for instance.
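The title-repetition concern can be reproduced with a toy passaging routine. This is a sketch under assumptions: `split_into_passages` and `prepend_title` are hypothetical helpers using naive whitespace splitting, not the passaging logic of ColBERT or any other system.

```python
from typing import List


def split_into_passages(text: str, words_per_passage: int) -> List[str]:
    # Naive fixed-size passaging on whitespace tokens (illustrative only).
    words = text.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]


def prepend_title(title: str, passages: List[str]) -> List[str]:
    # Prefix every passage with the document title.
    return [f"{title} {p}" for p in passages]


title = "Example Title"
body = "one two three four five six"
combined_text = f"{title} {body}"  # what a concatenating `text` might return

from_body = prepend_title(title, split_into_passages(body, 4))
from_text = prepend_title(title, split_into_passages(combined_text, 4))

print(from_body[0])  # title appears once
print(from_text[0])  # title appears twice
```

Passaging the body alone keeps the title single, while passaging a `text` that already includes the title duplicates it in the first passage.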
How much is too much work to be done in the property? Could HTML parsing be done there? Should it be cached once processed?
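One possible answer to the caching question, sketched with the standard library only: do the markup stripping in a module-level function wrapped in `functools.lru_cache`, and have the property call it. Named tuple instances have no per-instance `__dict__`, so the cache here lives outside the instance. `ExampleHtmlDoc`, `strip_markup`, and the cache size are assumptions for illustration, and the parser is deliberately naive (it would also keep script and style contents).

```python
from functools import lru_cache
from html.parser import HTMLParser
from typing import NamedTuple


class _TextExtractor(HTMLParser):
    # Collects the text nodes of an HTML document and ignores the tags.
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


@lru_cache(maxsize=1024)  # cache size is an arbitrary choice
def strip_markup(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by removed tags.
    return " ".join(" ".join(parser.chunks).split())


class ExampleHtmlDoc(NamedTuple):
    # Hypothetical document type, for illustration only.
    doc_id: str
    html: str

    @property
    def text(self) -> str:
        # Parsing happens on first access; repeats hit the cache.
        return strip_markup(self.html)


doc = ExampleHtmlDoc("d1", "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>")
print(doc.text)
```

A cache keyed on the raw HTML string holds a reference to that string, so for large collections the memory cost may outweigh the saved parsing time.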