castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/
Apache License 2.0

Paragraph indexing #312

Closed lintool closed 5 years ago

lintool commented 6 years ago

Raised by @tuzhucheng as part of #311: How should we handle paragraph indexing in a more generic way? We shouldn't have separate Wikipedia and WikipediaParagraph collections. There should be a more generic -paragraph option as part of IndexCollection.

Here's my proposal: in addition to the content method in SourceDocument, we add a paragraphs method that returns List<String> - the paragraphs to be indexed. Each collection can "define" what it means by a paragraph, and the -paragraph option in IndexCollection just indexes those paragraphs.

If a collection doesn't support paragraph indexing, it can just throw an UnsupportedOperationException.
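A minimal sketch of this proposal, assuming a simplified SourceDocument interface (the interface and the blank-line-based Wikipedia splitter below are illustrative stand-ins, not actual Anserini code):

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for Anserini's SourceDocument.
interface SourceDocument {
  String id();
  String content();

  // Collections that don't support paragraph indexing inherit this default.
  default List<String> paragraphs() {
    throw new UnsupportedOperationException("paragraph indexing not supported");
  }
}

class WikipediaDocument implements SourceDocument {
  private final String id;
  private final String content;

  WikipediaDocument(String id, String content) {
    this.id = id;
    this.content = content;
  }

  @Override public String id() { return id; }
  @Override public String content() { return content; }

  // This collection "defines" a paragraph as a blank-line-separated block.
  @Override public List<String> paragraphs() {
    return Arrays.asList(content.split("\\n\\s*\\n"));
  }
}
```

The indexer would then call paragraphs() only when -paragraph is given, falling back to content() otherwise.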

Thoughts?

tuzhucheng commented 6 years ago

I like this idea of adding a paragraph method to the SourceDocument!

Peilin-Yang commented 6 years ago

I think -paragraph should be independent of Collection and SourceDocument. Unlike indexing at the document level, paragraph-based indexing has no clear standard - there is no correct or incorrect split scheme. For example, one could split a document into paragraphs by any regular expression; another approach might simply limit the number of terms in each paragraph, and so on.

For indexing, -paragraph should come with a scheme, and Collection just passes it to SourceDocument. SourceDocument splits the document according to the scheme and either returns the paragraphs or throws an error if it encounters issues.
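The two splitting schemes mentioned above (regex-based and term-count-based) could be sketched roughly like this; the class and method names are hypothetical, not Anserini API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParagraphSplitter {
  // Scheme 1: split wherever a caller-supplied regular expression matches.
  static List<String> byRegex(String text, String regex) {
    return Arrays.asList(text.split(regex));
  }

  // Scheme 2: greedily pack at most maxTerms whitespace-delimited terms
  // into each paragraph.
  static List<String> byTermCount(String text, int maxTerms) {
    String[] terms = text.split("\\s+");
    List<String> paragraphs = new ArrayList<>();
    for (int i = 0; i < terms.length; i += maxTerms) {
      int end = Math.min(i + maxTerms, terms.length);
      paragraphs.add(String.join(" ", Arrays.copyOfRange(terms, i, end)));
    }
    return paragraphs;
  }
}
```

Under this design, -paragraph <scheme> would select which of these strategies (with its parameters) gets applied.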

And we should keep in mind that searching over paragraphs is much more complicated. Often, what is really desired is to rank the entire document by weighting the scores of its paragraphs. For example, we could rank documents by the score of their first paragraph, by the max score over any paragraph in the document, or by some more complicated combination of paragraph scores. Also keep in mind that if we split a document into paragraphs, the document statistics are probably unavailable and all the collection statistics change - basically a completely new collection, which may or may not be what we want.
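The max-score aggregation mentioned above could look roughly like this, assuming paragraph hits are keyed as docid.paragraphIndex (a hypothetical convention; everything here is a sketch, not Anserini code):

```java
import java.util.HashMap;
import java.util.Map;

class ParagraphScoreAggregator {
  // paragraphScores maps "docid.paragraphIndex" -> retrieval score;
  // the result maps docid -> max score over that document's paragraphs.
  static Map<String, Double> maxAggregate(Map<String, Double> paragraphScores) {
    Map<String, Double> docScores = new HashMap<>();
    for (Map.Entry<String, Double> e : paragraphScores.entrySet()) {
      String docid = e.getKey().substring(0, e.getKey().lastIndexOf('.'));
      docScores.merge(docid, e.getValue(), Math::max);
    }
    return docScores;
  }
}
```

First-paragraph or weighted-combination strategies would just swap out the merge function.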

To me, building a paragraph-based index is not easy - in terms of complexity, I view it more like a distributed index. We could certainly implement just a paragraph indexing scheme without considering the questions I raise above too deeply, and have other code juggle the two indexes (document and paragraph). But at least for now I am hesitant to design and implement it.

tuzhucheng commented 6 years ago

Here is some context for the particular use case I had in mind when I created my pull request: essentially we are building a free-text question answering system using Wikipedia as the corpus, and we would like to explore indexing by paragraphs instead of whole articles (@lintool please correct me if I misunderstood you). As a simple approach, I was thinking that we directly retrieve paragraphs as if they were documents, instead of retrieving documents (articles) based on some aggregate score of their paragraphs and then ranking the most relevant paragraphs within those documents.

I agree, from looking at a few articles in the Wikipedia collection, that there is no standard scheme for defining what exactly a paragraph is. A possible design is to change -paragraph to -paragraph <scheme>, which would populate the List<String> of paragraphs inside SourceDocument based on the scheme.

Each scheme extends the ParagraphSegmentationScheme base class, which has access to the SourceDocument and its content, and applies some paragraph segmentation scheme to get the paragraphs.
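The ParagraphSegmentationScheme idea could be sketched like this; the abstract class and the blank-line scheme below are illustrative, not Anserini code:

```java
import java.util.Arrays;
import java.util.List;

// Base class that -paragraph <scheme> would select an implementation of.
abstract class ParagraphSegmentationScheme {
  // Each scheme turns a document's content into the paragraphs to index.
  abstract List<String> segment(String content);
}

// One concrete scheme: a paragraph is a blank-line-separated block.
class BlankLineScheme extends ParagraphSegmentationScheme {
  @Override List<String> segment(String content) {
    return Arrays.asList(content.split("\\n\\s*\\n"));
  }
}
```

New schemes (regex-based, length-based, etc.) would then just be additional subclasses, without touching the indexer itself.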

lintool commented 6 years ago

@Peilin-Yang I think this is pretty close to my proposal? The indexer delegates to the paragraphs method of SourceDocument, which determines the collection-specific details of how to chop up a document.

If you're not opposed, let's let @tuzhucheng prototype something out and let's see how it works...

BTW, I've given the collection organization some thought - actually Collection shouldn't implement Iterator<SourceDocument>, because then it would be very difficult to have multi-threaded indexing...

Collection should implement Iterator<FileSegment>, so we can at least have parallelism at the segment level.
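The segment-level design above could be sketched with simplified stand-ins for the classes being discussed (these are illustrative, not the actual Anserini interfaces); each FileSegment could then be handed to its own indexing thread:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// A file segment yields raw documents; one worker thread consumes one segment.
class FileSegment implements Iterable<String> {
  private final List<String> docs;
  FileSegment(List<String> docs) { this.docs = docs; }
  @Override public Iterator<String> iterator() { return docs.iterator(); }
}

// The collection yields segments rather than individual documents,
// so parallelism is available at the segment level.
class SimpleCollection implements Iterable<FileSegment> {
  private final List<FileSegment> segments;
  SimpleCollection(List<FileSegment> segments) { this.segments = segments; }
  @Override public Iterator<FileSegment> iterator() { return segments.iterator(); }
}
```

With this shape, a thread pool can pull whole segments off the iterator while Lucene's IndexWriter (which is thread-safe) receives documents from all workers.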

As to non-static collections, I think it's beyond the scope of our design for now...

lintool commented 6 years ago

@tuzhucheng I don't like the scheme idea - it introduces another level of indirection that we don't yet know we need...

Peilin-Yang commented 6 years ago

I think this is pretty close to my proposal?

@lintool I do not think so.... SourceDocument should be agnostic to how the document is split, and there should not be any collection-specific paragraph-splitting scheme. Instead, paragraph splitting should be an independent library; when enabled at indexing time, it should either be injected directly into SourceDocument as part of document processing or consume the document returned by SourceDocument.

If you're not opposed

No. I am not against implementing paragraph indexing, although I strongly believe Paragraph should be a separate package from Document.

actually Collection shouldn't implement Iterator<SourceDocument> because then it would be very difficult to have multi-threaded indexing...

Got it. I think you will take care of refactoring Collection and Document?

Peilin-Yang commented 6 years ago

Each scheme extends the ParagraphSegmentationScheme base class, which has access to the SourceDocument and its content, and applies some paragraph segmentation scheme to get the paragraphs.

Yes, I agree with the idea. But like I said above, the implementation could also be injectable into SourceDocument, in order to take advantage of buffered reading instead of directly consuming the large chunk of string returned by readNextRecord.

lintool commented 6 years ago

@tuzhucheng Okay, I think you have enough input to propose something... use your judgment? I don't think any of us know enough to design this "properly" at this point.

tuzhucheng commented 6 years ago

Rethinking this after prototyping the paragraphs approach: it is rather difficult to create paragraphs from the content string. All the delimiters for identifying sections are cleaned up and gone from the content string; this cleanup is performed in FileSegment's Document iterator.

Peilin-Yang commented 6 years ago

Like I said, previous works have used length-based (fixed terms per segment) or bin-based (fixed bins per document) segmentation, and they work reasonably well. See https://ciir-publications.cs.umass.edu/getpdf.php?id=1242 and https://www.eecis.udel.edu/~hfang/pubs/sigir06-expansion.pdf for example. If you would like to apply these ideas, then I think it will not be that complicated: you can put the post-processing right in the Generator before a document is added to the index. My idea is to keep id and contents and add segment-i fields. If you want to query all segments, then you can construct your query as field:segment* (I am not sure if this works though). Or you can also index the separate segments directly, if that's what you want.

Splitting based on string patterns is just unrealistic for unstructured text IMHO.
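The bin-based scheme mentioned above (a fixed number of bins per document) could be sketched like this; the helper class is hypothetical, not Anserini code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BinSegmenter {
  // Split a document into at most numBins roughly equal-sized bins of
  // whitespace-delimited terms.
  static List<String> toBins(String text, int numBins) {
    String[] terms = text.split("\\s+");
    int binSize = (int) Math.ceil((double) terms.length / numBins);
    List<String> bins = new ArrayList<>();
    for (int i = 0; i < terms.length; i += binSize) {
      int end = Math.min(i + binSize, terms.length);
      bins.add(String.join(" ", Arrays.copyOfRange(terms, i, end)));
    }
    return bins;
  }
}
```

Each bin could then be stored as one of the segment-i fields described above, or indexed as its own document.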

tuzhucheng commented 6 years ago

Thank you for the pointers @Peilin-Yang! I will read about those segmentation methods.

lintool commented 6 years ago

@Kytabyte has written a paragraph segmenter in Python for newswire. The quick-and-dirty idea is to pre-segment the collection and write out a new collection, then feed that into the Anserini indexer.

While we're figuring things out, this could be a reasonable step forward...
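The pre-segmentation step above amounts to rewriting each document as one record per paragraph, with the id suffixed by a paragraph index, so the ordinary indexer can consume the result unchanged. A minimal sketch (record format and naming are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

class PreSegmenter {
  // Turn one source document into (id, paragraph) records for a new collection.
  // Paragraphs here are blank-line-separated blocks; ids become "docid.i".
  static List<String[]> segmentRecord(String docid, String content) {
    List<String[]> records = new ArrayList<>();
    String[] paragraphs = content.split("\\n\\s*\\n");
    for (int i = 0; i < paragraphs.length; i++) {
      records.add(new String[] { docid + "." + i, paragraphs[i] });
    }
    return records;
  }
}
```

At search time, the document id can be recovered by stripping the suffix, which is what makes later score aggregation over paragraphs possible.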

lintool commented 6 years ago

See #358

lintool commented 5 years ago

Based on more recent discussions with @Victor0118 in the context of Wikipedia - I'm leaning more towards the above solution: pre-segment the collection via some external script or program to generate a new collection to feed into Anserini.

Closing for now, unless we come up with a cleaner solution.