Closed (lintool closed this issue 5 years ago)
I like this idea of adding a `paragraph` method to `SourceDocument`!

I think the `-paragraph` option should be independent of `Collection` and `SourceDocument`.
Unlike indexing at the document level, paragraph-based indexing has no clear standard: there is no correct or incorrect split scheme. For example, one could split a document into paragraphs with an arbitrary regular expression; another way might be to simply cap the number of terms in each paragraph, and so on.
For indexing, `-paragraph` should come with a scheme, and `Collection` just passes it to `SourceDocument`. `SourceDocument` splits the document following the scheme and either returns the paragraphs or throws an error if it encounters issues.
We should also keep in mind that searching over paragraphs is much more complicated. Often, what is really desired is to rank the entire document by weighing the scores of its paragraphs. For example, we could rank documents by the score of their first paragraph, by the maximum score over any of their paragraphs, or by some more complicated combination of paragraph scores. Also note that if we split a document into paragraphs, the document statistics are probably unavailable and all the collection statistics change: it is basically a completely new collection, which may or may not be what we want.
To me, building a paragraph-based index is not easy. In terms of complexity, I view it more like a distributed index. We could certainly implement just a paragraph indexing scheme without addressing the questions I raise above, and have other code juggle the two indexes (document and paragraph). But at least for now I am hesitant to design and implement it.
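To make the scoring options above concrete, here is a minimal sketch of aggregating per-paragraph scores into a document score (first paragraph, max over paragraphs, or a weighted sum). The `ParagraphScoreAggregator` class and its method names are hypothetical illustrations, not Anserini code.

```java
import java.util.List;

// Hypothetical helper: three ways to roll paragraph scores up into a
// document score, per the options mentioned in the comment above.
class ParagraphScoreAggregator {
  // Score the document by its first paragraph only.
  static double first(List<Double> scores) {
    return scores.get(0);
  }

  // Score the document by its best-scoring paragraph.
  static double max(List<Double> scores) {
    return scores.stream().mapToDouble(Double::doubleValue).max().orElse(0.0);
  }

  // Score the document by a weighted combination of paragraph scores.
  static double weightedSum(List<Double> scores, List<Double> weights) {
    double total = 0.0;
    for (int i = 0; i < scores.size(); i++) {
      total += weights.get(i) * scores.get(i);
    }
    return total;
  }
}
```

Any of these could sit between a paragraph-level index and the document-level ranking the comment describes.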
Here is some context for the particular use case I had in mind when I created my pull request: essentially, we are building a free-text question answering system using Wikipedia as the corpus, and we would like to explore indexing by paragraphs instead of whole articles (@lintool please correct me if I misunderstood you). As a simple approach, I was thinking that we directly retrieve paragraphs as if they were documents, instead of retrieving documents (articles) based on some aggregate score of their paragraphs and then ranking the most relevant paragraphs within those documents.
I agree, from looking at a few articles in the Wikipedia collection, that there is no standard scheme for defining what exactly a paragraph is. A possible design is to change `-paragraph` to `-paragraph <scheme>`, which would populate a `List<String>` of paragraphs inside `SourceDocument` based on the scheme. Each scheme implements the `ParagraphSegmentationScheme` base class, which has access to the `SourceDocument` and its `content` and applies some paragraph segmentation scheme to produce the paragraphs.
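A minimal sketch of what that base class and one concrete scheme might look like. This is an illustration of the proposal only; `ParagraphSegmentationScheme` and `BlankLineScheme` are assumed names, not existing Anserini classes.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed base class: a scheme turns raw document
// content into a list of paragraph strings.
abstract class ParagraphSegmentationScheme {
  abstract List<String> segment(String content);
}

// One possible concrete scheme: treat blank lines as paragraph breaks.
class BlankLineScheme extends ParagraphSegmentationScheme {
  @Override
  List<String> segment(String content) {
    List<String> paragraphs = new ArrayList<>();
    for (String chunk : content.split("\\n\\s*\\n")) {
      String trimmed = chunk.trim();
      if (!trimmed.isEmpty()) {
        paragraphs.add(trimmed);  // skip empty chunks
      }
    }
    return paragraphs;
  }
}
```

Under this design, `-paragraph <scheme>` would select which subclass the indexer instantiates.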
@Peilin-Yang I think this is pretty close to my proposal? The indexer delegates to the `paragraphs` method of `SourceDocument`, which determines the collection-specific details of how to chop up a document.
If you're not opposed, let's let @tuzhucheng prototype something out and let's see how it works...
BTW, I've given the collection organization some thought. Actually, `Collection` shouldn't implement `Iterator<SourceDocument>`, because then it would be very difficult to have multi-threaded indexing... `Collection` should implement `Iterator<FileSegment>` instead, so we can at least have parallelism at the segment level.
As to non-static collections, I think it's beyond the scope of our design for now...
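The segment-level parallelism idea can be sketched roughly as follows: iterate over `FileSegment`s and hand each one to a worker thread. `FileSegment` here is a stand-in (holding document strings) and `SegmentIndexer` is a hypothetical driver; neither reflects Anserini's actual implementation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Stand-in for a file segment: just a batch of documents.
class FileSegment {
  final List<String> docs;
  FileSegment(List<String> docs) { this.docs = docs; }
}

// Hypothetical driver: one indexing task per segment, so parallelism
// happens at the segment level, as proposed above.
class SegmentIndexer {
  static int indexAll(Iterator<FileSegment> segments, int nThreads) {
    ExecutorService pool = Executors.newFixedThreadPool(nThreads);
    List<Future<Integer>> futures = new ArrayList<>();
    while (segments.hasNext()) {
      FileSegment segment = segments.next();
      // Each task "indexes" its segment; here it just counts docs.
      futures.add(pool.submit(() -> segment.docs.size()));
    }
    int total = 0;
    try {
      for (Future<Integer> f : futures) total += f.get();
    } catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException(e);
    } finally {
      pool.shutdown();
    }
    return total;
  }
}
```

The key point is that the unit handed to each worker is a whole segment, not a single document, which is why `Collection` iterating over `FileSegment` (rather than `SourceDocument`) makes threading straightforward.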
@tuzhucheng I don't like the scheme idea - introduces another level of indirection that we don't know if we need yet...
> I think this is pretty close to my proposal?
@lintool I do not think so... `SourceDocument` should be agnostic to how the document is split, and there should not be any collection-specific paragraph splitting scheme. Instead, paragraph splitting should be an independent library; when enabled during indexing, it should either be injected directly into `SourceDocument` as part of document processing, or consume the document returned by `SourceDocument`.
> If you're not opposed

No, I am not against implementing paragraph indexing, although I strongly believe `Paragraph` should be a separate package from `Document`.
> actually Collection shouldn't implement Iterator because then it would be very difficult to have multi-threaded indexing...

Got it. I think you will take care of refactoring `Collection` and `Document`?
> Each scheme implements the ParagraphSegmentationScheme base class which has access to the SourceDocument and its content and applies some paragraph segmentation scheme to get the paragraphs.

Yes, I agree with the idea. But as I said above, the implementation could also be injectable into `SourceDocument`, in order to take advantage of buffered reading instead of directly consuming the large chunk of string returned by `readNextRecord`.
@tuzhucheng Okay, I think you have enough input to propose something... use your judgment? I don't think any of us know enough to design this "properly" at this point.
Rethinking this after prototyping the `paragraphs` approach: it is rather difficult to create paragraphs from the `content` string. All the delimiters for identifying sections have been cleaned up and are gone from the content string; this cleanup happens in `FileSegment`'s `Document` iterator.
Like I said, previous work has used length-based (a fixed number of terms per segment) or bin-based (a fixed number of bins per document) segmentation, and these work reasonably well.
See: https://ciir-publications.cs.umass.edu/getpdf.php?id=1242 and https://www.eecis.udel.edu/~hfang/pubs/sigir06-expansion.pdf for example.
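Length-based segmentation of the kind described in that passage-retrieval literature is simple to sketch: walk the term sequence and cut every N terms. `LengthBasedSegmenter` is an illustrative name, not part of Anserini.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of length-based segmentation: a fixed number of terms per
// segment, regardless of the document's own structure. This avoids the
// problem noted above of structural delimiters being stripped from the
// content string.
class LengthBasedSegmenter {
  static List<String> segment(String content, int termsPerSegment) {
    String[] terms = content.trim().split("\\s+");
    List<String> segments = new ArrayList<>();
    for (int i = 0; i < terms.length; i += termsPerSegment) {
      int end = Math.min(i + termsPerSegment, terms.length);
      segments.add(String.join(" ", Arrays.asList(terms).subList(i, end)));
    }
    return segments;
  }
}
```

A bin-based variant would instead fix the number of segments per document and derive the per-segment length from the document's term count.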
If you would like to apply these ideas, then I think it will not be that complicated: you can put the post-processing right in the `Generator`, before a document is added to the index.
My idea is to keep `id` and `contents` and add `segment-i` fields. If you want to query all segments, you could construct your query as `field:segment*` (I am not sure if this works, though). Or you could index the separated segments directly, if that's what you want.
Splitting based on string patterns is just unrealistic for unstructured text IMHO.
Thank you for the pointers @Peilin-Yang! I will read about those segmentation methods.
@Kytabyte has written a paragraph segmenter in Python for newswire. The quick-and-dirty idea is to pre-segment the collection and write out a new collection, then feed that into the Anserini indexer.
While we're figuring things out, this could be a reasonable step forward...
See #358
Based on more recent discussions with @Victor0118 in the context of Wikipedia - I'm leaning more towards the above solution: pre-segment the collection via some external script or program to generate a new collection to feed into Anserini.
Closing for now, unless we come up with a cleaner solution.
Raised by @tuzhucheng as part of #311: How should we handle paragraph indexing in a more generic way? We shouldn't have separate `Wikipedia` and `WikipediaParagraph` collections. There should be a more generic `-paragraph` option as part of `IndexCollection`.

Here's my proposal: in addition to the `content` method in `SourceDocument`, we add a `paragraph` method that returns a `List<String>` of the paragraphs to be indexed. Each collection can "define" what it means by a paragraph, and the `-paragraph` option in `IndexCollection` just indexes those paragraphs. If a collection doesn't support paragraph indexing, it can just throw an `UnsupportedOperationException`.

Thoughts?
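The proposal above can be sketched as a default interface method. This is a simplified illustration; the real Anserini `SourceDocument` interface differs, and `WikiDocument` here is a made-up example collection type.

```java
import java.util.List;

// Simplified stand-in for Anserini's SourceDocument (illustrative only).
interface SourceDocument {
  String id();
  String content();

  // Proposed addition: each collection defines what a "paragraph" means.
  // Collections that don't support paragraph indexing keep this default.
  default List<String> paragraph() {
    throw new UnsupportedOperationException("paragraph indexing not supported");
  }
}

// Hypothetical collection that defines a paragraph as a blank-line block.
class WikiDocument implements SourceDocument {
  private final String id;
  private final String content;
  WikiDocument(String id, String content) {
    this.id = id;
    this.content = content;
  }
  public String id() { return id; }
  public String content() { return content; }
  @Override
  public List<String> paragraph() {
    return List.of(content.split("\\n\\n"));
  }
}
```

With this shape, `-paragraph` in `IndexCollection` would simply call `paragraph()` and index each returned string, and unsupported collections fail fast with the exception.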