fangfangli / cleartk

Automatically exported from code.google.com/p/cleartk

cleartk-summary #294

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
We should add a cleartk-summary project with implementations of common 
summarization algorithms, e.g. SumBasic. Some thoughts on how we could 
implement this:

(1) An extractive summarizer would be a Classifier<Boolean>, tagging each 
sentence that should be included in the summary as "true".

(2) Most summarizers would probably use the InstanceDataWriter, and work with 
the List<Feature>s for each sentence. (In most cases, these Features would just 
be token strings from a CoveredTextExtractor, but summarizing based on other 
Features would of course be possible too.)

(3) During training, the summarizer would run its selection algorithm over the 
sentences, and save as part of its model a Set<List<Feature>>, where each 
List<Feature> is a sentence selected for the summary.

(4) When classify(List<Feature>) is called, the summarizer would just return 
whether or not that sentence is one of the selected ones in the 
Set<List<Feature>>.

To actually get a summary, you'd first run your training pipeline over your 
collection, and then run your classifying pipeline over the same collection. 
We'd probably also provide an output writer that writes all SummarySentence 
annotations, one per line, to a file.

This approach should work for either single document summarization or 
multi-document summarization - the classifier wouldn't have to know which way 
it was being used.
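
A rough sketch of how steps (1)-(4) could fit together, not the actual cleartk 
code. Everything named here is hypothetical: SumBasicSketch, train and 
summarySize are made up for illustration, plain String tokens stand in for 
Feature objects, and the selection loop is a simplified SumBasic (it takes the 
sentence with the highest mean word probability and skips the rule that the 
sentence must contain the single most probable word).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; String tokens stand in for ClearTK Feature objects. */
class SumBasicSketch {
    /** The "model": sentences chosen for the summary during training. */
    private final Set<List<String>> selectedSentences = new HashSet<>();

    /** Simplified SumBasic: repeatedly take the sentence with the highest mean word probability. */
    void train(List<List<String>> sentences, int summarySize) {
        // Estimate p(token) as its relative frequency over the whole collection.
        Map<String, Double> probability = new HashMap<>();
        int tokenCount = 0;
        for (List<String> sentence : sentences) {
            for (String token : sentence) {
                probability.merge(token, 1.0, Double::sum);
                tokenCount++;
            }
        }
        final double total = tokenCount;
        probability.replaceAll((token, count) -> count / total);

        List<List<String>> remaining = new ArrayList<>(sentences);
        while (selectedSentences.size() < summarySize && !remaining.isEmpty()) {
            List<String> best = null;
            double bestScore = -1.0;
            for (List<String> sentence : remaining) {
                double score = meanProbability(sentence, probability);
                if (score > bestScore) {
                    best = sentence;
                    bestScore = score;
                }
            }
            selectedSentences.add(best);
            remaining.remove(best);
            // Square the probability of every token just used, to discourage redundancy.
            for (String token : new HashSet<>(best)) {
                probability.computeIfPresent(token, (t, p) -> p * p);
            }
        }
    }

    /** Step (4): a sentence is "true" exactly when training selected it. */
    boolean classify(List<String> sentence) {
        return selectedSentences.contains(sentence);
    }

    private static double meanProbability(List<String> sentence, Map<String, Double> probability) {
        if (sentence.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (String token : sentence) {
            sum += probability.get(token);
        }
        return sum / sentence.size();
    }
}
```

The lookup in classify depends on value equality of the lists, so with real 
Features it would only work if Feature defines equals and hashCode 
consistently.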

Original issue reported on code.google.com by steven.b...@gmail.com on 20 Mar 2012 at 12:42

GoogleCodeExporter commented 9 years ago

Original comment by steven.b...@gmail.com on 24 Jul 2012 at 5:35

GoogleCodeExporter commented 9 years ago
I've started taking a first pass at coding this up, but what I'm getting hung 
up on is how to tie the List<Feature> back to the sentence in the CAS. It seems 
straightforward enough to manipulate your Set<List<Feature>> to produce either 
a ranked list or to select a subset for extraction, but there doesn't really 
seem to be a good mechanism to point an instance back to its source sentence.

Some possible, but less than perfect solutions include:
* Making it a List<List<Feature>> and relying on the strict ordering to know 
which sentences to label as extracted.  This feels cumbersome.
* Somehow hashing the sentence text into a UUID, and storing that as a feature 
for later lookup. Or instead of a UUID, use the document URI plus the sentence 
span to uniquely identify the sentence.  It seems a little ugly to store 
metadata as a feature, as you really don't want to expose this to something 
like an SVM.  Alternatively, this could possibly be handled via a new 
TransformableFeature extractor that stores metadata during training and that 
returns an empty list during classification.
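
To make the "document URI plus sentence span" idea concrete, here is a minimal 
sketch. SentenceKey and its fields are hypothetical names, not anything that 
exists in cleartk; the point is only that a small value type kept alongside 
the features (rather than inside them) would give a stable lookup key that a 
learner like an SVM never sees.

```java
import java.util.Objects;

/** Hypothetical key identifying a sentence by document URI and character span. */
final class SentenceKey {
    private final String documentUri;
    private final int begin;
    private final int end;

    SentenceKey(String documentUri, int begin, int end) {
        this.documentUri = Objects.requireNonNull(documentUri);
        this.begin = begin;
        this.end = end;
    }

    // Value equality, so a key rebuilt during classification matches the one
    // stored during training.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof SentenceKey)) {
            return false;
        }
        SentenceKey that = (SentenceKey) other;
        return begin == that.begin
            && end == that.end
            && documentUri.equals(that.documentUri);
    }

    @Override
    public int hashCode() {
        return Objects.hash(documentUri, begin, end);
    }

    @Override
    public String toString() {
        return documentUri + "[" + begin + "," + end + ")";
    }
}
```

A model could then hold a Set<SentenceKey> of selected sentences and test 
membership with contains, at the cost of needing the URI and offsets available 
wherever the classifier is called.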

Original comment by lee.becker on 7 Sep 2012 at 5:24

GoogleCodeExporter commented 9 years ago
Assuming the summarizer requires sentence segmentation, tokenization, stemming, 
and POS tags, is there any way to avoid re-running the preprocessing for the 
classification stage?  While it's possible to write out XMIs at the end of 
training, and read them back in with an XMIReader, that doesn't seem much 
better than re-running everything.

What I really want is something like JCasIterable that I can iterate over 
multiple times.

Original comment by lee.becker on 7 Sep 2012 at 5:37

GoogleCodeExporter commented 9 years ago
So you don't need to store any link back to the sentence. Your classifier can 
look like:

    private Set<List<Feature>> selectedSentences; // the actual model

    public Boolean classify(List<Feature> sentence) {
        return selectedSentences.contains(sentence);
    }

As to your second question, I don't understand what's wrong with the XMI 
approach. I would just:

(1) Run all the preprocessing and save the XMIs
(2) Run the training on the XMIs
(3) Run the classification on the XMIs

JCasIterable won't really work for you because it only takes one pass through 
the JCases. Of course, you could look at the source of JCasIterable (it's 
pretty simple, really) and design a two-pass version for yourself. But I don't 
think that would be any simpler than just doing the standard XMI approach...

Original comment by steven.b...@gmail.com on 7 Sep 2012 at 8:25

GoogleCodeExporter commented 9 years ago
An initial version of this was checked in with revision 
4bb30747a5f81bababfd80107d763cba4e0592bc. It hasn't been thoroughly tested in 
the wild yet, but as of Issue 370 the cleartk-summarization code all has @Beta 
annotations, so I think we can safely mark this issue as resolved. If there 
needs to be future development on cleartk-summarization, then we'll open up new 
issues for that.

Original comment by steven.b...@gmail.com on 19 Jul 2013 at 6:22