Input for Single-Doc Summarization

Yale-LILY / SummerTime

An open-source text summarization toolkit for non-experts. EMNLP'2021 Demo

https://arxiv.org/abs/2108.12738

Apache License 2.0

Input for Single-Doc Summarization #113

Closed by johnhutx 2 years ago

johnhutx commented 2 years ago

Hello, is it possible to provide a list of (already split) sentences as the source input to the summarizer, instead of a single source document? The goal is to treat each list of sentences as one long sequence during extractive summarization.

niansong1996 commented 2 years ago

Hi, I am not sure I understand correctly. Do you want to provide a List[List[str]], where the outer list holds the documents and each inner list holds the sentences of one document?

If so, what's the level of extraction for the extractive summarization?

johnhutx commented 2 years ago

Yes, I would like to provide a List[List[str]]. Instead of extracting at the sentence level, I would like to extract at the List[str] level, so that each extraction selects a whole group of sentences.

niansong1996 commented 2 years ago

Hi @johnhutx, I am not sure I understand the situation correctly. Can you give an example? Is it possible to merge the inner list and use the current API instead?

johnhutx commented 2 years ago

No problem @niansong1996. Consider a TV screenplay that contains multiple scenes (List[scenes]). I would like to extract the important scenes, rather than individual sentences, from the screenplay. Each scene usually contains multiple sentences, which can be represented as a List[str]. The goal is to extract the scenes (List[str]) from the screenplay (List[List[str]]).
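
To make the shapes concrete, here is a minimal sketch of the input described above. The toy screenplay and the helper name `flatten_scenes` are made up for illustration and are not part of SummerTime.

```python
from typing import List

# Hypothetical toy data: a screenplay is a list of scenes,
# and each scene is a list of already-split sentences.
screenplay: List[List[str]] = [
    ["INT. KITCHEN - DAY.", "Anna pours coffee.", "The phone rings."],
    ["EXT. STREET - NIGHT.", "Rain falls on empty cars."],
]


def flatten_scenes(scenes: List[List[str]]) -> List[str]:
    """Join each scene's sentences into one string so a scene is a single unit."""
    return [" ".join(sentences) for sentences in scenes]


scene_texts = flatten_scenes(screenplay)  # one string per scene
```

With this representation, the unit being selected is an entry of `scene_texts` (a whole scene) rather than a single sentence.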

niansong1996 commented 2 years ago

Thanks for the clarification, it's much clearer to me now.

If you would like to extract scenes using summarization, I assume there is no query? Does this mean that the model would need to figure out which scenes are more important than others?

For the task you described, I think the best choice is probably to subclass our lexrank model here and customize it (change L27-40). Our implementation splits the document into a list of sentences, but you could instead input a list of scenes, where each scene is the concatenation of its sentences.
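
As a rough illustration of ranking scenes instead of sentences, here is a self-contained, LexRank-style sketch. It is not SummerTime's lexrank model or a subclass of it: it uses a simplified TF-IDF cosine-similarity graph with damped power iteration, and every name in it (`rank_scenes`, `tfidf_vectors`, the toy scenes) is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

PUNCT = ".,!?;:\"'()"


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    tokens = (w.strip(PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]


def tfidf_vectors(units: List[str]) -> List[Dict[str, float]]:
    """Build a sparse TF-IDF vector (term -> weight) for each text unit."""
    counts = [Counter(tokenize(u)) for u in units]
    doc_freq: Counter = Counter()
    for c in counts:
        doc_freq.update(c.keys())
    n = len(units)
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_scenes(scenes: List[List[str]], top_k: int = 1,
                damping: float = 0.85, iterations: int = 50) -> List[int]:
    """Return indices of the top_k most central scenes, in original order."""
    # Each scene becomes one unit: the concatenation of its sentences.
    units = [" ".join(sentences) for sentences in scenes]
    n = len(units)
    if n == 0:
        return []
    vectors = tfidf_vectors(units)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    # Row-normalize similarities into transition probabilities.
    for row in sim:
        total = sum(row)
        row[:] = [v / total for v in row] if total > 0 else [1.0 / n] * n
    # Damped power iteration (PageRank-style centrality).
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n
                  + damping * sum(scores[i] * sim[i][j] for i in range(n))
                  for j in range(n)]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


# Hypothetical toy screenplay: two related scenes and one unrelated scene.
scenes = [
    ["The detective finds the stolen necklace.", "The detective calls the police."],
    ["The detective returns the stolen necklace.", "The police thank the detective."],
    ["A bird sings outside the window."],
]
selected = [scenes[i] for i in rank_scenes(scenes, top_k=2)]
```

Because each scene is scored as a single concatenated string, the extracted units are whole scenes (List[str]). No query is involved: importance here just means similarity to the other scenes, which may or may not match what matters in a real screenplay.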

Hope this is helpful.

johnhutx commented 2 years ago

Thank you for the clarification.