Huffon / sentence-compressor

Compress your lengthy sentence 🗜️
7 stars, 1 fork

Is this only extractive? #1

Closed: Hellisotherpeople closed this issue 3 years ago

Hellisotherpeople commented 4 years ago

This looks like the holy grail I've always been looking for: a grammatically correct, word-level highlighting tool.

You've starred my CX_DB8 repo. Could I simply run your tool over each sentence for a better highlighting experience rather than my current method?

Huffon commented 4 years ago

Hi Allen, thank you for your interest!

First of all, there are two types of datasets for the sentence compression task. One is the extraction-based kind you mentioned, and the other is abstraction-based, like Gigaword.
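To make the distinction concrete, here is a minimal sketch (not code from this repository) of one way to test whether a compression is extractive: every word of the output must appear in the source, in the same order. The function name `is_extractive` and the whitespace tokenization are assumptions made for illustration only.

```python
def is_extractive(source: str, compressed: str) -> bool:
    """Return True if the words of `compressed` form an in-order
    subsequence of the words of `source` (whitespace tokenization)."""
    remaining = iter(source.split())
    # `word in remaining` consumes the iterator up to the first match,
    # so later words can only match later positions in the source.
    return all(word in remaining for word in compressed.split())


if __name__ == "__main__":
    src = "The quick brown fox jumped over the lazy dog"
    print(is_extractive(src, "The fox jumped"))  # words deleted only
    print(is_extractive(src, "A fox leaped"))    # new words introduced
```

An abstractive output that rewords or reorders the sentence fails this check, which is why such outputs cannot be mapped directly back onto the source text.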

So if you train your compression model only on extraction-based datasets, then yes, I think you can use it as a highlighting tool.
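As a rough sketch of that idea, assuming the model output is a strict in-order subset of the source words, the kept words can be aligned back to the sentence and marked. The `highlight` function, the `**` marker, and the whitespace tokenization are all hypothetical choices for illustration; a real model may change casing or punctuation, which this sketch does not handle.

```python
def highlight(source: str, compressed: str, mark: str = "**") -> str:
    """Wrap each source word that the compression kept in `mark`.

    Assumes `compressed` is an in-order subsequence of `source`
    under whitespace tokenization; raises ValueError otherwise.
    """
    kept = compressed.split()
    out = []
    k = 0  # index of the next kept word still to be matched
    for word in source.split():
        if k < len(kept) and word == kept[k]:
            out.append(f"{mark}{word}{mark}")
            k += 1
        else:
            out.append(word)
    if k != len(kept):
        raise ValueError("compressed text is not extractive for this source")
    return " ".join(out)


if __name__ == "__main__":
    print(highlight("The quick brown fox jumped", "The fox jumped"))
```

Running this per sentence over a document would give word-level highlights whose highlighted portion reads as a grammatical sentence, provided the model's compressions are themselves grammatical.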

But as far as I know, there are not many extraction-based datasets. Another concern is that the compression model in this repository is based on the BART architecture, so it might be too heavy to use as a highlighting tool.

In preprocess.py, you can find the extraction-based datasets (that is, all of them except Gigaword).

Hellisotherpeople commented 4 years ago

I actually plan on making a new dataset with both word-level extracts and abstracts from a diverse set of documents. Hopefully a conference will be willing to publish it, since I will be submitting it to NLP conferences.

Is the pretrained BART model that was trained without Gigaword trained only on extractive datasets? Or do other abstractive datasets make it in there? If it's only trained on extractive datasets, then I'm super excited and will try this out as soon as I can.

As far as highlighting goes, I'm okay if it's really compute intensive or GPU based. At this point, I'd only be using it for personal experiments and creating debate evidence for students who I coach.

If I can get training working with this, I will contribute trained extractive "highlighter" models and will likely point folks who want a more sophisticated version of CX_DB8 to you. I've searched long and hard for a high-quality, modern solution to the highlighting issue. The only "advantages" my method would offer are "biasable" summarization (summarization based on a query sentence) and not requiring additional training, but grammatical correctness is far more important.

Huffon commented 4 years ago

Yes. In the Example section, the result labeled BART (fine-tuned without Gigaword) comes from a BART model trained only on extraction-based datasets.

I also uploaded a fine-tuned version of BART for quick testing. You can download it and test it with your own sentences.

At the moment, I think of this as a PoC (Proof of Concept) level project, and I will keep improving it :)

Hope this project helps with CX_DB8.