dmmiller612 / bert-extractive-summarizer

Easy to use extractive text summarization with BERT
MIT License

Weird sentence splitting #67

Open Magdiel3 opened 4 years ago

Magdiel3 commented 4 years ago

Weird sentence splitting

I am currently using this summarizer for German text, but I have been getting issues with sentences being split at abbreviations. For example, I have the sentence Führerschein Kl. B, sowie eigener PKW (wünschenswert) (roughly: "driver's license class B, as well as own car (desirable)") inside a larger body of text, and that sentence makes it into the summarized output, but split at Kl. B: only the B, sowie eigener PKW (wünschenswert) part appears.
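(For illustration, the split is easy to reproduce with a blank spaCy pipeline plus the rule-based sentencizer; whether the summarizer's default handler uses exactly this pipeline is an assumption here.)

```python
# Sketch reproducing the bad split with a blank English pipeline and the
# rule-based sentencizer (spaCy v2 API, matching the version in this thread).
from spacy.lang.en import English

nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))  # spaCy v3: nlp.add_pipe('sentencizer')

doc = nlp('Führerschein Kl. B, sowie eigener PKW (wünschenswert)')
print([sent.text for sent in doc.sents])
# ['Führerschein Kl.', 'B, sowie eigener PKW (wünschenswert)']
```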

My attempt

I tried adding some abbreviations to the Tokenizer (as explained here) to prevent this error from happening. My model and tokenizer are the following:

```python
from transformers import BertModel, BertTokenizer

# Tokens and abbreviations the BERT tokenizer should never split
german_missing_tokens = ['ca.', 'bzw.', 'Du', 'Dein', 'Deinen', '-', 'Kl.']

bertgerman_model = BertModel.from_pretrained('bert-base-german-cased', output_hidden_states=True)
bertgerman_tokenizer = BertTokenizer.from_pretrained('bert-base-german-cased', never_split=german_missing_tokens, do_basic_tokenize=True)
```
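(For reference, a custom model/tokenizer pair like this is handed to the summarizer through its custom_model / custom_tokenizer arguments; a short sketch, where body is a placeholder for the actual German input text:)

```python
from summarizer import Summarizer

# Wire the custom German model and tokenizer into the summarizer.
model = Summarizer(custom_model=bertgerman_model, custom_tokenizer=bertgerman_tokenizer)
summary = model(body, ratio=0.2)  # body: the German text to summarize (placeholder)
```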

I followed through the code and found that the SentenceHandler uses spaCy to extract the sentences, so does adding the abbreviations to the Tokenizer actually achieve anything, or should I do that in the nlp pipeline instead?

dmmiller612 commented 4 years ago

The tokenizer works by automatically splitting sentences based on punctuation. I am guessing spaCy sees the abbreviation and wrongly treats it as the end of a sentence. You will probably need to create a sentence handler for the German language. You could copy what already exists for the current sentence handler and replace English with de_core_news_lg.
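(A handler along those lines might look like the sketch below; the class mirrors the shape of the library's SentenceHandler, but the exact names and signatures here are assumptions, not the library's API.)

```python
from typing import List
import spacy

class GermanSentenceHandler:
    """Sketch of a SentenceHandler-style class backed by a German spaCy model."""

    def __init__(self, model: str = 'de_core_news_lg'):
        # Requires: python -m spacy download de_core_news_lg
        self.nlp = spacy.load(model)

    def process(self, body: str, min_length: int = 40, max_length: int = 600) -> List[str]:
        # Sentence-split with the German pipeline and keep sentences whose
        # length falls inside the bounds, mirroring the English handler's filter.
        doc = self.nlp(body)
        return [s.text.strip() for s in doc.sents
                if min_length < len(s.text.strip()) < max_length]
```

An instance of this would then be passed to the summarizer's sentence_handler argument.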

Magdiel3 commented 4 years ago

Thanks, I tried adding the model, but it was not available for spaCy v2.1.3. In any case, the default sentencizer from de_core_news_lg doesn't handle some cases correctly, so I modified the handler to work with this sentence splitter for German and English, which performed better at sentence segmentation. That works fine, although the summarizing is still not quite what I want, and I'm still figuring out how to get there. The problem is that when I want to use the CoreferenceHandler, a different sentence splitting is used. Should I really push on training the coreference for German, or is there another workaround? And one last question: would it be better or more useful to add this abbreviation handling in spaCy so it can be used elsewhere?
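(On the last question: abbreviations can be registered directly on a spaCy pipeline as tokenizer special cases, so every downstream component sees them as single tokens. A sketch, assuming a loaded German model; note that a statistical parser may still place boundaries on its own, so this mainly helps rule-based splitting:)

```python
import spacy
from spacy.attrs import ORTH

nlp = spacy.load('de_core_news_lg')

# Register each abbreviation as a single indivisible token, so the
# trailing period is no longer seen as a sentence boundary.
for abbrev in ['ca.', 'bzw.', 'Kl.']:
    nlp.tokenizer.add_special_case(abbrev, [{ORTH: abbrev}])
```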