amir-zeldes / HebPipe

An NLP pipeline for Hebrew

[Help needed] Sentence segmentation does not segment #14

Closed callzhang closed 2 years ago

callzhang commented 3 years ago

Background: I am trying to build an automated pipeline that segments sentences from the output of the Google Speech-to-Text service. Issue: the -s parameter does not work as expected; see details below. Any suggestions would be much appreciated (I am new to Hebrew).

When using python3 -m hebpipe -s 'auto' xxx.txt, it is interpreted as "no option specified":

! You selected no processing options
! Assuming you want all processing steps

Running tasks:
====================
o Automatic sentence splitting
o Whitespace tokenization
o Morphological segmentation
o POS tagging
o Lemmatization
o Morphological analysis
o Dependency parsing
o Entity recognition
o Coreference resolution

Processing 1.txt

Finished processing 1 file

/opt/homebrew/opt/python@3.9/bin/python3.9: No module named hebpipe.__main__; 'hebpipe' is a package and cannot be directly executed
Elapsed time: 0:00:11.327
========================================
1       אז      אז      ADV     ADV     _       4       advmod  _       _
2       למה     למה     ADV     ADV     PronType=Int    4       advmod  _       _
3       אתה     הוא     PRON    PRON    Gender=Masc|Number=Sing|Person=2|PronType=Prs   4       obj     _       (2-person)
4       בטוח    בטוח    VERB    VERB    HebBinyan=PAAL|Voice=Act        0       root    _       _
...

Then I used the nlp function directly:

from hebpipe.heb_pipe import nlp
import io, sys
path = '/Users/Derek/Downloads/audio_sample/יאיר לפיד שובר שתיקה באולפן(1).mp3.txt'
input_text = io.open(path,encoding="utf8").read()
processed = nlp(input_text, do_whitespace=False, do_tok=False, do_tag=False, do_lemma=False, do_parse=True, do_entity=False, sent_tag='auto')
sys.stdout.buffer.write(processed.encode("utf8"))

The output is the whole text as a single token:

1       ממשלה חשוכה גזענית הומופובי שחקנית ממשלה שתיקח כסף של אנשים עובדים ותיתן אותו לאנשים שלא עובדים וכל זה מתנקז עכשיו למדתה 61 מורן צדקה קודם יש תקווה יש אפשרות לשינוי השאלה איך הצביע עמדת ה-60 יש הטוענים שזו הייתה ועכשיו הנוכחות שלך באולפנים ממש תעשה בשלושה ימים האחרונים בעצם אתה מודה שזה גם נגמר בקול ענות חלושה כי שואלים עצמם כולם מה הוא חשב שהוא ירים את כל האולפנים ואז יגיע להיות מועמד ראש בראש מול נתניהו וזה יעבור חלק או שמראש לא הייתה כוונה כזאת תגיד לנו אותה טיפה מוזר לאולפן ולהתראיין על זה לא היתה אני אנחנו הוא היה ניסיון של נתניהו להפוך את זה למאבק ראש בראש ביני לבינו לניסיון זה המטרה אתה לא עונה לו     _       _       _       _       _       _
amir-zeldes commented 3 years ago

Hi - I'm not sure what you're trying to do exactly; depending on that, directly importing the nlp function may or may not work for you.

However, in this case I can see that you're calling nlp with do_whitespace=False and do_tok=False, so you're telling the nlp function that your input string is one giant token. If your input is not yet split into whitespace-based tokens (i.e. separated into 'big words', commas, etc.), you should use do_whitespace=True. The setting do_tok=True will then segment each 'big' word form into its constituent parts (Hebrew articles, prepositions, etc.). If you want tagging, parsing, etc., those should be switched on too. Does that help?
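In case it helps, here is a sketch of the corrected call, reusing the parameter names from your snippet above (the path is a placeholder, and exact defaults may differ):

from hebpipe.heb_pipe import nlp
import io, sys

path = 'transcript.txt'  # placeholder for your ASR output file
input_text = io.open(path, encoding="utf8").read()
# Switch whitespace tokenization and subword segmentation on, so the
# input string is not treated as one giant token:
processed = nlp(input_text,
                do_whitespace=True,  # split the raw string into whitespace tokens
                do_tok=True,         # segment tokens into articles, prepositions, etc.
                do_tag=True,
                do_lemma=True,
                do_parse=True,
                do_entity=False,
                sent_tag='auto')     # automatic sentence splitting
sys.stdout.buffer.write(processed.encode("utf8"))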

callzhang commented 3 years ago

Thanks for your reply. I tried python3 -m hebpipe -wt -s auto xxx.txt and the result is still only segmented by word, with no sentence splits:

Elapsed time: 0:00:02.367
========================================
1       אז      _       _       _       _       _       _
2       למה     _       _       _       _       _       _
3       אתה     _       _       _       _       _       _
4       בטוח    _       _       _       _       _       _
5-6     הבליץ   _       _       _       _       _       _       _       _
5       ה       _       _       _       _       _       _
6       בליץ    _       _       _       _       _       _
7       אבל     _       _       _       _       _       _
8       ראש     _       _       _       _       _       _
9       ממשלה   _       _       _       _       _       _
10      היה     _       _       _       _       _       _
...
amir-zeldes commented 3 years ago

Yes, I'm not sure I understand the problem - this output looks correct; there is even a subword segmentation in tokens 5-6. The text reads a little nonsensically to me, so I'm not sure whether there is a semantic problem with the data. Do you mean that there are no sentence splits anywhere? For this fragment that doesn't surprise me, as there is no sentence-final punctuation like a period, "!", "?", etc.

Could you give an example of the output you were expecting?

callzhang commented 3 years ago

Thank you. I was expecting the output to be split into sentences even when no punctuation is given; the output of Google ASR contains no punctuation. Do you have any suggestions for achieving sentence segmentation?

amir-zeldes commented 3 years ago

@callzhang, the sentence splitter in the master branch is punctuation-based, so if your data has no punctuation it will indeed produce one huge sentence.
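For intuition, here is a toy illustration (not HebPipe's actual code) of what a purely punctuation-based splitter does; without sentence-final punctuation, everything stays in one sentence:

import re

# Toy punctuation-based splitter: break after sentence-final . ! ?
def naive_split(text):
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

print(naive_split('שלום. מה שלומך?'))  # two sentences
print(naive_split('אז למה אתה בטוח'))  # no punctuation -> one big "sentence"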

The dev branch now has an experimental neural splitter, which I needed to add for another project anyway, but it is not completely tested yet. It may do a little better, but I'm guessing it will still be pretty bad for your data, since its training data has sentence-final punctuation in well over 95% of cases (the main problem it actually fixes is over-splitting of data with punctuation inside direct speech and other contexts). You can try it with the supplied model, but the best thing to do would be to manually split some of your data, add it to the training data, and retrain the sentence splitter.