Hi - I'm not sure what you're trying to do exactly; depending on that, directly importing the `nlp` function may or may not work for you.
However, in this case I can see that you're calling `nlp` with `do_whitespace=False` and `do_tok=False`, so you're telling the `nlp` function that your input string is one giant token. If your input is not yet split into whitespace-based tokens (i.e. separating 'big words', commas, etc.), you should use `do_whitespace=True`. The setting `do_tok=True` will then segment each 'big' word form into its constituent parts (Hebrew articles, prepositions, etc.). If you want tagging, parsing, etc., those should be switched on too. Does that help?
Thanks for your reply. I tried `python3 -m hebpipe -wt -s auto xxx.txt` and the result is still segmented by word:
Elapsed time: 0:00:02.367
========================================
1 אז _ _ _ _ _ _
2 למה _ _ _ _ _ _
3 אתה _ _ _ _ _ _
4 בטוח _ _ _ _ _ _
5-6 הבליץ _ _ _ _ _ _ _ _
5 ה _ _ _ _ _ _
6 בליץ _ _ _ _ _ _
7 אבל _ _ _ _ _ _
8 ראש _ _ _ _ _ _
9 ממשלה _ _ _ _ _ _
10 היה _ _ _ _ _ _
...
Yes, I'm not sure I understand the problem - this output looks correct, and there is even a subword segmentation in 5-6. The text is a little nonsensical to me, so I'm not sure whether there is a semantic problem with the data? Do you mean that there are no sentence splits anywhere? For this fragment that doesn't surprise me, as there is no sentence-final punctuation like a period, "!", "?", etc.
Could you give an example of the output you were expecting?
Thank you. I was expecting the output to be split into sentences even when no punctuation is given. The output of Google ASR contains no punctuation. Do you have any suggestions for achieving sentence segmentation?
@callzhang, the sentence splitter in the master branch is punctuation based, so if your data has no punctuation it will indeed produce one huge sentence.
The dev branch now has an experimental neural splitter which I needed to add for another project anyway, but it is not completely tested yet. It may be able to do a little better, but I'm guessing it will still be pretty bad for your data, since its training data has sentence-final punctuation in well over 95% of cases (the main problem it actually fixes is over-splitting of data with punctuation inside direct speech and other contexts). You can try it with the supplied model, but the best thing to do would be to manually split some of your data, add it to the training data, and retrain the sentence splitter.
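As a toy illustration of why punctuation-free ASR output comes back as one huge sentence (this shows only the general idea of a punctuation-based splitter, not HebPipe's actual code):

```python
import re

def naive_punct_split(text):
    """Toy punctuation-based sentence splitter (not HebPipe's implementation)."""
    # Split only after sentence-final punctuation marks.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

print(naive_punct_split("אז למה אתה בטוח הבליץ אבל ראש ממשלה היה"))
# ['אז למה אתה בטוח הבליץ אבל ראש ממשלה היה']  <- no punctuation, so one big "sentence"
```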
Background: I am trying to build an automated pipeline to segment sentences from the output of the Google Speech-to-Text service.
Issue: The `-s` parameter does not work as expected. See details below. Any suggestions would be much appreciated. (I am new to Hebrew.)
When using `python3 -m hebpipe -s 'auto' xxx.txt`, it is interpreted as "no option specified". Then I used the `nlp` function directly; the output is the whole text:
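The original snippet is not shown above, but going by the first reply, the direct call was presumably along these lines (a reconstruction under the assumption that `nlp` is importable from hebpipe's `heb_pipe` module, not the original code):

```python
# Reconstruction of the problematic call described in the first reply; the
# import path is an assumption about hebpipe's layout.
from hebpipe.heb_pipe import nlp

text = open("xxx.txt", encoding="utf-8").read()

# With both flags off, the whole input string is treated as a single giant token,
# so the output comes back essentially unsegmented.
result = nlp(text, do_whitespace=False, do_tok=False)
print(result)
```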