Open Heated-Canine opened 4 months ago
Great question! The short answer is that parsing sentences written by students can be quite challenging, and it's difficult to handle every single edge case. In this case, "これは今朝の電車で考えていました" wasn't included in the list of correctable sentences because it was not terminated with a punctuation mark. Our current sentence parsing algorithm for Japanese relies on specific punctuation marks to identify sentence boundaries.
A more detailed answer
When it comes to parsing sentences programmatically, it's hard to handle every edge case. That's why there are specialized libraries like NLTK and spaCy to help us process text. However, these libraries are typically trained on properly structured sentences (e.g. text from newspapers). So they assume that the input text will follow conventional grammar and punctuation rules, and use that to figure out sentence boundaries.
In the past, we've tried relying solely on a generic sentence parsing algorithm or letting NLP parse it for us. However, both of these have their own limitations and neither was perfect. Why? Because students learning a language write in unpredictable ways.
Imagine a sentence with this kind of structure (a real example we've run into):
I went to school today and I saw Bob (my mom likes Bob....... Bob is a great guy) and invited him to dinner... we will eat pizza.
A simple algorithm might split the sentence by a terminator (.):
I went to school today and I saw Bob (my mom likes Bob.
We can try to handle this edge case by considering parentheses ( ):
I went to school today and I saw Bob (my mom likes Bob....... Bob is a great guy) and invited him to dinner.
This still isn't their complete "sentence." But even if we manage to handle this edge case, it falls apart if they use < > instead of ( ). It quickly becomes unmanageable.
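To make the edge case above concrete, here is a minimal sketch of both approaches. It is not the LangCorrect implementation; the function names and the `pairs` parameter are made up for illustration, and the naive splitter keeps a whole run of dots together rather than cutting after the first one.

```python
import re

TEXT = (
    "I went to school today and I saw Bob (my mom likes Bob....... "
    "Bob is a great guy) and invited him to dinner... we will eat pizza."
)


def split_naive(text):
    """Split after every run of '.', '!' or '?' (hypothetical helper)."""
    parts = re.findall(r"[^.!?]+[.!?]+|[^.!?]+$", text)
    return [p.strip() for p in parts if p.strip()]


def split_paren_aware(text, pairs=(("(", ")"),)):
    """Like split_naive, but ignore terminators inside the given bracket pairs."""
    openers = {o for o, _ in pairs}
    closers = {c for _, c in pairs}
    sentences, current, depth = [], [], 0
    i = 0
    while i < len(text):
        ch = text[i]
        current.append(ch)
        if ch in openers:
            depth += 1
        elif ch in closers and depth > 0:
            depth -= 1
        elif ch in ".!?" and depth == 0:
            # Swallow the rest of a run such as "..." before closing the sentence.
            while i + 1 < len(text) and text[i + 1] in ".!?":
                i += 1
                current.append(text[i])
            sentences.append("".join(current).strip())
            current = []
        i += 1
    tail = "".join(current).strip()
    if tail:
        sentences.append(tail)
    return sentences


print(len(split_naive(TEXT)))        # the parenthetical is cut in the middle
print(len(split_paren_aware(TEXT)))  # the parenthetical stays in one piece
```

Swapping the parentheses for < > makes `split_paren_aware` behave like the naive version again unless the new pair is passed in explicitly, which is exactly how the special cases start piling up.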
Our current solution, which gives decent results, blends NLP with generic sentence splitting. You can see the implementation here: https://github.com/LangCorrect/server/blob/096ed28dfc7539f74d233777b491118458713ba2/langcorrect/posts/utils.py#L57.
Essentially it:
。!?
The problem with これは今朝の電車で考えていました occurs in step 2. Since it's not terminated, it never gets added to the list of sentences and won't be included in the list of correctable sentences.
Now that I think about it, a potential area to look into can be marking sentence boundaries when it encounters a verb.
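The dropped trailing sentence described above can be reproduced with a small sketch. This is not the code from the linked utils.py; the terminator set, the function names, and the first sentence of `SAMPLE` are assumptions for illustration. The second function shows one possible mitigation: keeping whatever text remains after the last terminator.

```python
import re

# Assumed terminator set: the Japanese full stop plus full-width and ASCII ! and ?.
TERMINATORS = "。!?!?"
_CLASS = re.escape(TERMINATORS)
TERMINATED = re.compile(f"[^{_CLASS}]+[{_CLASS}]+")

# The first sentence is invented; the second is the unterminated one from the question.
SAMPLE = "今日は忙しかったです。これは今朝の電車で考えていました"


def split_terminated_only(text):
    """Keep only chunks that end with a terminator; an unterminated tail is lost."""
    return [m.strip() for m in TERMINATED.findall(text)]


def split_keep_tail(text):
    """Same split, but append any leftover text after the last terminator."""
    sentences, last_end = [], 0
    for match in TERMINATED.finditer(text):
        sentences.append(match.group().strip())
        last_end = match.end()
    tail = text[last_end:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_terminated_only(SAMPLE))  # the last sentence is missing
print(split_keep_tail(SAMPLE))        # the last sentence is kept
```

Until something like this is in place, ending the final sentence with 。 should be enough to make it show up.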
What are some other examples of users being unpredictable?
Blending both target and native language together into the target language text area (#369, #483)
Luckily, I finished preparing my presentation last night.
プレゼンテーションが上手くいくといいなと思います。
The tokenizer expects the text to be in Japanese and splits the sentence like this:
Luckily,Ifinishedpreparingmypresentationlastnight. プレゼンテーションが上手くいくといいなと思います。
Japanese sentences do not use spaces to separate words so the English text gets squished together.
A solution to this is proposed by @kafmws in #409.
AI/ML can most likely handle this more efficiently, but it's outside my domain of expertise.
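For illustration only, here is one simple, non-ML way to keep the two languages apart before tokenizing. This is not necessarily what #409 proposes; it assumes each line is written in a single language, treats any line without kana or kanji as English, and all names are hypothetical.

```python
import re

# Hiragana, Katakana, and CJK Unified Ideographs.
JAPANESE_CHARS = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")

MIXED = (
    "Luckily, I finished preparing my presentation last night.\n"
    "プレゼンテーションが上手くいくといいなと思います。"
)


def guess_line_language(line, target="ja", fallback="en"):
    """Return the target code if the line contains Japanese characters."""
    return target if JAPANESE_CHARS.search(line) else fallback


def group_lines_by_language(text):
    """Group consecutive lines that share the same guessed language."""
    groups = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        lang = guess_line_language(line)
        if groups and groups[-1][0] == lang:
            groups[-1][1].append(line)
        else:
            groups.append((lang, [line]))
    return [(lang, "\n".join(lines)) for lang, lines in groups]


for lang, chunk in group_lines_by_language(MIXED):
    print(lang, chunk)
```

Each chunk could then be handed to the splitter for its own language, so the English line keeps its spaces. Lines that mix both languages would still need something smarter.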
I feel like I've been blabbering too much, so I'll end it here. I hope this kind of sheds some light!!
Thank you. Then, why didn't the last sentence appear in the following entry although the sentence was a complete sentence with the final dot ("maru" in Japanese)? Thank you! https://langcorrect.com/journals/%E6%98%BC%E4%BC%91%E3%81%BF/
"腹が立った。" Might it not be recognized because the number of the letters are too small (5 letters) in Japanese?
Oops! Now I notice that "腹が立った" has appeared, and the other two correctors were able to correct (actually, not correct) the sentence. Lo and behold, "腹が立った" also appeared when I tried to edit my corrections, as if I had failed to recognize its presence from the beginning. I think I definitely checked for the last sentence a couple of times before I made the first correction to that entry. I'm not sure what's going on here. I'm not even sure whether I just made a careless mistake. I wonder if I can prove that the last sentence did NOT appear at first.
Assuming that's the case, then it seems to be a timing issue. Here's what most likely happened:
Our system does not currently support real-time updates, and is designed to handle static text. So any changes made by the author during your correction process would not have been visible to you at the time, unless you refreshed the page. However, refreshing the page would have caused your corrections to be lost and you'd have to start over (no draft saving feature at the moment). Also, posts can only be edited until a correction is made. The signal will still be triggered, but because the text did not change, there won't be any "updates".
Signal
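from django.db.models.signals import post_save
from django.dispatch import receiver

# Post, sentence_splitter, and the *_post_rows helpers come from elsewhere in the project.


@receiver(post_save, sender=Post)
def split_post_into_sentences(sender, instance, created, **kwargs):
    """Re-split a post into sentences every time it is saved."""
    post = instance
    user = post.user
    post_sentences = sentence_splitter.split_sentences(post.text, post.language.code)
    # The title is handled alongside the body sentences.
    title_and_sentences = [post.title] + post_sentences

    if created:
        create_post_rows(user, post, title_and_sentences)
    else:
        update_post_rows(user, post, title_and_sentences)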
Where the text becomes uneditable
from django import forms


class CustomPostForm(forms.ModelForm):
    # ...
    def __init__(self, user, *args, is_convert_prompt=False, **kwargs):
        # ...
        # Lock the text field once the post has been corrected.
        if self.instance.is_corrected:
            self.fields["text"].disabled = True
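To illustrate why a re-triggered signal produces no visible "updates" when the text is unchanged, here is a framework-free sketch. `changed_rows` is a hypothetical stand-in and not the real `update_post_rows`; it assumes the update step compares the stored sentences with the freshly split ones, position by position.

```python
def changed_rows(existing_rows, new_sentences):
    """List (index, old, new) for every row whose text would change."""
    changes = []
    for i in range(max(len(existing_rows), len(new_sentences))):
        old = existing_rows[i] if i < len(existing_rows) else None
        new = new_sentences[i] if i < len(new_sentences) else None
        if old != new:
            changes.append((i, old, new))
    return changes


rows = ["Title", "First sentence.", "Second sentence."]

# A correction is submitted: the post is saved again, but its text is identical.
print(changed_rows(rows, list(rows)))  # → []

# The author edits the post before any correction exists: one row is added.
print(changed_rows(rows, rows + ["Third sentence."]))
```

Under that assumption, saving a post whose text has not changed leaves the sentence content as it was, even though the save itself still happens.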
Thank you. Your explanation makes sense! Unfortunately, I don't remember if I added the last dot or "maru" when I made corrections for that entry. I'm now asking the writer of that entry whether they edited the last sentence just after I started making corrections.
The last sentence or the last few sentences of my entries written in English sometimes didn't appear on the correction-making site. However, I'm very careful to write the "dot" at the end of my sentences, except for "Lol." So I'm curious about what's going on in my cases. I can't find those entries now, so I'll ask here again when the same kind of error happens in a future entry. Thanks!
They say they didn't edit their entry at all at that time, which may bring us back to the conundrum: What was going on there?
Unfortunately, I don't remember if I added the last dot or "maru" when I made corrections for that entry.
I'm starting to get confused.
You corrected a different sentence and, within this correction, you added a correction for the "missing" sentence, right? In this case, why would it matter if you added the last dot? This has nothing to do with sentence splitting.
I looked into the timestamps, and nothing immediately stands out to me that would explain why the last sentence would be "missing" for just you during the time of your corrections. Interestingly, a different user corrected this post earlier than you, and they were able to "see" and correct the last sentence.
Timestamps compared: Post; Post Rows ("sentences"); Corrections (heatedcanine, wanderer, kikokun)
The reason why the post/post rows' modified date is the same as the last corrector's (kikokun) is because when the corrections are submitted, the post correction status gets updated. And when this status gets updated, it triggers the sentence splitting signal.
Before I start investigating, is there a chance that we just didn't notice the last sentence? I only ask this because the other two correctors have corrected the "missing" sentence, and from our earlier discussions, we know that posts become uneditable once a correction is made.
So either:
Notes for myself:
Thank you. Your analysis revealed that my memory was not that accurate. I was the second correction maker, not the first one, right?
According to your analysis, I think the most probable explanation for this episode is that I was just careless and missed the existence of the last sentence as "human error."
For example, you lost your phone, and you were looking for it for a long time. In the end, you noticed that your phone was on the desk where it should be located from the beginning. You had tried to look at that desk over and over, but you couldn't recognise it. Somehow, your brain couldn't see there was the phone on the desk.
The same phenomenon happened here.
Yet, I'm curious about your second scenario, and I'll be waiting for your interesting research. Thanks!
It's my pleasure. Thanks for being curious and asking questions.
Yes, according to the database timestamps, you were the second corrector.
I definitely know that feeling. It happens. I've personally spent time looking for my TV remote only to notice that it was in my hands the entire time. The worst is looking for my glasses while wearing them 🤓.
Given the timestamps and how the other correctors corrected the sentence in question, I think it's best to revisit this if the issue happens again.
I'll look into creating a better solution for splitting Japanese text. It'll take some time, but bear with me.
Why is the last sentence sometimes (I guess in one of every seven to ten entries) not shown on the correction-making page?
For example: https://langcorrect.com/journals/%E8%87%AA%E5%B7%B1%E8%A1%A8%E7%8F%BE/
The last sentence "これは今朝の電車で考えていました" is not shown on the correction making page. Could you tell me the reason or mechanism why it is not shown? Or could you tell me how to avoid such a phenomenon on LangCorrect?
Thanks!