Heated-Canine commented 4 months ago

Why is the last sentence sometimes (I guess in one out of every seven to ten entries) not shown on the correction-making page?

For example: https://langcorrect.com/journals/%E8%87%AA%E5%B7%B1%E8%A1%A8%E7%8F%BE/

The last sentence "これは今朝の電車で考えていました" is not shown on the correction making page. Could you tell me the reason or mechanism why it is not shown? Or could you tell me how to avoid such a phenomenon on LangCorrect?

Thanks!

Heated-Canine commented 4 months ago

#571 was the same question, I think.

danielzeljko commented 4 months ago

Great question! The short answer is that parsing sentences written by students can be quite challenging, and it's difficult to handle every single edge case. In this case, "これは今朝の電車で考えていました" wasn't included in the list of correctable sentences because it was not terminated with a punctuation mark. Our current sentence parsing algorithm for Japanese relies on specific punctuation marks to identify sentence boundaries.

Screenshot from 2024-07-16 02-59-44

A more detailed answer

When it comes to parsing sentences programmatically, it's hard to handle every edge case. That's why there are specific libraries like NLTK and spaCy to help us process text. However, these libraries are typically trained on properly structured sentences (e.g. text from newspapers). So they assume that the input text follows conventional grammar and punctuation rules, and use that to figure out sentence boundaries.

In the past, we've tried relying solely on a generic sentence parsing algorithm or letting NLP parse it for us. However, both of these have their own limitations and neither was perfect. Why? Because students learning a language write in unpredictable ways.

Imagine a sentence with this kind of structure (a real example we've run into):

I went to school today and I saw Bob (my mom likes Bob....... Bob is a great guy) and invited him to dinner... we will eat pizza.

A simple algorithm might split the sentence at the first terminator ("."):

I went to school today and I saw Bob (my mom likes Bob.

We can try to handle this edge case by considering parentheses ( ):

I went to school today and I saw Bob (my mom likes Bob....... Bob is a great guy) and invited him to dinner.

This still isn't their complete "sentence." But even if we manage to handle this edge case, it falls apart if they use < > instead of ( ). It quickly becomes unmanageable.
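To make the example above concrete, here is a minimal sketch of the two approaches. It is not LangCorrect's implementation; the function names `naive_first_sentence` and `paren_aware_first_sentence` are made up for illustration, and only the example text comes from this thread.

```python
TEXT = (
    "I went to school today and I saw Bob (my mom likes Bob....... "
    "Bob is a great guy) and invited him to dinner... we will eat pizza."
)


def naive_first_sentence(text):
    """Cut at the very first period, ignoring any surrounding structure."""
    end = text.index(".") + 1
    return text[:end]


def paren_aware_first_sentence(text):
    """Cut at the first period that is not inside ( ) parentheses."""
    depth = 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        elif ch == "." and depth == 0:
            return text[: i + 1]
    return text  # no usable terminator found


print(naive_first_sentence(TEXT))
print(paren_aware_first_sentence(TEXT))

# The same text with < > instead of ( ) defeats the parenthesis rule,
# so the "aware" version falls back to the naive result.
ANGLE_TEXT = TEXT.replace("(", "<").replace(")", ">")
print(paren_aware_first_sentence(ANGLE_TEXT))
```

Every new bracket style or punctuation habit would need another special case like the `depth` counter, which is why this approach does not scale.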

Our current solution, which gives decent results, blends NLP with generic sentence splitting. You can see the implementation here: https://github.com/LangCorrect/server/blob/096ed28dfc7539f74d233777b491118458713ba2/langcorrect/posts/utils.py#L57.

Essentially it:

1. relies on NLP to tokenize the text into nodes
2. builds the sentence by adding each text node until it finds one of the terminators (punctuation marks), like 。！？
3. when it runs into a terminator, considers the sentence complete and adds it to the list of sentences
4. after it finishes processing all of the text, returns the sentences it found

The problem with これは今朝の電車で考えていました occurs in step 2. Since it's not terminated, it never gets added to the list of sentences and won't be included in the list of correctable sentences.
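The behaviour described above can be sketched roughly as follows. This is a simplified illustration, not the code in `utils.py`: the `tokenize` function is a stand-in that treats every character as a token (the real code uses an NLP tokenizer), the `keep_unterminated` flag is a hypothetical option, and the first sentence of the sample text is made up.

```python
TERMINATORS = {"。", "！", "？", ".", "!", "?"}


def tokenize(text):
    """Stand-in tokenizer (assumption): one token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


def split_sentences(text, keep_unterminated=False):
    """Accumulate tokens until a terminator is seen, then emit the sentence."""
    sentences = []
    current = ""
    for token in tokenize(text):
        current += token
        if token in TERMINATORS:
            sentences.append(current)
            current = ""
    # Without this flush, trailing text that lacks a terminator is silently dropped.
    if keep_unterminated and current:
        sentences.append(current)
    return sentences


# Only the second sentence comes from the journal entry discussed here.
text = "今日は忙しかった。これは今朝の電車で考えていました"
print(split_sentences(text))
print(split_sentences(text, keep_unterminated=True))
```

With the default settings the unterminated last sentence never reaches the result list. Flushing the leftover buffer at the end is one possible way to keep it, though whether that fits the real splitter is something that would need to be checked against the actual code.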

Now that I think about it, a potential area to look into can be marking sentence boundaries when it encounters a verb.

What are some other examples of users being unpredictable?

Blending both target and native language together into the target language text area (#369, #483)

Luckily, I finished preparing my presentation last night.
プレゼンテーションが上手くいくといいなと思います。

The tokenizer expects the text to be in Japanese and splits the sentences like this:

Luckily,Ifinishedpreparingmypresentationlastnight. プレゼンテーションが上手くいくといいなと思います。

Japanese sentences do not use spaces to separate words so the English text gets squished together.
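A small sketch of why this happens, assuming the tokens are rejoined with no separator, which is fine for Japanese but not for English. The token lists and both function names are hypothetical, and the sketch assumes punctuation stays attached to the preceding word, which may not match the real tokenizer.

```python
def join_tokens(tokens):
    """Rejoin tokens the way that works for Japanese: no separator at all."""
    return "".join(tokens)


def join_mixed(tokens):
    """Insert a space only where two ASCII tokens meet (hypothetical mitigation)."""
    out = ""
    for token in tokens:
        if not token:
            continue
        if out and out[-1].isascii() and token[0].isascii():
            out += " "
        out += token
    return out


english = ["Luckily,", "I", "finished", "preparing", "my", "presentation", "last", "night."]
japanese = ["プレゼンテーション", "が", "上手く", "いく", "と", "いい", "な", "と", "思い", "ます", "。"]

print(join_tokens(english))   # the squished result
print(join_mixed(english))    # spaces restored between ASCII tokens
print(join_mixed(japanese))   # Japanese is still joined without spaces
```

This only illustrates the symptom and one naive workaround; it is not the approach proposed in the linked issues.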

A solution to this is proposed by @kafmws in #409.

AI/ML can most likely handle this more efficiently, but it's outside my domain of expertise.

I feel like I've been blabbering too much, so I'll end it here. I hope this kind of sheds some light!!

Heated-Canine commented 4 months ago

Thank you. Then, why didn't the last sentence appear in the following entry although the sentence was a complete sentence with the final dot ("maru" in Japanese)? Thank you! https://langcorrect.com/journals/%E6%98%BC%E4%BC%91%E3%81%BF/

"腹が立った。" Might it not be recognized because the number of characters is too small (5 characters) in Japanese?

Heated-Canine commented 4 months ago

Oops! Now I notice that "腹が立った" has appeared, and the other two correctors were able to correct (actually not correct) the sentence. Lo and behold, "腹が立った" also appears when I try to edit my corrections, as if I had failed to recognize its presence from the beginning. I think I definitely checked for the last sentence a couple of times before I made the first correction to that entry. I'm not sure what's going on here. I'm not even sure whether I just made a careless mistake. I wonder if I can prove that the last sentence did NOT appear at first.

danielzeljko commented 4 months ago

Assuming that's the case, then it seems to be a timing issue. Here's what most likely happened:

1. The author published their post, which triggered a signal to create sentences when the post was saved
2. You clicked to make corrections
3. The original text did not include a maru for the final sentence
4. While you were correcting their text, the author updated their post
5. You finished making your corrections

Our system does not currently support real-time updates, and is designed to handle static text. So any changes made by the author during your correction process would not have been visible to you at the time, unless you refreshed the page. However, refreshing the page would have caused your corrections to be lost and you'd have to start over (no draft saving feature at the moment). Also, posts can only be edited until a correction is made. The signal will still be triggered, but because the text did not change, there won't be any "updates".

Signal

https://github.com/LangCorrect/server/blob/096ed28dfc7539f74d233777b491118458713ba2/langcorrect/posts/models.py#L159

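from django.db.models.signals import post_save
from django.dispatch import receiver

# Post, sentence_splitter, create_post_rows and update_post_rows
# are defined elsewhere in the repository.


@receiver(post_save, sender=Post)
def split_post_into_sentences(sender, instance, created, **kwargs):
    # Runs after every Post save, whether or not the text changed.
    post = instance
    user = post.user
    post_sentences = sentence_splitter.split_sentences(post.text, post.language.code)
    # The title goes first, followed by the split sentences.
    title_and_sentences = [post.title] + post_sentences

    if created:
        create_post_rows(user, post, title_and_sentences)
    else:
        update_post_rows(user, post, title_and_sentences)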
Where the text becomes uneditable

https://github.com/LangCorrect/server/blob/096ed28dfc7539f74d233777b491118458713ba2/langcorrect/posts/forms.py#L35

class CustomPostForm(forms.ModelForm):
    # ...

    def __init__(self, user, *args, is_convert_prompt=False, **kwargs):
        # ...

        if self.instance.is_corrected:
            self.fields["text"].disabled = True

Heated-Canine commented 4 months ago

Thank you. Your explanation makes sense! Unfortunately, I don't remember if I added the last dot or "maru" when I made corrections for that entry. I'm now asking the writer of that entry whether they edited the last sentence just after I started making corrections.

Heated-Canine commented 4 months ago

The last sentence or the last few sentences of my entries written in English sometimes didn't appear on the correction-making page. However, I'm very careful to put the "dot" at the end of my sentences, except after "Lol." So I'm curious what's going on in my cases. I can't find those entries right now, so I'll ask here again when the same kind of error happens in a future entry. Thanks!

Heated-Canine commented 4 months ago

They say they didn't edit their entry at all at that time, which may bring us back to the conundrum: What was going on there?

danielzeljko commented 4 months ago

> Unfortunately, I don't remember if I added the last dot or "maru" when I made corrections for that entry.

I'm starting to get confused.

Screenshot from 2024-07-16 21-48-15

You corrected a different sentence and, within this correction, you added a correction for the "missing" sentence, right? In this case, why would it matter if you added the last dot? This has nothing to do with sentence splitting.

I looked into the timestamps and nothing immediately stands out that would explain why the last sentence would be "missing" for just you during the time of your corrections. Interestingly, a different user corrected this post earlier than you, and they were able to "see" and correct the last sentence.

Post

Created: July 16, 2024, 20:08:26 UTC
Modified: July 16, 2024, 23:39:42 UTC

Post Rows ("sentences")

Created: July 16, 2024, 20:08:27 UTC
Modified: July 16, 2024, 23:39:42 UTC

Corrections

heatedcanine

Created: July 16, 2024, 22:08:33 UTC
Modified: July 16, 2024, 22:08:33 UTC

wanderer

Created: July 16, 2024, 20:35:21 UTC
Modified: July 16, 2024, 20:35:21 UTC

kikokun

Created: July 16, 2024, 23:39:42 UTC
Modified: July 16, 2024, 23:39:42 UTC

The post's and post rows' modified timestamps match the last corrector's (kikokun) because submitting corrections updates the post's correction status, and that update triggers the sentence-splitting signal.
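As a quick sanity check on the timestamps listed above, here is a small sketch that sorts the corrections by creation time. The values are copied from this comment; the helper name `utc` and the variable names are just for illustration.

```python
from datetime import datetime, timezone


def utc(stamp):
    """Parse a 'YYYY-MM-DD HH:MM:SS' string as a UTC datetime."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


post_created = utc("2024-07-16 20:08:26")
post_modified = utc("2024-07-16 23:39:42")

# Creation time of each corrector's corrections, as listed above.
corrections = {
    "heatedcanine": utc("2024-07-16 22:08:33"),
    "wanderer": utc("2024-07-16 20:35:21"),
    "kikokun": utc("2024-07-16 23:39:42"),
}

# Sort the correctors by when their corrections were created.
order = sorted(corrections, key=corrections.get)
print(order)

# The post's modified time matches the last correction's creation time.
print(post_modified == corrections[order[-1]])
```

Sorting this way puts wanderer first, heatedcanine second, and kikokun last, and the post's modified time lines up with kikokun's submission.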

Before I start investigating, is there a chance that we just didn't notice the last sentence? I only ask this because the other two correctors have corrected the "missing" sentence, and from our earlier discussions, we know that posts become uneditable once a correction is made.

So either:

This sentence was present and we just didn't notice it while making our corrections, or
During the time of your corrections, the last sentence was not shown for some reason (is_actual=False), and after you made your corrections, the flag switched back to is_actual=True

Notes for myself:

Look into logic that retrieves the sentences
Write some tests to see if I can spot the bug

Heated-Canine commented 4 months ago

Thank you. Your analysis revealed that my memory was not that accurate. I was the second correction maker, not the first one, right?

According to your analysis, I think the most probable explanation for this episode is that I was just careless and missed the existence of the last sentence as "human error."

For example, say you lost your phone and looked for it for a long time. In the end, you noticed that it was on the desk, where it had been from the beginning. You had looked at that desk over and over, but you couldn't recognise it. Somehow, your brain couldn't see that the phone was on the desk.

The same phenomenon happened here.

Yet, I'm curious about your second scenario, and I'll be waiting for your interesting research. Thanks!

danielzeljko commented 4 months ago

It's my pleasure. Thanks for being curious and asking questions.

Yes, according to the database timestamps, you were the second corrector.

I definitely know that feeling. It happens. I've personally spent time looking for my TV remote only to notice that it was in my hands the entire time. The worst is looking for my glasses while wearing them 🤓.

Given the timestamps and how the other correctors corrected the sentence in question, I think it's best to revisit this if the issue happens again.

I'll look into creating a better solution for splitting Japanese text. It'll take some time, but bear with me.

LangCorrect / server

Why Last Sentence Might Not Be Shown on Correction Making Page? #578
