geohci / edit-types

Edit diffs and type detection for Wikipedia
MIT License

Expand Text Subcategories -- e.g., whitespace, punctuation, content #4

Closed by geohci 2 years ago

geohci commented 2 years ago

Often when adding a new category, template, image, etc., the editor also adds a newline or space to the wikitext. Currently, this is registered (correctly) as a text change. We probably want to distinguish pure whitespace changes from actual character/text changes, though.
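As a rough illustration of the distinction being asked for, here is a minimal sketch of a whitespace-only check. The helper name `is_whitespace_only_change` is hypothetical and is not part of edit-types; it only shows the idea of comparing two strings with all whitespace removed.

```python
def is_whitespace_only_change(prev_text: str, curr_text: str) -> bool:
    """Return True when the two strings differ, but only in whitespace.

    Hypothetical helper for illustration; not the library's API.
    """
    if prev_text == curr_text:
        return False  # nothing changed at all
    # str.split() with no arguments splits on any run of whitespace and
    # drops empty strings, so joining the pieces removes all whitespace.
    return "".join(prev_text.split()) == "".join(curr_text.split())


# An editor adds a category and a trailing newline: the Text node itself
# only gained whitespace.
print(is_whitespace_only_change("Some text.", "Some text.\n"))
# A real character change is not whitespace-only.
print(is_whitespace_only_change("Some text.", "Some text!"))
```

Note that this treats any added or removed whitespace the same way, including a space removed between two words, which may or may not be the desired behavior.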

Example: https://wiki-topic.toolforge.org/diff-tagging?lang=en&revid=821296470

geohci commented 2 years ago

This almost certainly should be handled by the node differ. Handling it in the tree differ by creating a distinct node type would either add potentially a fair bit of computational overhead (more nodes) or, if starting/ending/consecutive whitespace were simply stripped out of the Text nodes, result in missing actual changes to the wikitext.

geohci commented 2 years ago

Current thinking per our discussion on the different Text subcategories that we'd want to distinguish between:

The content categories will initially only make sense in whitespace-delimited languages, but we can look at extending them after we've explored the possibilities a bit. The deltas library might be a good place to start: https://github.com/halfak/deltas/blob/master/deltas/tokenizers/text_split.py
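To make the idea concrete, here is a minimal regex-based sketch of splitting text into whitespace, word, and punctuation tokens. It does not use the deltas library mentioned above; the pattern and the names `TOKEN_RE`, `tokenize`, and `count_token_types` are assumptions for illustration, and the approach only suits whitespace-delimited languages.

```python
import re
from collections import Counter

# Hypothetical token pattern. The three alternatives are mutually exclusive
# and together cover every character, so no text is dropped.
TOKEN_RE = re.compile(
    r"(?P<whitespace>\s+)"        # runs of spaces, tabs, newlines
    r"|(?P<word>\w+)"             # letters, digits, underscore
    r"|(?P<punctuation>[^\w\s])"  # any single other character
)


def tokenize(text):
    """Return a list of (token_type, token) pairs in document order."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


def count_token_types(text):
    """Count how many tokens of each type the text contains."""
    return Counter(kind for kind, _ in tokenize(text))


print(tokenize("Hi, you!"))
print(count_token_types("Hi, you!"))
```

Because `\w` also matches digits and underscores, numbers are counted as words here; a real tokenizer would likely want finer-grained rules.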

geohci commented 2 years ago

OK, now we've got an initial approach for parsing text and extracting the changes for whitespace, punctuation, words, sentences, and paragraphs. The primary question now is how to summarize the counts of changes. Copying my comment from PR #25 over with a bit more thinking:

Something I'm struggling with is how to count changes. For example, say the text starts as "Sentence with three spaces.\n\nNew paragraph and another five spaces." and goes to "Sentence with three spaces. No new paragraph so six spaces."
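To see why the counting is awkward, the example can be worked through with a simple signed diff of item counts. This is only a sketch under assumptions: `signed_diff` is a hypothetical helper, paragraphs are split naively on a blank line, and whitespace tokens are runs matched by `\s+`.

```python
import re
from collections import Counter


def signed_diff(prev_items, curr_items):
    """Map each item to (count in curr) - (count in prev), omitting zeros."""
    prev_counts, curr_counts = Counter(prev_items), Counter(curr_items)
    diff = {}
    for item in prev_counts.keys() | curr_counts.keys():
        delta = curr_counts[item] - prev_counts[item]
        if delta:
            diff[item] = delta
    return diff


prev_text = "Sentence with three spaces.\n\nNew paragraph and another five spaces."
curr_text = "Sentence with three spaces. No new paragraph so six spaces."

# Paragraphs: two are removed and one is inserted, although the editor
# arguably made a single change by merging them.
paragraph_diff = signed_diff(prev_text.split("\n\n"), curr_text.split("\n\n"))

# Whitespace: the paragraph break disappears and one extra space appears.
whitespace_diff = signed_diff(
    re.findall(r"\s+", prev_text), re.findall(r"\s+", curr_text)
)

print(paragraph_diff)
print(whitespace_diff)
```

The paragraph diff has two removals and one insertion, so the summary could plausibly be reported as one, two, or three paragraph changes depending on the counting rule chosen.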

Essentially I see three potential pathways:

geohci commented 2 years ago

Per our discussion today, we're going to go with option 3: take the maximum changes on either side. So this test case:

def test_change_text_count_english_punctuations():
    # `nd` is presumably the node-differ module under test; its import is not shown in this comment.
    curr_text = "Wait for it... awesome! More things to come. Why me?"
    prev_text = "Waits for it... awesome!! More things to come. Why me?"
    expected_changes = {
        'sentence_count': {'Waits for it... awesome': -1, 'Wait for it... awesome': 1},
        'word_count': {'Waits': -1, 'Wait': 1},
        'whitespace_count': {},
        'punctuation_count': {'!': -1},
        'paragraph_count': {
            "Waits for it... awesome!! More things to come. Why me?": -1,
            "Wait for it... awesome! More things to come. Why me?": 1,
        },
    }
    get_text_structure = nd.parse_change_text(prev_text, curr_text, 'Text')
    assert expected_changes == get_text_structure

would have expected changes something like this:

expected_changes = {'Sentence': {'change': 1}, 'Word': {'change': 1}, 'Punctuation': {'change': 1}, 'Paragraph': {'change': 1}}

Right now, because the Text edit type is a change, we mark all of the text subcategories as changes too. We can always revisit this later if it doesn't seem to make sense.
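One way to read "take the maximum changes on either side" is: for each subcategory, total the removals and the insertions separately and report the larger of the two under the parent edit type. The sketch below implements that reading; `summarize_changes` and `CATEGORY_NAMES` are hypothetical names, not the library's API, and empty subcategories are assumed to be omitted from the summary.

```python
# Hypothetical mapping from the detailed keys to the summary labels.
CATEGORY_NAMES = {
    "sentence_count": "Sentence",
    "word_count": "Word",
    "whitespace_count": "Whitespace",
    "punctuation_count": "Punctuation",
    "paragraph_count": "Paragraph",
}


def summarize_changes(signed_counts, edit_type="change"):
    """Collapse per-item signed counts into one number per subcategory."""
    summary = {}
    for key, items in signed_counts.items():
        removed = sum(-n for n in items.values() if n < 0)
        inserted = sum(n for n in items.values() if n > 0)
        total = max(removed, inserted)  # option 3: the larger side wins
        if total:  # assumption: subcategories with no changes are left out
            summary[CATEGORY_NAMES[key]] = {edit_type: total}
    return summary


detailed = {
    "sentence_count": {"Waits for it... awesome": -1, "Wait for it... awesome": 1},
    "word_count": {"Waits": -1, "Wait": 1},
    "whitespace_count": {},
    "punctuation_count": {"!": -1},
    "paragraph_count": {
        "Waits for it... awesome!! More things to come. Why me?": -1,
        "Wait for it... awesome! More things to come. Why me?": 1,
    },
}
print(summarize_changes(detailed))
```

Applied to the detailed counts from the test case, this yields one change each for Sentence, Word, Punctuation, and Paragraph, matching the summary proposed in the comment.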

geohci commented 2 years ago

Nice work @Amamgbu -- the only remaining TODOs from what we discussed today on this issue are:

Amamgbu commented 2 years ago

I have made the fix in the most recent PR

geohci commented 2 years ago

95% of the way there! From what we discussed today, the following small fixes came up:

Amamgbu commented 2 years ago

Alright. I have implemented the fix.

geohci commented 2 years ago

Fixed with PR #39! We will open additional issues to extend this to more languages, but the full first version is complete.