Closed geohci closed 2 years ago
This almost certainly should be handled by the node-differ. If it were handled by the tree differ by creating a distinct node type, it could add a fair bit of computational overhead (more nodes), or result in missing actual changes to the wikitext if starting/ending/consecutive whitespace was just stripped out from the Text nodes.
Current thinking per our discussion on the different Text subcategories that we'd want to distinguish between:
The content categories only will make sense initially in whitespace-delimited languages but we can see about extending after we've explored the possibilities a bit. The deltas library might be a good place to start: https://github.com/halfak/deltas/blob/master/deltas/tokenizers/text_split.py
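To make the whitespace/punctuation/word split concrete, here is a minimal sketch of a regex tokenizer for whitespace-delimited languages. The category names, the TOKEN_RE pattern, and the tokenize/count_tokens helpers are assumptions for illustration only; this is not how the deltas library or the node-differ actually tokenizes, and it does not attempt sentences or paragraphs.

```python
import re
from collections import Counter

# Hypothetical token pattern (an assumption for illustration): each character
# of the input falls into exactly one of the three categories below.
TOKEN_RE = re.compile(
    r"(?P<Whitespace>\s+)"        # a run of whitespace is one segment
    r"|(?P<Word>\w+)"             # a run of word characters is one word
    r"|(?P<Punctuation>[^\w\s])"  # any other single character
)


def tokenize(text):
    """Yield (category, token) pairs in the order they appear in text."""
    for match in TOKEN_RE.finditer(text):
        yield match.lastgroup, match.group()


def count_tokens(text):
    """Count how often each token occurs, grouped by category."""
    counts = {"Whitespace": Counter(), "Word": Counter(), "Punctuation": Counter()}
    for category, token in tokenize(text):
        counts[category][token] += 1
    return counts
```

Because consecutive whitespace is kept as a single segment, a paragraph break such as "\n\n" is counted once rather than as two newlines.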
Ok, now we've got an initial approach for parsing text and extracting the changes for whitespace, punctuation, words, sentences, and paragraphs. The primary question now is how to summarize the counts of changes. Copying my comment from PR25 over with a bit more thinking:
Something I'm struggling with is how to count changes. For example, say the text starts with "Sentence with three spaces.\n\nNew paragraph and another five spaces." and goes to "Sentence with three spaces. No new paragraph so six spaces."
Whitespace: the counts are {' ':8, '\n\n':1} and {' ':9}. Is this represented as {'Whitespace': {'change':2}} (difference gives {' ':1, '\n\n':-1} whose absolute values sum to 2)? If we tracked whitespace as closely as we do the other nodes though, this would be {'Whitespace': {'change':1}} because it was just one segment of whitespace ('\n\n') becoming another segment (' ').
Words: the difference is {'New':-1, 'and':-1, 'another':-1, 'five':-1, 'No':1, 'new':1, 'so':1, 'six':1}. Is this a change of 8 (sum of absolute values) or a change of 4 (summing up all the deleted words (4) and all the added words (4) and taking the maximum of the two)?
Essentially I see three potential pathways:
Treat "Sentence with three spaces.\n\nNew paragraph and another five spaces." as two Paragraph nodes, each with a nested Sentence node that has Word, Whitespace, and Punctuation nodes in it. This would probably be more exact but also much more complicated, with more overhead that's not clearly worth it (I think it's okay to have a bit more error around how many words were changed vs. how many references were added).
Per our discussion today, we're going to go with option 3: take the maximum changes on either side. So this test case:
def test_change_text_count_english_punctuations():
    curr_text = "Wait for it... awesome! More things to come. Why me?"
    prev_text = "Waits for it... awesome!! More things to come. Why me?"
    # Signed counts per category: -1 = removed from prev_text, 1 = added in curr_text
    expected_changes = {
        'sentence_count': {'Waits for it... awesome': -1, 'Wait for it... awesome': 1},
        'word_count': {'Waits': -1, 'Wait': 1},
        'whitespace_count': {},
        'punctuation_count': {'!': -1},
        'paragraph_count': {
            "Waits for it... awesome!! More things to come. Why me?": -1,
            "Wait for it... awesome! More things to come. Why me?": 1,
        },
    }
    get_text_structure = nd.parse_change_text(prev_text, curr_text, 'Text')
    assert expected_changes == get_text_structure
would have expected changes something like this:
expected_changes = {'Sentence':{'change':1},'Word':{'change':1},'Punctuation':{'change':1},'Paragraph':{'change':1}}
Right now, because the Text edit type is a change, we make all the text subcategories into change too. We can always return to this later if it doesn't seem to make sense.
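For reference, option 3 (take the maximum changes on either side) could be sketched roughly as below. The diff_counts and summarize_change names are hypothetical helpers for illustration, not functions from the node-differ.

```python
from collections import Counter


def diff_counts(prev_counts, curr_counts):
    """Signed difference per token: negative = removed, positive = added."""
    diff = Counter(curr_counts)
    diff.subtract(prev_counts)  # subtract keeps negative values, unlike `-`
    return {token: n for token, n in diff.items() if n != 0}


def summarize_change(diff):
    """Option 3: the larger of the total removals and the total additions."""
    removed = sum(-n for n in diff.values() if n < 0)
    added = sum(n for n in diff.values() if n > 0)
    return max(removed, added)


# Whitespace example from earlier in this thread: one segment became another.
whitespace_diff = diff_counts({' ': 8, '\n\n': 1}, {' ': 9})
print(whitespace_diff)                    # {' ': 1, '\n\n': -1}
print(summarize_change(whitespace_diff))  # 1

# Word example: 4 words removed and 4 added counts as 4, not 8.
word_diff = {'New': -1, 'and': -1, 'another': -1, 'five': -1,
             'No': 1, 'new': 1, 'so': 1, 'six': 1}
print(summarize_change(word_diff))        # 4
```

With this summary, swapping one word for another counts as a single change instead of two, which matches the expected_changes above.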
Nice work @Amamgbu -- the only remaining TODOs from what we discussed today on this issue are:
- Replace the {"Text":{"change":1}} part with something like {"Whitespace":{"change":3}, "Word":{"change":2}, ...} so whitespace/word/punctuation/sentence/paragraph show up in the same way as template, media, etc. Only include the text subcategories if there are changes though, i.e. if there are no whitespace changes, then Whitespace doesn't need to be included.
- Drop the separate parse_text function and just use the change one, potentially passing prev_wikitext="" for inserts or curr_wikitext="" for removals and using that to determine if the result dictionary is e.g., {"Whitespace":{"change":3}, ...} or {"Whitespace":{"insert":3}, ...} etc.
I have made the fix in the most recent PR.
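A rough sketch of the TODOs above, assuming the per-category counts have already been summarized into plain integers. The summarize_text_changes name and its changes_by_category argument are hypothetical and only illustrate the idea; the merged PR may do this differently.

```python
def summarize_text_changes(prev_wikitext, curr_wikitext, changes_by_category):
    """Label per-category counts as insert, remove, or change.

    changes_by_category is assumed to map a category name to an already
    summarized count, e.g. {"Whitespace": 3, "Word": 0}.
    """
    if prev_wikitext == "":
        edit_type = "insert"   # nothing existed before
    elif curr_wikitext == "":
        edit_type = "remove"   # nothing is left afterwards
    else:
        edit_type = "change"
    # Only include subcategories that actually changed.
    return {
        category: {edit_type: count}
        for category, count in changes_by_category.items()
        if count > 0
    }
```

For example, calling it with prev_wikitext="" and counts of {"Whitespace": 1, "Word": 2, "Punctuation": 0} gives {"Whitespace": {"insert": 1}, "Word": {"insert": 2}}, with Punctuation left out.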
95% of the way there! From what we discussed today, the following small fixes came up:
Alright. I have implemented the fix.
Fixed with PR #39! We will open additional issues to extend this to more languages, but the full first version is complete.
Oftentimes when adding a new category or template or image etc., the editor also adds a new-line or space to the wikitext. Currently, this is registered (correctly) as a text change. We probably want to distinguish between pure white-space changes though and actual character/text changes.
Example: https://wiki-topic.toolforge.org/diff-tagging?lang=en&revid=821296470
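As a rough illustration of the distinction being asked for, a pure whitespace change can be detected by comparing the two texts with all whitespace removed. The is_whitespace_only_change helper is hypothetical, and the check is deliberately crude: it would also treat "a b" becoming "ab" as whitespace-only, which a token-level diff would not.

```python
def is_whitespace_only_change(prev_text, curr_text):
    """Return True if the texts differ, but only in their whitespace."""
    if prev_text == curr_text:
        return False  # no change at all
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces drops every space, tab, and newline.
    return "".join(prev_text.split()) == "".join(curr_text.split())


print(is_whitespace_only_change("[[Category:A]]", "[[Category:A]]\n"))  # True
print(is_whitespace_only_change("some text", "other text"))             # False
```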