geohci / edit-types

Edit diffs and type detection for Wikipedia
MIT License

Extend media detection #23

Closed: geohci closed this issue 2 years ago

geohci commented 2 years ago

We currently detect edited media only when it appears within the standard wikilink brackets, e.g., [[File:imagename.jpg]]. A fair number of images are added without wikilink brackets, though: images added via infoboxes are often bare filenames given as the value of a parameter (e.g., image = imagename.jpg), and the gallery tag is also used with bare filenames.

Generally I don't want to try to capture too many edge cases like this, but this one is both large and relatively narrow. There are relatively few valid media file extensions on the wikis, they are pretty specific to media (e.g., people don't tend to write .png in wikitext unless they're inserting a file), and bare filenames only show up in two places: nested in templates and in tags. So I think it would be pretty reasonable to add a function that takes any templates / gallery tags that show up in a diff, searches them for media filenames, and augments the results accordingly.
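To make the idea concrete, here is a rough sketch of what such a search could look like. It is not the library's implementation: the function name find_bare_media_filenames, the regexes, and the extension list are all assumptions for illustration (the extension list is only a subset of what the wikis accept). It also assumes the input is the raw wikitext of a single template or gallery tag taken from a diff.

```python
import re
from typing import List

# Illustrative subset of media extensions; the real allow-list is an assumption here.
MEDIA_EXTENSIONS = ("jpg", "jpeg", "png", "gif", "svg", "tif", "tiff", "webp",
                    "ogg", "oga", "ogv", "webm", "wav", "mp3", "flac", "pdf", "djvu")

# A candidate filename is a run of characters that are not wikitext delimiters,
# ending in a known media extension followed by a word boundary.
_FILENAME_RE = re.compile(
    r"[^|=\[\]{}<>\n]+?\.(?:" + "|".join(MEDIA_EXTENSIONS) + r")\b",
    re.IGNORECASE,
)

# Optional "File:" / "Image:" namespace prefix, as often seen in gallery tags.
_NAMESPACE_RE = re.compile(r"^(?:File|Image)\s*:\s*", re.IGNORECASE)


def find_bare_media_filenames(wikitext: str) -> List[str]:
    """Return media filenames found in template / gallery wikitext, in order.

    Hypothetical helper: it does not tell bare filenames apart from ones that
    are already inside [[File:...]] brackets, so callers would need to avoid
    double counting those.
    """
    names = []
    for match in _FILENAME_RE.finditer(wikitext):
        name = _NAMESPACE_RE.sub("", match.group(0).strip())
        if name:
            names.append(name)
    return names


infobox = "{{Infobox person\n| name = Ada\n| image = Ada Lovelace portrait.jpg\n}}"
gallery = "<gallery>\nFile:First.png|A caption\nSecond image.svg|Another\n</gallery>"
print(find_bare_media_filenames(infobox))
print(find_bare_media_filenames(gallery))
```

Matching on the extension rather than on the parameter name (image, logo, etc.) keeps the check independent of any particular infobox, which fits the point that the extensions are rarely written in wikitext for other reasons.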

For more details on potential implementation, see: https://phabricator.wikimedia.org/T299712#7656439

geohci commented 2 years ago

This is now complete on the tree differ side. For it to really work, though, the checks in is_change_in_edit_type and is_edit_type need to be relaxed so that they no longer require a wikilink. Maybe just change them to make sure the length of the parsed_text is >0?

Then the test can be uncommented and should work as expected.
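As a rough illustration of the proposed relaxation, the two helpers below contrast a wikilink-only check with a length-only check. Both function names are hypothetical and neither reproduces the actual bodies of is_change_in_edit_type or is_edit_type; the sketch assumes the tree differ has already labelled the node as media, so the check only needs to confirm that some text was parsed.

```python
import re

# Matches the opening of a [[File:...]] or [[Image:...]] wikilink.
_MEDIA_WIKILINK_RE = re.compile(r"\[\[\s*(?:File|Image)\s*:", re.IGNORECASE)


def passes_strict_media_check(parsed_text: str) -> bool:
    """Hypothetical strict check: media must be wrapped in a wikilink."""
    return bool(_MEDIA_WIKILINK_RE.search(parsed_text))


def passes_relaxed_media_check(parsed_text: str) -> bool:
    """Hypothetical relaxed check: any non-empty parsed text is accepted."""
    return len(parsed_text) > 0


print(passes_strict_media_check("imagename.jpg"))
print(passes_relaxed_media_check("imagename.jpg"))
```

The relaxed version only makes sense if the node type is decided upstream; on its own it would accept any non-empty string.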

EDIT: the gallery tag no longer requires specific changes because it is handled by the tree differ.

geohci commented 2 years ago

Completed!