dbpedia / mappings-tracker

This project is used for tracking mapping issues in mappings.dbpedia.org

warn on attempt to create mapping that's on the ignore list #39

Open VladimirAlexiev opened 9 years ago

VladimirAlexiev commented 9 years ago

As explained in this comment: "ignorelist_*.txt is used in the mapping statistics, to exclude templates and properties that ought not to be mapped. Not adding a template like this to the ignore list is tantamount to inviting future mappers to re-add junk mappings"

People need no invitation to make foolish mappings. They need active discouragement. So please:

jimregan commented 9 years ago

I think it might be better to have two extra templates on the mapping wiki, {{IgnoreTemplate}} and {{IgnoreProperty}}, and have the ignore lists generated by the extraction framework. Jumping from place to place is a bit of a pain, and it would make it explicit, rather than having to delete the mappings.

VladimirAlexiev commented 9 years ago

@alexandrutodor what do you think?

alexandrutodor commented 9 years ago

As a quick response: I think we could keep the ignore lists in the wiki, automatically create articles for all of them with the extra templates proposed by Jim, and block those articles from being edited by other people. If users can't create those mappings, then there's no problem and no need for warnings. The question is: is it such a big problem? What priority should this have, and how often does it happen?

jcsahnwaldt commented 9 years ago

@jimregan @alexandrutodor {{IgnoreTemplate}} and {{IgnoreProperty}} sound like a good idea!

With these templates, we wouldn't even need the ignore list files anymore. Here's how it could be implemented: When the server (the Scala code serving everything under http://mappings.dbpedia.org/server/) starts, it should fetch all ignored templates and properties from the mappings wiki and keep the ignore lists in memory. Each time a page is changed through the mappings wiki, the wiki should send an update to the server. This way, the ignore lists are always up to date, and we don't need to save them in files.

It probably would take at most a few hours to implement this, because most of the code already exists. When the server starts, it already loads all the mappings from the wiki, keeps them in memory and uses them to compute the statistics and for the sample extraction. It also receives update notifications from the mappings wiki. [1][2] We only need to add a few lines of code to check if a page contains one of the ignore templates and update the ignore lists. We patched the MediaWiki code of the mappings wiki so it sends a notification to the server each time a wiki page is modified, and we probably don't need to change anything on that side. (I don't know if that code is in some repo or just in the PHP files on mappings.dbpedia.org ...)
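
To make the "few lines of code" concrete, here is a minimal sketch of the in-memory ignore lists being updated from page-change notifications. It is written in Python for brevity, whereas the real server is Scala, and it is not taken from the extraction framework. The class name `IgnoreLists`, the method names, and the parameter syntax `{{IgnoreProperty|name}}` are assumptions for illustration; the thread only proposes the two template names.

```python
import re

# Template names come from the proposal in this thread; the exact wiki syntax
# (in particular the "|name" parameter of IgnoreProperty) is an assumption.
IGNORE_TEMPLATE = re.compile(r"\{\{\s*IgnoreTemplate\s*(\||\}\})")
IGNORE_PROPERTY = re.compile(r"\{\{\s*IgnoreProperty\s*\|\s*([^|}]+?)\s*(?:\||\}\})")


class IgnoreLists:
    """Hypothetical in-memory ignore lists, keyed by mapping page title."""

    def __init__(self):
        self.templates = set()   # titles of pages marked {{IgnoreTemplate}}
        self.properties = {}     # page title -> set of ignored property names

    def update(self, title, wikitext):
        """Rebuild the entries for one page; wikitext=None means the page was deleted."""
        # Forget everything known about this page, then re-derive it from the new text.
        self.templates.discard(title)
        self.properties.pop(title, None)
        if wikitext is None:
            return
        if IGNORE_TEMPLATE.search(wikitext):
            self.templates.add(title)
        props = {m.group(1) for m in IGNORE_PROPERTY.finditer(wikitext)}
        if props:
            self.properties[title] = props

    def is_ignored(self, title, prop=None):
        """Check a whole template (prop=None) or a single property of it."""
        if prop is None:
            return title in self.templates
        return prop in self.properties.get(title, set())
```

Calling `update` once per page at startup and once per notification would keep the lists current without any files, which is the point of the proposal.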

With a few more hours of work, it should be possible to write a script that goes through the current ignore list files and updates the appropriate mappings wiki page through http://mappings.dbpedia.org/api.php . Pretty simple for templates (because {{IgnoreTemplate}} would be the whole content of the page), much harder for properties (because {{IgnoreProperty}} would have to be inserted in the correct place).
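
For the simple template case, the edit request could look roughly like the sketch below. It only builds the request and does not send it. `action=edit` with `title`, `text`, `summary` and `token` are standard MediaWiki API parameters, but the `Mapping_<lang>:` title scheme, the helper names, and the edit summary are assumptions, and reading the ignore list files is left out because their syntax is not described here.

```python
import urllib.request
from urllib.parse import urlencode

API_URL = "http://mappings.dbpedia.org/api.php"  # endpoint named in this thread


def mapping_title(lang, template):
    """Assumed title scheme for mapping pages, e.g. Mapping_en:Infobox_person."""
    return "Mapping_{}:{}".format(lang, template.replace(" ", "_"))


def ignore_template_params(lang, template, token):
    """Parameters for a MediaWiki action=edit call that replaces the whole page."""
    return {
        "action": "edit",
        "format": "json",
        "title": mapping_title(lang, template),
        "text": "{{IgnoreTemplate}}",           # the whole page content
        "summary": "Mark template as ignored",   # hypothetical edit summary
        "token": token,                          # edit token, obtained separately
    }


def build_edit_request(params):
    """Wrap the parameters in a POST request; the caller decides when to send it."""
    data = urlencode(params).encode("utf-8")
    return urllib.request.Request(API_URL, data=data)
```

The property case would need to fetch the existing page text first and insert `{{IgnoreProperty}}` at the right place, which is why it is the harder half.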

Some historical background:

The ignore lists started life as a bit of a hack. Some templates and properties messed up the statistics, so we added the ability to ignore them: Just add the correct password to the URL of a statistics page and you will see "add to / remove from ignore list" buttons next to each template / property: http://mappings.dbpedia.org/server/statistics/en/?p=... [3][4]

Sorry for the somewhat awkward syntax of the files. They were not meant to be edited by hand; their syntax is optimized for machines, not for humans.

[1] https://github.com/dbpedia/extraction-framework/blob/master/server/src/main/scala/org/dbpedia/extraction/server/ExtractionManager.scala
[2] https://github.com/dbpedia/extraction-framework/blob/master/server/src/main/scala/org/dbpedia/extraction/server/DynamicExtractionManager.scala
[3] https://github.com/dbpedia/extraction-framework/blob/master/server/src/main/scala/org/dbpedia/extraction/server/resources/TemplateStatistics.scala#L212
[4] https://github.com/dbpedia/extraction-framework/blob/master/server/src/main/scala/org/dbpedia/extraction/server/resources/PropertyStatistics.scala#L130

jcsahnwaldt commented 9 years ago

Regarding the priority: Not very high, I would guess.

Maybe this would be a nice GSoC warmup task?

alexandrutodor commented 9 years ago

Hi Christopher,

Great point, it's good to know that some of the functionality to do this is already partially in the code.

> (I don't know if that code is in some repo or just in the PHP files on mappings.dbpedia.org ...)

Most of it is just a standard MediaWiki with an extra extension called the Ultrapedia-API. I'm currently working on updating the wiki to a newer version, since it hasn't been updated in some time.

I think this would be a great GSoC warmup task, and I could guide a student to do it since I already have everything in place including a test wiki where he or she can try things out. The existing mappings on the ignorelist can be added with missingbot in the wiki.

Cheers, Alexandru

VladimirAlexiev commented 9 years ago

Keeping them in the wiki will allow us to write justifications, explanations and discussions. E.g. there's some useful info in Quote (see https://github.com/dbpedia/mappings-tracker/issues/27), but we nuked it whole.

> currently working on updating the wiki to a newer version

Thank you!!!

> is it such a big problem?

I looked at https://github.com/dbpedia/extraction-framework/blob/master/server/src/main/statistics/ignorelist_en.txt. It has:

I did "curl -I" for each of the templates. There are 4 exceptions (existing mappings): https://github.com/dbpedia/mappings-tracker/issues/40. It's not a huge problem.
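
The check can be scripted along these lines: a HEAD request per template, treating 404 as "no mapping exists". This is only a sketch. The `index.php/Mapping_<lang>:` URL scheme and the function names are assumptions, and it relies on the wiki answering 404 for missing pages, which is the same thing the `curl -I` check relies on.

```python
import urllib.error
import urllib.request
from urllib.parse import quote

WIKI_BASE = "http://mappings.dbpedia.org/index.php/"  # assumed page URL prefix


def mapping_url(lang, template):
    """URL of the mapping page for a template (title scheme is an assumption)."""
    title = "Mapping_{}:{}".format(lang, template.replace(" ", "_"))
    return WIKI_BASE + quote(title, safe=":_")


def mapping_exists(url, opener=urllib.request.urlopen):
    """Send a HEAD request (like `curl -I`); True on 200, False on 404."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with opener(req) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # anything else (e.g. 500) is a real error, not "missing"
```

Any ignored template for which `mapping_exists(mapping_url("en", name))` returns True is an exception like the four listed in issue #40.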

Many other ignore lists are empty (e.g. bg, eu). Since many templates are copied from en, which is followed as best practice, it would be great if we could also somehow propagate the ignore lists. E.g. I've seen many "image size" property mappings in bg and some other languages, though these are ignored in en. Ideally, the mapping language would be made more modular, see http://vladimiralexiev.github.io/pres/20150209-dbpedia/dbpedia-problems-long.html#sec-3-3. But I don't know how to do this...