dbpedia / extraction-framework

The software used to extract structured data from Wikipedia

Field Sampling in mapping server #327

Open VladimirAlexiev opened 9 years ago

VladimirAlexiev commented 9 years ago

Enhancement to the mapping server:

Why:

What:

page: Alpine skiing at the 2002 Winter Olympics
field "venue": [[Snowbasin]] (downhill, super-G, combined),<br>[[Park City Mountain Resort|Park City]] (giant slalom),<br>[[Deer Valley Resort|Deer Valley]] (slalom),<br>Utah, United States

ujjwalwahi commented 9 years ago

@VladimirAlexiev I am working on this task. I added a new page where we will list the field occurrences (link to my code).

To proceed further, I need guidance on how to get all occurrences of a field.

VladimirAlexiev commented 9 years ago

Hi @ujjwalwahi! I don't know where to get them from, but I'm sure @jimkont can help you. Where is the code of server/templatestatistics that shows the number of occurrences of each field?

ujjwalwahi commented 9 years ago

@VladimirAlexiev The code of server/templatestatistics is here.

VladimirAlexiev commented 9 years ago

I did a bit of tracing. counter is derived from count, which is obtained from sortedProps, which comes from getMappingStats(), which gets them from mappedStatistics. Search shows it in MappingStatsHolder. The closest to what we need is propertyUseCount. Again, search shows where propertyUseCount is summed: it gets it from val properties, which is a map (name, (count, mapped)). Search shows that it's constructed here and is obtained from wikiStats.templates. This seems to be loaded here from some file...
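
To make the shape of that data concrete, here is a minimal Scala sketch. It is not the framework's code: the class TemplateStats, the method bodies and the sample numbers are assumptions made up for illustration. Only the names properties, propertyUseCount and sortedProps, and the (name, (count, mapped)) shape, come from the trace above.

```scala
// Hypothetical, simplified model of the statistics traced above.
// Only the (name -> (count, mapped)) shape is taken from the thread;
// the class name, method bodies and sample numbers are illustrative.
final case class TemplateStats(properties: Map[String, (Int, Boolean)]) {

  // Sum of the use counts of all properties of this template.
  def propertyUseCount: Int =
    properties.values.map(_._1).sum

  // Same sum, restricted to properties flagged as mapped.
  def mappedPropertyUseCount: Int =
    properties.values.collect { case (count, true) => count }.sum

  // Properties ordered by descending count, ties broken by name.
  def sortedProps: Seq[(String, (Int, Boolean))] =
    properties.toSeq.sortBy { case (name, (count, _)) => (-count, name) }
}

object TemplateStatsDemo {
  def main(args: Array[String]): Unit = {
    // Made-up numbers, for illustration only.
    val stats = TemplateStats(Map(
      ("venue", (120, true)),
      ("dates", (95, false)),
      ("competitors", (40, true))
    ))
    println(s"total uses:  ${stats.propertyUseCount}")
    println(s"mapped uses: ${stats.mappedPropertyUseCount}")
    stats.sortedProps.foreach { case (name, (count, mapped)) =>
      println(s"$name\t$count\t$mapped")
    }
  }
}
```

Note that a structure like this only carries counts per field. It holds no page titles or field values, so if the loaded files contain nothing more than this, field sampling would presumably need an additional data source.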

This is as far as I got. @jimkont or @jcsahnwaldt can help better.

Nono314 commented 9 years ago

@VladimirAlexiev I think they're loaded from here. Yes, that's the same place as the ignorelists. I guess you've already been there recently :) There's even a hint just one line above the last one you linked to.

VladimirAlexiev commented 9 years ago

So @ujjwalwahi this is a bit of a dead end: the extractor loads field mapping stats from files, but I still don't know HOW these files are produced. Maybe @jimkont or @jcsahnwaldt can help?