PolMine / dbpedia

R Wrapper for Corpus Annotation with DBpedia Spotlight

Add parameter `support` to `get_dbpedia_uris()` #30

Open ChristophLeonhardt opened 8 months ago

ChristophLeonhardt commented 8 months ago

`get_dbpedia_uris()` currently passes the text and the `confidence` parameter to DBpedia Spotlight. However, there are further parameters that influence the results of the service. They are described in the paper by Mendes et al. (2011) and shown in examples on the DBpedia Spotlight GitHub wiki (https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Web-service and https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/User's-manual).

One of those parameters is `support`, which sets a minimum threshold for the prominence of an entity in Wikipedia (Mendes et al. 2011: 3-4). Including it might be useful. If I am not mistaken, `support` could be added to the query parameters created for the GET request in `get_dbpedia_uris()`.
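A minimal sketch of how `support` could enter the query, assuming the request parameters are assembled as a named list. `build_spotlight_query()` is a hypothetical helper used only for illustration. It is not a function of the package, and the actual code in `get_dbpedia_uris()` may be organized differently.

```r
# Hypothetical helper (not part of the dbpedia package): assembles the
# query parameters for a GET request to DBpedia Spotlight.
build_spotlight_query <- function(text, confidence = 0.5, support = NULL) {
  query <- list(text = text, confidence = confidence)
  # Add `support` only if the user sets it, so that the service's own
  # default applies otherwise.
  if (!is.null(support)) {
    stopifnot(is.numeric(support), length(support) == 1L)
    query[["support"]] <- as.integer(support)
  }
  query
}

query <- build_spotlight_query(
  text = "Berlin is the capital of Germany.",
  confidence = 0.35,
  support = 20
)
str(query)
```

A list like this could then be passed as the `query` argument of `httr::GET()`, if that is how the request is sent.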

ablaette commented 8 months ago

Adding this argument is not a problem, see the implementation for `types`. In the examples linked above I see the values -1 and 20: What do they mean, and what would a reasonable default value be?

ablaette commented 8 months ago

I have implemented the argument. What would be a telling example to explain the effect of the parameter? Is the (preliminary) documentation sufficient?

ChristophLeonhardt commented 8 months ago

According to Mendes et al. (2011: 3-4), support refers to the minimum number of inlinks of a resource. I do not think that it is explained further, but I assume that this refers to the number of other pages linking to the resource? It is used to determine the "Prominence" of a resource, according to the paper (Mendes et al. 2011: 3).

I am not sure what -1 means in the examples. I assume that no filtering is applied here, but I am not sure why this would not be the case with support = 0.

In my trials with more prominent concepts such as city or county names, the support threshold could be set a lot higher. A value of 500 seemed plausible to me, but I assume this was due to the selection of specific entities.

I would assume that 20, as suggested in the examples linked above, might be reasonable when used on ordinary entities.

ablaette commented 8 months ago

So ... we should include Mendes et al. 2011 as a reference in the package!