freme-project / e-Entity


Smart Filters #51

Closed jnehring closed 8 years ago

jnehring commented 9 years ago

@koidl suggested in #48 that we write smart filters to remove wrong spottings. The original suggestion was to filter out every spotting with only one character (see https://github.com/freme-project/freme-ner/issues/35) or with non-alphanumeric characters.
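A minimal sketch of such a filter, assuming the spottings are available as plain strings. The function name `is_plausible_spotting` is hypothetical and not part of FREME, and reading "non-alphanumeric characters" as "contains no alphanumeric character at all" is one possible interpretation of the suggestion.

```python
def is_plausible_spotting(surface_form: str) -> bool:
    """Return False for spottings that this simple filter would drop."""
    text = surface_form.strip()
    # Rule 1: drop spottings that are empty or a single character.
    if len(text) <= 1:
        return False
    # Rule 2 (assumed interpretation): drop spottings that contain
    # no alphanumeric character at all, e.g. "--" or "??".
    if not any(ch.isalnum() for ch in text):
        return False
    return True


# Invented example spottings, for illustration only.
spottings = ["Berlin", "a", "--", "Nordic Optical Telescope", "?"]
kept = [s for s in spottings if is_plausible_spotting(s)]
```

A stricter variant could reject any spotting that contains a non-alphanumeric character, but that would also drop multi-word names, because spaces are not alphanumeric.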

This GitHub issue should move this topic forward.

Maybe a good first step is to collect some mis-spottings that can be eliminated by such a filter?

One more thought about this topic: such a filter can improve precision at the cost of recall. That is, filtering removes wrong spottings but also some correct ones, so we will detect fewer entities overall. I think for WRIPL and topic detection, precision is more important than recall.
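To make the trade-off concrete, here is a small worked example. The counts are invented for illustration and are not measurements from FREME or WRIPL data.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Invented counts: the text contains 100 real entities.
# Before filtering: 80 correct spottings, 40 wrong spottings, 20 entities missed.
before = precision_recall(80, 40, 20)

# Suppose the filter removes 30 wrong spottings but also 10 correct ones:
# 70 correct spottings, 10 wrong spottings, 30 entities missed.
after = precision_recall(70, 10, 30)
```

With these numbers, precision rises from about 0.67 to 0.875 while recall falls from 0.8 to 0.7.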

m1ci commented 9 years ago

> Maybe a good first step is to collect some mis-spottings that can be eliminated by such a filter?

"NOT" is spotted as an entity.

x-fran commented 9 years ago

Is there any possibility that dbpedia-spotlight may give you wrong entities back? I mean, DBpedia, like 100% of software, is not bug-free. Excluding only the spotted "NOT" may not be enough. Good idea, but not enough.

m1ci commented 9 years ago

> Is there any possibility that dbpedia-spotlight may give you wrong entities back?

Sure.

> I mean, DBpedia, like 100% of software, is not bug-free.

Wrongly spotted entities are usually not bugs; they are spotted wrongly because the trained model "thinks" the word(s) is an entity.

> Excluding only the spotted "NOT" may not be enough. Good idea, but not enough.

There are several ways to approach the problem of wrongly spotted entities:

1) Have a blacklist of wrongly spotted entities, which should be excluded from the output. Since "NOT" occurs very often in your data and is always (wrongly) spotted as an entity, this might help. The downside of this approach is that it requires some manual effort to create such lists.

2) Train the spotting model on data similar to yours (dirty, Web-based). This might be a problem since such data doesn't exist. However, if you can provide such data, that would be great. We currently train on around 120K annotated sentences.
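Approach 1) could look roughly like the sketch below. It assumes each spotting is a dict with a `text` key holding the surface form; that layout, the `BLACKLIST` contents, and the function name `remove_blacklisted` are hypothetical and do not reflect the actual FREME output format.

```python
# Hypothetical blacklist of surface forms known to be spotted wrongly.
BLACKLIST = {"NOT"}


def remove_blacklisted(spottings, blacklist=BLACKLIST):
    """Drop spottings whose surface form is on the blacklist (case-sensitive)."""
    return [s for s in spottings if s["text"] not in blacklist]


# Invented example spottings; the dict layout is an assumption.
spottings = [
    {"text": "NOT"},
    {"text": "Berlin"},
    {"text": "not"},
]
filtered = remove_blacklisted(spottings)
```

Matching is case-sensitive here, so only the uppercase "NOT" is removed. Whether that is the right choice depends on the data.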

Any other ideas?

x-fran commented 9 years ago

Maybe using a stop word list on your side would also help, e.g. http://www.ranks.nl/stopwords. If you can implement this, of course. I don't know. It's your call.

m1ci commented 9 years ago

Can you please elaborate on your idea? Why do you propose stop words, and how can they help?

x-fran commented 9 years ago

Basically, just remove all words on the stop word list from the content before doing anything else.
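A rough sketch of that pre-processing step, assuming whitespace tokenization and case-insensitive matching. The `STOP_WORDS` set is a tiny illustrative subset rather than the linked list, and `strip_stop_words` is a hypothetical helper name.

```python
import string

# Tiny illustrative subset; a real list would be much longer.
STOP_WORDS = {"this", "is", "not", "the"}


def strip_stop_words(text, stop_words=STOP_WORDS):
    """Remove stop words from text before it is sent to entity spotting."""
    kept = []
    for token in text.split():
        # Compare without surrounding punctuation and ignoring case.
        normalized = token.strip(string.punctuation).lower()
        if normalized not in stop_words:
            kept.append(token)
    return " ".join(kept)


cleaned = strip_stop_words("This is NOT the Nordic Optical Telescope.")
```

Because matching ignores case, an uppercase "NOT" is removed as well, even where it might be meant as an acronym.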

m1ci commented 9 years ago

Using stop words is basically the same as idea 1).

"NOT" can also be an entity. Removing words such as "NOT" would prevent "NOT" from being matched to entities such as https://en.wikipedia.org/wiki/Nordic_Optical_Telescope

On your websites you may have texts including words such as "WARNING", "DISCLAIMER", etc., which (I think) will also be spotted as entities, but I don't think they will be in the stop word list.

x-fran commented 9 years ago

Yes, you have a point here, but you can always remove "NOT" or any other item from the list if that helps. Anyway, it's just an idea.

jnehring commented 9 years ago

Maybe it's a good idea to first implement smart filters / stop word lists on the client side. Then WRIPL can experiment with the feature and see whether it is useful. After that we can integrate it into FREME.

I think it is very easy to implement such a filter on the client.

If it is not useful, it is easier to remove it from the client than from FREME. Also, in FREME we write documentation, integration tests, ..., which we do not need on the client side.

m1ci commented 8 years ago

This seems to be a "dead" issue. Closing it. If further discussion is needed, feel free to reopen it.