freme-project / freme-ner

Problem with special characters in FREME NER #91

Closed jnehring closed 8 years ago

jnehring commented 8 years ago

During the last developers' call, @freme-project/agro-know stated that there are problems with special characters in FREME NER.

I tried

http://api.freme-project.eu/current/e-entity/freme-ner/documents?language=en&dataset=dbpedia&informat=text&input=%C3%89vreux%20is%20a%20city%20in%20France.

http://api.freme-project.eu/current/e-entity/freme-ner/documents?language=en&dataset=dbpedia&informat=text&input=B%C3%A9roul%20is%20a%20french%20author.

Neither exhibited problems with special characters. Only one link produced by the second call, http://dbpedia.org/page/B%C3%A9roul, is odd: it contains no information apart from stating that this resource is the same as http://dbpedia.org/page/Béroul

@Katsivelisp can you provide some examples of what goes wrong with special characters? @m1ci can you please comment on the strange link of Béroul?

m1ci commented 8 years ago

http://dbpedia.org/page/B%C3%A9roul is properly encoded and it is a valid link. The links are provided as they are in DBpedia.
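To illustrate why the two Béroul links name the same page, here is a minimal Python sketch using only the standard library. It assumes the percent-escapes are UTF-8, which is what `%C3%A9` for "é" indicates; it says nothing about how DBpedia stores the two resources internally.

```python
from urllib.parse import quote, unquote

encoded = "http://dbpedia.org/page/B%C3%A9roul"

# unquote() decodes percent-escapes as UTF-8 by default.
decoded = unquote(encoded)
print(decoded)

# Going the other way, the accented character becomes two escaped bytes.
reencoded = quote("Béroul")
print(reencoded)
```

The first line printed is the readable form of the link, the second the percent-encoded form of the name.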

Katsivelisp commented 8 years ago

Hi,

Just to be clear, we talked about the case of linking with the ORCID dataset, not DBpedia.

I marked 4 examples in blue from the dataset I compiled a few days ago (rows: 56,71,74,80). I have found these names in ORCID and now I'm just trying to find what FREME NER needs to do the linking part.

If I run FREME NER on these names, I get no results from ORCID. If I replace the accented characters (é, á) with their plain Latin equivalents (e, a) and maybe reverse the order of the words, then I sometimes get the desired results. The same happens when names include a hyphen (-) or a full stop (.). If I remove them, chances are that I get something back from ORCID.

So, my question was: do I need to solve such problems as part of the pre-processing that Agroknow does on author records? Or is this something that should be taken into account on the server-side of FREME?
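If the normalization were done on the client side, it could look roughly like the sketch below. This is only an illustration of the manual steps described above (dropping accents, removing hyphens and full stops, reversing the word order); the function names are hypothetical, and replacing a hyphen with a space rather than deleting it is an assumption.

```python
import unicodedata


def fold_diacritics(text):
    """Drop combining accents, e.g. 'é' -> 'e' (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_variants(name):
    """Return the spellings of a name to try, original first, without duplicates."""
    folded = fold_diacritics(name)
    # Assumption: hyphens and full stops become spaces, then whitespace is collapsed.
    stripped = " ".join(folded.replace("-", " ").replace(".", " ").split())
    reversed_order = " ".join(reversed(stripped.split()))

    variants = [name]
    for candidate in (folded, stripped, reversed_order):
        if candidate not in variants:
            variants.append(candidate)
    return variants


for variant in name_variants("José F. Marcos"):
    print(variant)
```

Each variant could then be submitted in turn until one of them produces a link.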

jnehring commented 8 years ago

I tried to reproduce the problem. I made these two API requests to link "Jose F. Marcos" to ORCID:

http://api.freme-project.eu/current/e-entity/freme-ner/documents?dataset=orcid&language=en&informat=text&mode=link&input=Jos%C3%A9%20F.%20Marcos

http://api.freme-project.eu/current/e-entity/freme-ner/documents?dataset=orcid&language=en&informat=text&mode=link&input=Jose%20F.%20Marcos

The first submits the name "José F. Marcos", the second "Jose F. Marcos". The first does not produce a link, the second produces a link. In ORCID the name is written as in the 2nd request.
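For reference, the two request URLs can be rebuilt from the plain names with a short Python sketch. The helper name `build_link_request` is hypothetical; the endpoint and parameters are copied from the requests above. The sketch only constructs the URLs and does not send them.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://api.freme-project.eu/current/e-entity/freme-ner/documents"


def build_link_request(name):
    """Build the FREME NER link-mode URL for a name (hypothetical helper)."""
    params = {
        "dataset": "orcid",
        "language": "en",
        "informat": "text",
        "mode": "link",
        "input": name,
    }
    # quote_via=quote encodes spaces as %20; the default (quote_plus) would use "+".
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


print(build_link_request("José F. Marcos"))
print(build_link_request("Jose F. Marcos"))
```

The only difference between the two URLs is `Jos%C3%A9` versus `Jose` in the `input` parameter, so the accent is transmitted correctly and the mismatch must arise on the lookup side.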

Maybe Solr has a feature to enable searching for both spelling variations with one request.

sandroacoelho commented 8 years ago

Hi @jnehring

Yes... Solr provides a feature to deal with it. We can configure the approach (string matching/metric algorithms), accuracy, etc. I will talk with @m1ci about it.

jnehring commented 8 years ago

Ok spell checking sounds like an interesting solution but I did not really understand how it solves the problem. But if you think it helps - go on.

I found a thread on GitHub. It is about using the tokenizer to convert special characters to their "normal" form (e.g. à to a). I am sure it works, but the downside is that a) the size of our index will double and b) I think we have to re-index the Solr server / re-upload all datasets.
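To make the re-indexing concern concrete, here is a toy Python model of an inverted index. It is not Solr code: `fold`, `build_index` and `search` are hypothetical stand-ins for an analyzer chain that folds accents (Solr ships an `ASCIIFoldingFilterFactory` for this kind of folding, though the thread does not say which filter was meant). The sketch shows that folding only helps when the stored terms and the query terms go through the same analyzer.

```python
import unicodedata


def fold(token):
    """Lowercase a token and strip combining accents (toy analyzer)."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(labels, analyzer):
    """Map each analyzed token to the set of labels containing it."""
    index = {}
    for label in labels:
        for token in label.split():
            index.setdefault(analyzer(token), set()).add(label)
    return index


def search(index, query, analyzer):
    """Return the labels that contain every analyzed query token."""
    results = None
    for token in query.split():
        hits = index.get(analyzer(token), set())
        results = hits if results is None else results & hits
    return results or set()


labels = ["Jose F. Marcos", "Évreux"]
old_index = build_index(labels, str.lower)  # indexed without folding
new_index = build_index(labels, fold)       # re-indexed with folding

# Without folding, the accented query misses the unaccented label.
print(search(old_index, "José F. Marcos", str.lower))
# Folding only the query is not enough: the old index still stores "évreux".
print(search(old_index, "Evreux", fold))
# After re-indexing with the same analyzer, both spellings match.
print(search(new_index, "José F. Marcos", fold))
print(search(new_index, "Evreux", fold))
```

The second query is the reason a change to the analyzer normally implies rebuilding the index: terms written under the old analyzer are not rewritten retroactively.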

jnehring commented 8 years ago

any news here, @sandroacoelho ?

sandroacoelho commented 8 years ago

Hi @jnehring: I am doing my final tests. I intend to commit tomorrow

m1ci commented 8 years ago

@sandroacoelho found the bug, will fix it these days. No re-indexing will be required.

jnehring commented 8 years ago

@sandroacoelho found the bug, will fix it these days. No re-indexing will be required.

great!

sandroacoelho commented 8 years ago

I've tested some approaches locally and they work for string fields. FREME uses the text_general field type for labels. I am reading the Solr docs, looking for a solution that does not require rebuilding the index.

sandroacoelho commented 8 years ago

We need to re-index our current index. Can we do this at any time, or must it be done outside working hours?

jnehring commented 8 years ago

For FREME dev you can do it anytime. For FREME live we should announce in the developer call when this happens. I think on FREME live it can also happen during normal working hours as long as the partners are informed.

m1ci commented 8 years ago

Yes, @sandroacoelho please do this on FREME dev first. For FREME live we need to inform the consumers at least one week in advance.

sandroacoelho commented 8 years ago

todo list

jnehring commented 8 years ago

@sandroacoelho this is just an idea, tell me if I am wrong: maybe the task of re-indexing Solr can be combined with the task "Make datasets available"? So you restart FREME NER with an empty database and then run the scripts you created in #114 to upload the datasets to FREME NER?

sandroacoelho commented 8 years ago

Hi @jnehring :+1: Yes. You are right. I am working on both tasks

Best,

jnehring commented 8 years ago

I checked the two examples you sent me and they work fine. The issue is solved. Thank you @sandroacoelho !

http://198.199.121.96/freme-ner/e-entity/freme-ner/documents?dataset=orcid&language=en&informat=text&outformat=turtle&mode=link&input=Jos%C3%A9%20F.%20Marcos

http://198.199.121.96/freme-ner/e-entity/freme-ner/documents?dataset=orcid&language=en&informat=text&outformat=turtle&mode=link&input=Jose%20F.%20Marcos

jnehring commented 8 years ago

Next task: Replace the existing datasets on freme-dev and freme-live.

sandroacoelho commented 8 years ago

I am closing this issue and will open a new one as a task.