Closed jnehring closed 8 years ago
http://dbpedia.org/page/B%C3%A9roul is properly encoded and it is a valid link. The links are provided as they are in DBpedia.
Hi,
Just to be clear, we talked about the case of linking with the ORCID dataset, not DBpedia.
I marked 4 examples in blue from the dataset I compiled a few days ago (rows: 56, 71, 74, 80). I have found these names in ORCID and now I'm just trying to find out what FREME NER needs in order to do the linking part.
If I run FREME NER on these names, I get no results from ORCID. If I replace the accented characters (é, á) with their unaccented Latin equivalents (e, a) and possibly reverse the word order, I sometimes get the desired results. The same happens when names include a hyphen (-) or a full stop (.): if I remove them, I am more likely to get something back from ORCID.
So, my question was: do I need to solve such problems as part of the pre-processing that Agroknow does on author records? Or is this something that should be taken into account on the server-side of FREME?
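If the workaround described above were done on the client side, it could look roughly like the following sketch. The function names `fold_accents` and `name_variants` are made up for illustration and are not part of FREME; the sketch only generates candidate spellings and leaves the actual FREME NER request to the caller.

```python
import unicodedata
from typing import List


def fold_accents(text: str) -> str:
    """Replace accented letters with their base letters (é -> e, á -> a)."""
    # NFKD splits "é" into "e" plus a combining accent, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_variants(name: str) -> List[str]:
    """Return candidate spellings to try, most faithful first, without duplicates."""
    folded = fold_accents(name)
    # Treat full stops and hyphens as spaces, then collapse repeated whitespace.
    stripped = " ".join(folded.replace(".", " ").replace("-", " ").split())
    # Reverse the word order of the stripped form.
    reversed_order = " ".join(reversed(stripped.split()))
    variants: List[str] = []
    for candidate in (name, folded, stripped, reversed_order):
        if candidate not in variants:
            variants.append(candidate)
    return variants


print(name_variants("José F. Marcos"))
```

Each variant would then be submitted in turn until one produces a link. Whether this belongs in the pre-processing or on the server is exactly the open question here.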
I reproduced the problem. I made these two API requests to link "Jose F. Marcos" to ORCID:
The first submits the name "José F. Marcos", the second "Jose F. Marcos". The first does not produce a link, the second produces a link. In ORCID the name is written as in the 2nd request.
Maybe Solr has a feature to enable searching for both spelling variations with one request.
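For anyone who wants to repeat the comparison, the two requests can be built as in the sketch below. The endpoint and parameters are copied from the request URLs quoted in this thread; the helper name `build_link_request` is made up, and the sketch only constructs the URLs without sending them.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the request URLs quoted in this thread.
BASE_URL = "http://198.199.121.96/freme-ner/e-entity/freme-ner/documents"


def build_link_request(name: str, dataset: str = "orcid") -> str:
    """Build a FREME NER link-mode request URL for a single name."""
    params = {
        "dataset": dataset,
        "language": "en",
        "informat": "text",
        "outformat": "turtle",
        "mode": "link",
        "input": name,
    }
    # quote_via=quote encodes spaces as %20 (the default, quote_plus, would use "+").
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


accented_url = build_link_request("José F. Marcos")
plain_url = build_link_request("Jose F. Marcos")
print(accented_url)
print(plain_url)
```

The only difference between the two URLs is `Jos%C3%A9` versus `Jose` in the `input` parameter.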
Hi @jnehring
Yes... Solr provides a feature to deal with it. We can configure the approach (string matching/metric algorithms), accuracy, etc. I will talk with @m1ci about it.
Ok spell checking sounds like an interesting solution but I did not really understand how it solves the problem. But if you think it helps - go on.
I found a thread on GitHub. It is about using the tokenizer to convert special characters to their "normal" form (e.g. à to a). I am sure it works, but the downside is that a) the size of our index will double and b) I think we have to re-index the Solr server / re-upload all datasets.
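The concern about re-indexing can be illustrated with a toy model. This is not Solr: a plain dictionary stands in for the index, `fold` stands in for an accent-folding analysis step, and the label list is hypothetical. It only shows why folding the query alone is not enough when the stored label itself carries an accent.

```python
import unicodedata


def fold(text):
    """Lower-case the text and strip accents (é -> e)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def build_index(labels, analyzer):
    """Map each analyzed label to its original spelling."""
    return {analyzer(label): label for label in labels}


def lookup(index, query, analyzer):
    """Analyze the query the same way and look it up; None means no match."""
    return index.get(analyzer(query))


labels = ["Jose F. Marcos", "Béroul"]  # hypothetical index content

old_index = build_index(labels, str.lower)  # built before folding was enabled
new_index = build_index(labels, fold)       # rebuilt with folding at index time

# Folding only the query helps when the stored label has no accent...
print(lookup(old_index, "José F. Marcos", fold))
# ...but not when the stored label is accented and the query is not.
print(lookup(old_index, "Beroul", fold))
# After rebuilding the index with folding, both directions match.
print(lookup(new_index, "Beroul", fold))
```

Whether the same reasoning applies to the FREME Solr setup depends on how the label field is analyzed, which this sketch does not model.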
any news here, @sandroacoelho ?
Hi @jnehring: I am doing my final tests. I intend to commit tomorrow
@sandroacoelho found the bug and will fix it in the next few days. No re-indexing will be required.
great!
I've tested some approaches locally and they work for string fields. FREME is using the text_general field type for labels. I am reading the Solr docs, looking for a solution that does not require rebuilding the index.
We need to re-index our current index. Can we do this at any time, or must it be done outside working hours?
For FREME dev you can do it anytime. For FREME live we should announce in the developer call when this happens. I think on FREME live it can also happen during normal working hours as long as the partners are informed.
Yes, @sandroacoelho, please do this on FREME dev first. For FREME live we need to inform the consumers at least one week in advance.
todo list
@sandroacoelho this is just an idea, tell me if I am wrong: maybe the task of re-indexing Solr can be combined with the task "Make datasets available"? So you restart FREME NER with an empty database and then run the scripts you created in #114 to upload the datasets to FREME NER?
Hi @jnehring :+1: Yes. You are right. I am working on both tasks
Best,
I checked the two examples you sent me and they work fine. The issue is solved. Thank you @sandroacoelho !
http://198.199.121.96/freme-ner/e-entity/freme-ner/documents?dataset=orcid&language=en&informat=text&outformat=turtle&mode=link&input=Jos%C3%A9%20F.%20Marcos
http://198.199.121.96/freme-ner/e-entity/freme-ner/documents?dataset=orcid&language=en&informat=text&outformat=turtle&mode=link&input=Jose%20F.%20Marcos
Next task: Replace the existing datasets on freme-dev and freme-live.
I am closing this issue and will open a new one as a task.
During the last developers call, @freme-project/agro-know stated that there are problems with special characters in FREME NER.
I tried
http://api.freme-project.eu/current/e-entity/freme-ner/documents?language=en&dataset=dbpedia&informat=text&input=%C3%89vreux%20is%20a%20city%20in%20France.
http://api.freme-project.eu/current/e-entity/freme-ner/documents?language=en&dataset=dbpedia&informat=text&input=B%C3%A9roul%20is%20a%20french%20author.
Neither exhibited problems with special characters. Only one link, http://dbpedia.org/page/B%C3%A9roul, produced by the second call, is odd: it contains no information apart from stating that this resource is the same as http://dbpedia.org/page/Béroul
@Katsivelisp can you provide some examples of what goes wrong with special characters? @m1ci can you please comment on the strange link of Béroul?
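As a side note on the Béroul link, the two spellings differ only in percent-encoding. The short check below, using Python's standard `urllib.parse`, shows that decoding the first URL gives the second and that encoding the second gives the first. It says nothing about how DBpedia stores or serves the two resources.

```python
from urllib.parse import quote, unquote

encoded = "http://dbpedia.org/page/B%C3%A9roul"
readable = "http://dbpedia.org/page/Béroul"

# %C3%A9 is the UTF-8 byte sequence for "é", percent-encoded.
decoded = unquote(encoded)
# Keep ":" and "/" unescaped so only the non-ASCII character is encoded.
reencoded = quote(readable, safe=":/")

print(decoded == readable)
print(reencoded == encoded)
```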