ddavisqa / google-refine

Automatically exported from code.google.com/p/google-refine

Reconciliation (Freebase) of accented characters doesn't work #314

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?

1. Create a new project from the attached CSV file
2. Start reconciliation on column "name" (Freebase Reconciliation Service, 
reconcile against no particular type)
3. Note that no matches have been found for "Hotel Glória"
4. Click "Search for a match"
5. Click on the first "Hotel Glória" hit in the search results 
6. Click "Choose new match"
7. Note that the cell now reads "Hotel Gl�ria"

What is the expected output? What do you see instead?

I expect the cell content to remain set to "Hotel Glória", without garbling 
the non-ASCII characters.

I also expect the reconciliation to return some matches in the first place. It 
does if the cell is transliterated to the ASCII string "Hotel Gloria".
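
For anyone who wants to reproduce the ASCII workaround mentioned above, here is a minimal sketch of one way to transliterate a cell value. It uses Python's standard `unicodedata` module; the helper name `strip_diacritics` is made up for illustration and is not part of Refine. It only handles characters that decompose into a base letter plus combining marks.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper: drop combining marks after NFKD decomposition."""
    # NFKD splits "ó" into "o" followed by U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Hotel Glória"))  # prints "Hotel Gloria"
```

This is a workaround for testing only, since it discards information that the reconciliation service should be able to use.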

What version of Google Refine are you using?

google-refine-2.0-r1836

What operating system and browser are you using?

I've tested this on Debian Squeeze and Ubuntu 10.04 with Firefox 3.5.16 and 
Chrome 8.0.552.237 (Official Build 70801)

Is this problem specific to the type of browser you're using, or does it happen
in all the browsers you tried?

Happens on all browsers I tried.

Please provide any additional information below.

I suspect there is a bug somewhere in Refine that occasionally damages the 
string encodings. On several occasions I have seen non-ASCII cell content that 
was previously displayed and encoded correctly get silently replaced with � 
characters. However, the steps above are the only instance of this bug I've been 
able to reproduce consistently.
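
To illustrate how a � can end up in a string, here is a small sketch of one common mechanism: bytes written in one encoding and read back as another. This is an assumption about the general class of bug, not a diagnosis of where Refine goes wrong.

```python
original = "Hotel Glória"

# Assumed scenario: the text is serialized as Latin-1 somewhere along the way...
latin1_bytes = original.encode("latin-1")  # "ó" becomes the single byte 0xF3

# ...and later decoded as UTF-8. 0xF3 followed by "r" is not valid UTF-8,
# so a lenient decoder substitutes U+FFFD (the replacement character).
garbled = latin1_bytes.decode("utf-8", errors="replace")
print(garbled)  # prints "Hotel Gl�ria"

# When both sides agree on UTF-8, the text survives the round trip.
roundtrip = original.encode("utf-8").decode("utf-8")
print(roundtrip == original)  # prints "True"
```

The garbled result matches the cell content in step 7, which is at least consistent with a mismatched encode/decode pair somewhere between the search dialog and the cell update.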

Also, it appears that the Freebase reconciliation service does not return any 
results at all for non-ASCII strings. That may be related to the problem 
described above.

Original issue reported on code.google.com by tomazs...@gmail.com on 28 Jan 2011 at 2:29

Attachments:

GoogleCodeExporter commented 8 years ago
The display issues appear to have been fixed by the earlier round of 
character encoding fixes, but reconciliation still fails to return any 
results even though there is a perfect match, so let's focus this bug on 
that part.

I'm not sure at this point whether the problem is with Refine or with the 
Freebase reconciliation service.

Original comment by tfmorris on 6 Jun 2011 at 9:48

GoogleCodeExporter commented 8 years ago
As I've done more reconciliation recently, I've noticed a large number of misses 
for things that are in Freebase, apparently due to a stale reconciliation 
service index. It's possible the reconciliation miss here was caused by that 
rather than by the diacritic.

More data/investigation needed...

Original comment by tfmorris on 3 Jul 2011 at 11:27

GoogleCodeExporter commented 8 years ago
The reconciliation service sucks, but as far as I can tell it doesn't suck any 
worse for international characters.  I downloaded a spreadsheet of 90 buildings 
with names beginning "Casa" to test with.

For the 8 with accented names, it automatched 2, returned no candidate at all 
for 3, and returned the correct candidate as the top scorer for the other 3, 
but with a score too low to automatch.

For the 82 non-accented names: 14 automatched, 16 with no candidate, 52 with 
scores too low to automatch.

Note that since these all came from Freebase to start with, they should all be 
guaranteed to match. My conclusion: it sucks, but doesn't suck any worse for 
accented characters.
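
For reference, the counts above can be turned into per-group rates with a short script. This is only arithmetic on the numbers reported in this comment; the dictionary layout and the `rates` helper are made up for illustration.

```python
# Counts as reported in the comment above (8 accented + 82 non-accented = 90).
counts = {
    "accented": {"automatched": 2, "no candidate": 3, "score too low": 3},
    "non-accented": {"automatched": 14, "no candidate": 16, "score too low": 52},
}


def rates(group: dict) -> dict:
    """Hypothetical helper: convert outcome counts to fractions of the group."""
    total = sum(group.values())
    return {outcome: n / total for outcome, n in group.items()}


for name, group in counts.items():
    total = sum(group.values())
    summary = ", ".join(f"{k}: {v:.1%}" for k, v in rates(group).items())
    print(f"{name} (n={total}): {summary}")
```

With only 8 accented names in the sample, the per-outcome rates for that group move by 12.5 points per name, so they are a rough indication at best.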

Original comment by tfmorris on 19 Nov 2011 at 12:39

GoogleCodeExporter commented 8 years ago
Tom, did the spreadsheet have any additional columns containing metadata that 
could also have been matched as a disambiguator, using the additional checkbox 
for a property on buildings or structures? (Reconciling with additional columns 
uses Collin's older recon service, which sucks sometimes and sometimes not, 
depending on the domain.) I'm curious whether that would have changed the 
scoring. Can you test?

Original comment by thadguidry on 19 Nov 2011 at 1:05

GoogleCodeExporter commented 8 years ago

Original comment by tfmorris on 18 Sep 2012 at 3:03