ddavisqa / google-refine

Automatically exported from code.google.com/p/google-refine

Byte order marks ending up in project #356

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Despite some digging, it isn't clear how this data came about, although it 
very likely passed through Excel after originating as a CSV file (which is 
itself free of any BOMs).

The issue manifests on export. We've seen two cases, one with two BOMs and the 
other with three(!). It turns out that the BOMs have somehow been imported 
into the column name. The clearest evidence of this is in the HTTP stream on a 
rename:

POST /command/core/rename-column?oldColumnName=%EF%BB%BF%EF%BB%BFproductcode&newColumnName=a&project=1376987350482 HTTP/1.1
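
To make the evidence in that request easier to check, here is a minimal Python sketch that percent-decodes the oldColumnName value and counts the leading U+FEFF characters. The helper names count_leading_boms and strip_leading_boms are made up for illustration; they are not part of Refine.

```python
from urllib.parse import unquote

BOM = "\ufeff"  # U+FEFF; its UTF-8 encoding is the byte sequence EF BB BF


def count_leading_boms(name: str) -> int:
    """Return how many U+FEFF characters prefix the string (hypothetical helper)."""
    return len(name) - len(name.lstrip(BOM))


def strip_leading_boms(name: str) -> str:
    """Return the string with every leading U+FEFF removed (hypothetical helper)."""
    return name.lstrip(BOM)


# Value of oldColumnName copied from the rename request in this report.
raw = "%EF%BB%BF%EF%BB%BFproductcode"

# unquote() decodes percent-escapes as UTF-8 by default.
decoded = unquote(raw)

print(count_leading_boms(decoded))   # number of BOMs in front of the name
print(strip_leading_boms(decoded))   # the column name without them
```

Each %EF%BB%BF triple decodes to a single U+FEFF character, so the column name in this request carries two BOMs in front of "productcode".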

Interestingly, during the rename, if I copy (Ctrl-C) the column name while 
it's highlighted, the clipboard ends up containing the byte order marks 
(verified by hexdumping the paste), and pasting it into a second rename 
carries the BOMs back into Refine(!)

What version of Google Refine are you using? 2.0 r1836

What operating system and browser are you using? Windows 7 & Chrome

Is this problem specific to the type of browser you're using or it happens in 
all the browsers you tried? Logically, it's not related to the browser, at 
least once the data's 'infected'. Unfortunately I can't yet say anything about 
what happens before that point.

I originally reported this at 
http://groups.google.com/group/google-refine/browse_thread/thread/3fa96f6297b28e0

Original issue reported on code.google.com by paulm%pa...@gtempaccount.com on 28 Mar 2011 at 1:08

GoogleCodeExporter commented 8 years ago
Sounds like this could be related to issue 404.

Original comment by tfmorris on 14 Jun 2011 at 6:23

GoogleCodeExporter commented 8 years ago
I'm not sure - if the file were interpreted as UTF-32LE it'd be completely 
hosed.

This bug was reported on v2.0 which doesn't have the encoding-guessing 
heuristics, as far as I know.

Original comment by paulm%pa...@gtempaccount.com on 14 Jun 2011 at 2:17

GoogleCodeExporter commented 8 years ago
Character encoding guessing has been in since well before 2.0.

If you can come up with a way to reproduce this, it'd be a big help in tracking 
it down.

Does it only affect column names or all data fields?
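
One way to answer that question from an exported file is sketched below in Python: it scans every cell of CSV text and reports where U+FEFF occurs. The find_boms function and the sample data are hypothetical, made up for illustration; the sample only mimics the two-BOM "productcode" header from the report and is not the reporter's actual data.

```python
import csv
import io

BOM = "\ufeff"  # U+FEFF, the byte order mark character


def find_boms(csv_text: str):
    """Return (row, column, count) for every cell containing U+FEFF.

    Row 0 is the header row. Hypothetical helper, not part of Refine.
    """
    hits = []
    reader = csv.reader(io.StringIO(csv_text))
    for row_index, row in enumerate(reader):
        for col_index, cell in enumerate(row):
            count = cell.count(BOM)
            if count:
                hits.append((row_index, col_index, count))
    return hits


# Made-up sample: two BOMs in front of the first column name, clean data rows.
sample = BOM * 2 + "productcode,price\nA1,10\nB2,20\n"

for row_index, col_index, count in find_boms(sample):
    where = "header" if row_index == 0 else f"data row {row_index}"
    print(f"{where}, column {col_index}: {count} BOM(s)")
```

If only entries with row 0 come back, the BOMs are confined to the column names; any other row index would mean data fields are affected too.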

Original comment by tfmorris on 14 Jun 2011 at 4:41