Open sysarchitect opened 10 years ago
Based on some small tests (you can play around with the number formatting library: http://oss.sheetjs.com/ssf/) I think the issue here boils down to the codepage encoding of the number format and the lack of localization (to be sure, the number 12345.6789 in the US version renders as 12,346p.
, which is definitely not what your version shows).
I would like to take a peek at those files and figure out the best strategy for the localization. Can you make a sample sheet with a cell from that format and save it as (in this particular order):
If it warns before saving in any of the formats, that's OK. If you could share the files with me (either email or putting them somewhere) I can take a look.
I deeply appreciate your help :)
See attached files. I tested on XLS and XLSX only: Column H, I - wrong parsing. Column J - ok.
"Ñ." instead of local currency symbol is not really big issue but it appears only with XLSX format.
Note that there are two additional sheets with objects map - to be ignored. Original file was in XLS 2003 format.
Thank you.
On 01.07.2014 11:35, SheetJSDev wrote:
Based on some small tests (you can play around with the number formatting library: http://oss.sheetjs.com/ssf/) I think the issue here boils down to the codepage encoding of the number format and the lack of localization (to be sure, the number 12345.6789 in the US version renders as |12,346p.|, which is definitely not what your version shows).
I would like to take a peek at those files and figure out the best strategy for the localization. Can you make a sample sheet with a cell from that format and save it as (in this particular order):
- Excel Binary Workbook (XLSB)
- Excel Workbook (XLSX)
- Excel 97-2003 Workbook (XLS)
- XML Spreadsheet 2003
- Microsoft Excel 5.0/95 Workbook
If it warns before saving in any of the formats, that's OK. If you could share the files with me (either email or putting them somewhere) I can take a look.
I deeply appreciate your help :)
— Reply to this email directly or view it on GitHub https://github.com/SheetJS/js-xlsx/issues/78#issuecomment-47625655.
Unfortunately github doesn't support attachments in the email. Not to worry though -- I was able to produce a file by changing my computer settings to Russian.
The XLSX "Ñ."
issue is a very simple fix: the utf characters have to be processed in the number format:
https://github.com/SheetJS/js-xlsx/blob/master/bits/47_styxml.js#L57
--- var f=unescapexml(y.formatCode), j=parseInt(y.numFmtId,10);
+++ var f=unescapexml(utf8read(y.formatCode)), j=parseInt(y.numFmtId,10);
The other half of the problem (thousands separator and decimal character) is locale specific. To see this, in Windows "Region and Language" settings, the separators are in the "Additional Settings" pane.
If you look at the file (unzip the xlsx and look at the file xl/styles.xml), you'll see the format is stored assuming that the comma is the thousand separator and the dot is the decimal. The fix probably looks like this:
1) The formatting library should take some parameters to localize the output
2) The parsers should accept locale options that would control the output separators
3) If possible, the parsers should try to guess the locale of the person who last saved the file.
4) Attempt to resolve a conflict between east asian versions of excel (some number formats like 56 are custom formats in the US version but have different meanings in the various east asian languages)
5) Run down the list of currencies and make sure that every currency is properly formatted
@elad In Hebrew, are numbers/currencies written right-to-left or left-to-right?
Numbers are written left-to-right, and currency unit appears on the left:
@sysarchitect you've opened a big can of worms. I took a look at the number format test, testing with different location settings, and found:
1) the month names should be in the specified locale
2) the VBA Format function is not fully localized
3) the Excel understanding of built-in formats is broken: formats which show up as standard in the US locale are treated as custom formats in Russian
4) There appears to be no mention of the locale in the XLSX file.
Good news is that this setting can be controlled very easily in Excel 2013, so it shouldn't be too hard to create a list for each of the locales.
Hello,
Glad to hear. Understanding is the half of solution ) I do believe js-xlsx is the only really working client tool. Hope issues can be solved.
For most purposes 2013 is enough I think. Thank you.
On 04.07.2014 8:00, SheetJSDev wrote:
@sysarchitect https://github.com/sysarchitect you've opened a big can of worms. I took a look at the number format test https://github.com/SheetJS/test_files/blob/master/number_format.xlsm, testing with different location settings, and found:
1) the month names should be in the specified locale
2) the VBA Format function is not fully localized
3) the Excel understanding of built-in formats is broken: formats which show up as standard in the US locale are treated as custom formats in Russian
4) There appears to be no mention of the locale in the XLSX file.
Good news is that this setting can be controlled very easily in Excel 2013, so it shouldn't be too hard to create a list for each of the locales.
— Reply to this email directly or view it on GitHub https://github.com/SheetJS/js-xlsx/issues/78#issuecomment-48007209.
@sysarchitect @elad when you save as "CSV", does Excel save with semicolons or commas?
For example, this is what I saw when saving the number_format baseline as csv in Russian: https://github.com/SheetJS/test_files/blob/master/number_format_russian.0.csv
For me the default is commas, but I don't use a localized version of Excel, so I don't know if that bit of information is of any help. :)
@elad I should have tested Hebrew before asking -- Excel is horribly inconsistent. For example, this is the date 18/10/1933 16:17:37
(number 12345.6789
under format dd-mmm-yy
) under Excel 2013 when the computer location is set to Israel and the language is set to Hebrew:
Excel saves it as 18-???-33
in the CSV. This is the unicode text:
Which is the correct rendering? The US date is "October 18, 1933"
For Hebrew CSV files, I think I found out you had to have a BOM character first. At least that's what I had to do in my code for the files to actually have readable contents.
The first rendering (18-???-33
) is obviously wrong, the second is correct but written in reverse. So the middle word, which should read "oct," actually reads "tco." As you can see the day and year are also in the wrong sides.
The first thing I would try would be to put in a BOM character, if that doesn't work we can debug further.
@elad mystery solved: when saving a file as CSV, Excel attempts to use the local codepage. That's actually controlled by a different setting (for "non-Unicode applications", strange since Excel is clearly unicode aware).
So when I generated the baseline using codepage 1252 (the standard US codepage) the hebrew characters are invalid (so they were rendered as ?
). When the local codepage is 1255 (which is what the Hebrew actually requires), then it saves as random ascii characters (like éåí øáéòé 18 àå÷èåáø 1933
). Incidentally, if you convert from the Hebrew codepage 1255 to UTF8 things look better:
$ codepage -f 1255 -t 65001 -o new.csv old.csv
$ cat new.csv
יום רביעי 18 אוקטובר 1933
Hello,
In Excel: 1234567 1234567 1234567,00 1234567,00 1 234 567,00 р. 1 234 567,00 р.
Saved in CSV as:
1234567;1234567 1234567,00;1234567,00 1 234 567,00 р.;1 234 567,00 р.
On 08.07.2014 22:50, SheetJSDev wrote:
@sysarchitect https://github.com/sysarchitect @elad https://github.com/elad when you save as "CSV", does Excel save with semicolons or commas?
For example, this is what I saw when saving the number_format baseline as csv in Russian: https://github.com/SheetJS/test_files/blob/master/number_format_russian.0.csv
— Reply to this email directly or view it on GitHub https://github.com/SheetJS/js-xlsx/issues/78#issuecomment-48383468.
@sysarchitect I haven't forgotten this :)
With regards to the XLSX Ñ
format character, can you check against the latest version (0.7.8)?
With regards to the actual format processing, there are two sub-problems.
A) Determine the location information from the file. This is the status:
format | excel version | locale info in file |
---|---|---|
XLS BIFF5 | 5.0/95 | Country Record |
XLS (BIFF8) | 97-2004 | Country Record |
XMLSS | 2003-2004 | Not Available |
XLSX/XLSM | 2007+ | Not Available |
XLSB | 2007+ | Not Available |
B) Use the location information to generate properly formatted text.
The current snag is that the date information is localized as well. For example, consider the month format mmm
. In different locales (even if the language may be the same), they produce different month strings (on a side note, in many languages the text is a phonetic transcription of the english name). This database of names and related logic to switch locales needs to be stored somewhere, and currently take up nearly 100KB (which I'm working to shrink).
Also, can you directly send me a throwaway set of files (XLS, XLSX, XLSB, XML) using the problematic formats that I can add to the test suite? Replying to this email unfortunately doesn't forward the attachments, so you have to send it to dev -- sheetjs -- com
Hello,
H.N.Y. is less expected then new version )
I tested on XLSX only. All other formats you listed are just "save as" copies. Saving in XML is not allowed somehow.
Thank you.
On 30.07.2014 17:47, SheetJSDev wrote:
@sysarchitect https://github.com/sysarchitect I haven't forgotten this :)
With regards to the XLSX |Ñ| format character, can you check against the latest version (0.7.8)?
With regards to the actual format processing, there are two sub-problems.
A) Determine the location information from the file. This is the status:
format excel version locale info in file XLS BIFF5 5.0/95 Country Record XLS (BIFF8) 97-2004 Country Record XMLSS 2003-2004 Not Available XLSX/XLSM 2007+ Not Available XLSB 2007+ Not Available
B) Use the location information to generate properly formatted text.
The current snag is that the date information is localized as well. For example, consider the month format |mmm|. In different locales (even if the language may be the same), they produce different month strings (on a side note, in many languages the text is a phonetic transcription of the english name). This database of names and related logic to switch locales needs to be stored somewhere, and currently take up nearly 100KB (which I'm working to shrink).
Also, can you directly send me a throwaway set of files (XLS, XLSX, XLSB, XML) using the problematic formats that I can add to the test suite https://github.com/SheetJS/test_files? Replying to this email unfortunately doesn't forward the attachments, so you have to send it to dev@sheetjs.com mailto:dev@sheetjs.com
— Reply to this email directly or view it on GitHub https://github.com/SheetJS/js-xlsx/issues/78#issuecomment-50616116.
@sysarchitect Feel free to send the code so we can review :)
The "github way" to do this is to fork the repo (hit the fork button), commit changes, push them to your fork and create a "pull request". Alternatively, you can just paste the function body in a reply and we can take a look. If you want to add it a reply, add three backticks (```
) before and after the code:
function this_is_a_test() {
return 42;
}
Hello,
OK Note that my code would not be applicable for you because it uses GUIDs and tabs as row separators to prepare XML for some project reason. The idea is that browser can read XML directly through DOM and MSSQL can use XML as stored procedure input parameter. In fact it is the only way to paste data table into SQL at one step.
<?xml version='1.0' encoding='UTF-16' ?>
<root>
<Row ID="1" _1="12345" _2="test string 1" _3="01.12.2014"/>
<Row ID="2" _1="678.33" _2="test string 2" _3="02.12.2014"/>
<Row ID="3" _1="12345.45 р." _2="" _3="03.12.2014" _4="""/>
</root>
Where ID is еру Excel row number. It is necessary for back messages to show user is something wrong in some rows. _1 _2 _3 etc are attributes for Excel columns (A, B, C, ...)
The ideal solution would be to implement XML function that returns the format above at library level.
My modification of sheet_to_csv function:
function sheet_to_csv(sheet, opts)
{
var out = [], txt = "";
opts = opts || {};
if(!sheet || !sheet["!ref"]) return "";
var r = decode_range(sheet["!ref"]);
var fs = opts.FS||",", rs = opts.RS||"\n";
for(var R = r.s.r; R <= r.e.r; ++R) {
var row = [];
for(var C = r.s.c; C <= r.e.c; ++C) {
var val = sheet[encode_cell({c:C,r:R})];
if(!val) { row.push(""); continue; }
txt = String(format_cell(val));
if(txt.indexOf(fs)!==-1 || txt.indexOf(rs)!==-1 ||
txt.indexOf('"')!==-1)
txt = "\"" + txt.replace(/"/g, '""') + "\"";
row.push(txt);
}
//out.push(row.join(fs));
out += (R+1).toString() +
"B9B4F8CE-5D92-4D4E-B6D8-2DEAF81789FF" + row.join("\t") +
"FC130C59-37DD-4364-B465-C54213C46434";
}
//return out.join(rs) + (out.length ? rs : "");
return out;
}
//___________________________________________________________________
//___________________________________________________________________
function sheet_to_csv(sheet, opts)
{
var out = [], txt = "";
opts = opts || {};
if(!sheet || !sheet["!ref"]) return "";
var r = decode_range(sheet["!ref"]);
var fs = opts.FS||",", rs = opts.RS||"\n";
for(var R = r.s.r; R <= r.e.r; ++R) {
var row = [];
for(var C = r.s.c; C <= r.e.c; ++C) {
var val = sheet[encode_cell({c:C,r:R})];
if(!val) { row.push(""); continue; }
txt = String(format_cell(val));
if(txt.indexOf(fs)!==-1 || txt.indexOf(rs)!==-1 ||
txt.indexOf('"')!==-1)
txt = "\"" + txt.replace(/"/g, '""') + "\"";
row.push(txt);
}
/*
out.push(row.join(fs));
}
return out.join(rs) + (out.length ? rs : "");
*/
out += (R+1).toString() +
"B9B4F8CE-5D92-4D4E-B6D8-2DEAF81789FF" + row.join("\t") +
"FC130C59-37DD-4364-B465-C54213C46434"; //out.push(row.join(fs));
}
return out;
}
//___________________________________________________________________
Application calls handleFileSelect, where Caller is the HTML element
which fires event, XMLDoc is the empty DOM XML like <?xml version='1.0'
encoding='UTF-16' ?>
Script manipulations in handleFileSelect are all about avoiding browser caching and using properly Excel library depending of input file.
//___________________________________________________________________
var IsGoodExcel = false;
function handleFileSelect(Caller, XMLDoc)
{
//ClearXML(XMLDoc, 2);
var ExcelType;
var head = document.getElementsByTagName('head')[0];
var script_xlsx, script_jszip, script_shim;
var script_xls;
var XLSX_ext = ".xlsx";
var XLS_ext = ".xls";
var files = Caller.files; // FileList object
//for (var i = 0, f; f = files[i]; i++)
//{
var reader = new FileReader();
var f = files[0];
var name = f.name;
if(name.toLowerCase().indexOf(XLSX_ext) + XLSX_ext.length ==
name.length)
{
ExcelType = "XLSX";
script_jszip = document.createElement('script');
script_jszip.type = 'text/javascript';
//script_jszip.src = "../Js/js-xlsx_files/jszip.js";
script_jszip.text =
f_ReadWrite("../Js/js-xlsx_files/jszip.js");
head.appendChild(script_jszip);
script_xlsx = document.createElement('script');
script_xlsx.type = 'text/javascript';
//script_xlsx.src = "../Js/js-xlsx_files/xlsx.js";
script_xlsx.text = f_ReadWrite("../Js/js-xlsx_files/xlsx.js");
head.appendChild(script_xlsx);
script_shim = document.createElement('script');
script_shim.type = 'text/javascript';
//script_shim.src = "../Js/js-xlsx_files/shim.js";
script_shim.text = f_ReadWrite("../Js/js-xlsx_files/shim.js");
head.appendChild(script_shim);
}
if(name.toLowerCase().indexOf(XLS_ext) + XLS_ext.length ==
name.length)
{
ExcelType = "XLS";
script_xls = document.createElement('script');
script_xls.type = 'text/javascript';
//script_xls.src = "../Js/js-xls_files/xls.js";
script_xls.text = f_ReadWrite("../Js/js-xls_files/xls.js");
head.appendChild(script_xls);
script_shim = document.createElement('script');
script_shim.type = 'text/javascript';
//script_shim.src = "../Js/js-xls_files/shim.js";
script_shim.text = f_ReadWrite("../Js/js-xls_files/shim.js");
head.appendChild(script_shim);
}
reader.onload = function(e)
{
var data = e.target.result;
var wb;
try
{
var arr = fixdata(data);
//alert(arr.length);
switch(ExcelType)
{
case "XLSX":
wb = XLSX.read(btoa(arr),
{type: 'base64'});
break;
case "XLS":
wb = XLS.read(arr,
{type:'binary'});
break;
default:
break;
}
process_wb(wb, XMLDoc, ExcelType);
IsGoodExcel = true;
}
catch(e)
{
IsGoodExcel = false;
}
};
reader.readAsArrayBuffer(f);
try
{
if(typeof script_jszip!=='undefined')
head.removeChild(script_jszip);
if(typeof script_xlsx!=='undefined') head.removeChild(script_xlsx);
if(typeof script_xls!=='undefined') head.removeChild(script_xls);
if(typeof script_shim!=='undefined') head.removeChild(script_shim);
}
catch(e)
{
alert(e);
}
Caller.parentNode.innerHTML = Caller.parentNode.innerHTML;
return name;
}
function fixdata(data)
{
var o = "", l = 0, w = 10240;
for(; l<data.byteLength/w; ++l)
o+=String.fromCharCode.apply(null,new
Uint8Array(data.slice(l*w,l*w+w)));
o+=String.fromCharCode.apply(null, new
Uint8Array(data.slice(o.length)));
return o;
}
function process_wb(wb, XMLDoc, ExcelType)
{
var String, CellValue;
var Row;
var root = XMLDoc.documentElement;
var Data = XMLDoc.createElement("Data");
root.appendChild(Data);
var rows = to_csv(wb,
ExcelType).toString().split("FC130C59-37DD-4364-B465-C54213C46434");
for(var Y=0; Y<rows.length-1; Y++)
{
var RowData =
rows[Y].split("B9B4F8CE-5D92-4D4E-B6D8-2DEAF81789FF");
String = RowData[1].split("\t");
Row = XMLDoc.createElement("Row");
Row.setAttribute("RowID", RowData[0]);
for(var X=0; X<String.length; X++)
{
Row.setAttribute("_"+(X+1).toString(),
unescape(String[X].replace(/\"/g,"")));
}
Data.appendChild(Row);
Row = null;
}
}
function to_csv(workbook, ExcelType)
{
var result = [];
workbook.SheetNames.forEach(
function(sheetName)
{
var csv;
switch(ExcelType)
{
case "XLSX":
csv =
XLSX.utils.sheet_to_csv(workbook.Sheets[sheetName]);
break;
case "XLS":
csv =
XLS.utils.sheet_to_csv(workbook.Sheets[sheetName]);
break;
default:
break;
}
//alert(csv);
if(csv.length > 0)
{
//result.push("SHEET: " +
sheetName);
//result.push("");
result.push(csv);
}
}
);
return result.join("\n");
}
//___________________________________________________________________
sysarchitect
On 31.07.2014 16:57, SheetJSDev wrote:
@sysarchitect https://github.com/sysarchitect Feel free to send the code so we can review :)
The "github way" to do this is to fork the repo (hit the fork button), commit changes, push them to your fork and create a "pull request". Alternatively, you can just paste the function body in a reply and we can take a look. If you want to add it a reply, add three backticks (|```|) before and after the code:
function this_is_a_test() { return 42; } — Reply to this email directly or view it on GitHub https://github.com/SheetJS/js-xlsx/issues/78#issuecomment-50754699.
@sysarchitect if I understand what you want to do correctly, you can make your own function that uses sheet_to_json (this function can be in its own script, so you don't have to change xlsx.js or xls.js to do this):
/* xml entity encoding */
var encodeval = (function(){
var rencoding = {
'"': '"',
"'": ''',
'>': '>',
'<': '<',
'&': '&'
};
var decregex=/[&<>'"]/g, charegex = /[\u0000-\u0008\u000b-\u001f]/g;
return function escapexml(text){
if(text == null) return "";
var s = text + '';
return s.replace(decregex, function(y) { return rencoding[y]; }).replace(charegex,function(s) { return "_x" + ("000"+s.charCodeAt(0).toString(16)).substr(-4) + "_";});
};
})();
function sheet_to_xml(sheet) {
var data = XLSX.utils.sheet_to_json(sheet, {header:1});
var o = ['<root>'], t = "", d;
for(var i = 0, ilen = data.length; i != ilen; ++i) {
d = data[i];
t = '<Row';
t += ' ID="' + (i+1) + '"';
for(var j = 0, jlen = d.length; j != jlen; ++j) t += ' _' + (j+1) + '="' + encodeval(d[j]) + '"';
t += '/>';
o.push(t);
}
o.push('</root>');
return o.join("");
}
If you need UTF-16 encoding, then use codepage:
function sheet_to_xml_utf16(sheet) {
if(typeof require !== 'undefined' && typeof cptable === 'undefined') cptable = require('codepage');
return cptable.utils.encode(1200, sheet_to_xml(sheet), 'str');
}
Hello,
We're going to production with our project in one two months. Do we have to wait for currency formats fix or use current version? (this is just a question :)
Best Regards, Ilya Loskutov
On 09.07.2014 10:14, Ilya Loskutov wrote:
Hello,
In Excel: 1234567 1234567 1234567,00 1234567,00 1 234 567,00 р. 1 234 567,00 р.
Saved in CSV as:
1234567;1234567 1234567,00;1234567,00 1 234 567,00 р.;1 234 567,00 р.
On 08.07.2014 22:50, SheetJSDev wrote:
@sysarchitect https://github.com/sysarchitect @elad https://github.com/elad when you save as "CSV", does Excel save with semicolons or commas?
For example, this is what I saw when saving the number_format baseline as csv in Russian: https://github.com/SheetJS/test_files/blob/master/number_format_russian.0.csv
— Reply to this email directly or view it on GitHub https://github.com/SheetJS/js-xlsx/issues/78#issuecomment-48383468.
@sysarchitect we are very close to pushing a boatload of logic to fix currency as well as date/time and other localization issues. Stay tuned :)
Long story short, the SSF module will mirror the C localization functions (e.g. setlocale, LC_NUMERIC). After checking every locale (Windows does not make it easy to switch regions and languages) I have a rough sense for how windows locale information affects the formatting, and it appears to mirror the C localization system.
Glad to hear ) Thank you
On 13.10.2014 9:44, SheetJSDev wrote:
@sysarchitect https://github.com/sysarchitect we are very close to pushing a boatload of logic to fix currency as well as date/time and other localization issues. Stay tuned :)
Long story short, the SSF module will mirror the C localization functions (e.g. setlocale, LC_NUMERIC). After checking every locale (Windows does not make it easy to switch regions and languages) I have a rough sense for how windows locale information affects the formatting, and it appears to mirror the C localization system.
— Reply to this email directly or view it on GitHub https://github.com/SheetJS/js-xlsx/issues/78#issuecomment-58847937.
Hello I have looked through this thread to find an answer, but semicolon to comma is the closest to what I have gotten because what happens to me is the following.
XLSX format output Software,"$6,526",$227,"$45,797","$42,354","$7,931","$15,802","$45,210","$49,233","$32,049","$7,519","$49,916","$21,721"
CSV format output Software,$6.526,$227,$45.797,$42.354,$7.931,$15.802,$45.210,$49.233,$32.049,$7.519,$49.916,$21.721
As you can see that the XLSX for some reason surrounds the cell that contains the "." with guotes (which I don't mind), but what my problem is that it changes it to comma which makes it really hard to split cell by cell so that I can recreate it. Have you encountered this before and if yes do you know a possible reason/fix for the mentioned problem. Any input would be highly appreciated, Thank you :)
@mandros1 I've found same issue, any solution for this?
@MiqueiasGFernandes yikes, a long time has passed since I posted this, so I don't even remember what was the project/case I had issue with as described above. I do remember that I ultimately did a hacky fix by splitting and replacing, but don't remember what exactly. I also do know that I didn't solve it using this library, but I think this might be fixable by adapting some configurations or passing the format type for the parser to use surely (I just wasn't keen on reading the documentation back then). Sorry I couldn't be of more help, good luck!
Hello,
Cell formatted as shown (Russian currency format) parses as:
Can you fix please? Thank you.