UTF-8 Multibyte support for Read/Write operations

GoogleCodeExporter commented 9 years ago

Feature I would like to see:

  The WriteFile/ReadFile services should support the UTF-8 character set. Currently they support only single-byte character sets.

Impact if not implemented:

 File Connector cannot be used in projects where Chinese/Japanese language content is used.

Alternatives:

 Write ws-apps code for read/write operations
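
As an illustration of what such custom read/write code might look like, here is a minimal sketch in plain Java that writes and reads a file with an explicit UTF-8 charset instead of relying on the platform default. The class and method names (Utf8FileSketch, writeUtf8, readUtf8, roundTrip) are hypothetical and not part of the File Connector or coelib.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not File Connector or coelib code.
final class Utf8FileSketch {

    // Encode the text as UTF-8 bytes and write them to the file.
    static void writeUtf8(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }

    // Read all bytes and decode them as UTF-8, independent of the platform default charset.
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // Write the text to a temporary file, read it back, and return what was read.
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("utf8-sketch", ".txt");
            try {
                writeUtf8(tmp, text);
                return readUtf8(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as Unicode escapes so the source file stays ASCII.
        String text = "\u65E5\u672C\u8A9E";
        System.out.println(text.equals(roundTrip(text)));
    }
}
```

The point of the sketch is only that the charset is named explicitly on both the write and the read side; how this would be wired into the connector's services is not shown here.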

Original issue reported on code.google.com by phanisri...@gmail.com on 7 Feb 2011 at 5:58

GoogleCodeExporter commented 9 years ago

Assigned ticket to Phani

Original comment by ma...@vandeveen.com on 7 Feb 2011 at 7:26

GoogleCodeExporter commented 9 years ago

Please apply the changes in a separate branch and ask Philip/Anantha for a 
code review before merging it into trunk.

Original comment by ma...@vandeveen.com on 7 Feb 2011 at 7:41

GoogleCodeExporter commented 9 years ago

Are the coelib sources also open sourced? The coelib sources have to be fixed 
for the UTF-8 issue.

Original comment by phanisri...@gmail.com on 9 Feb 2011 at 9:32

GoogleCodeExporter commented 9 years ago

Technically speaking: yes, the coelib is also open source. But the code is not 
yet on googlecode. I talked to Marco yesterday and he said he will arrange it 
asap.

Original comment by pgus...@gmail.com on 10 Feb 2011 at 6:38

GoogleCodeExporter commented 9 years ago

Correct, the CoE lib is planned to be open sourced as well. I plan to pick it 
up next week.

Original comment by ma...@vandeveen.com on 17 Feb 2011 at 4:38

GoogleCodeExporter commented 9 years ago

The Node.getData() API does not return proper UTF-8 content if a character takes 
2 bytes. The reason is that Node.getData() is a native call, and Java's JNI 
string creation corrupts the UTF-8 bytes. I am waiting for a workaround from 
the PD team. If none comes, I will modify the coelib sources to use the 
Node.writeToString() API rather than Node.getData().
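
To show the general kind of corruption meant here, the following sketch decodes the UTF-8 bytes of a 2-byte character once as single-byte text and once as UTF-8. It is plain Java and does not call Node.getData() or any NOM API; whether this is the actual mechanism inside the native call is an assumption, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only uses java.lang.String and standard charsets.
final class DecodeSketch {

    // Treats every byte as one character (ISO-8859-1), which splits multibyte sequences.
    static String decodeSingleByte(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Decodes the bytes as UTF-8, so a multibyte sequence becomes one character again.
    static String decodeUtf8(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00E9"; // e with acute accent, encoded as C3 A9 in UTF-8
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

        System.out.println(bytes.length);                          // 2
        System.out.println(decodeSingleByte(bytes).length());      // 2, one character per byte
        System.out.println(decodeUtf8(bytes).equals(original));    // true
    }
}
```

Decoding the two bytes as single-byte text yields two unrelated characters (U+00C3 and U+00A9) in place of the original one, which is the typical symptom when multibyte content passes through a single-byte code path.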

Original comment by phanisri...@gmail.com on 17 Feb 2011 at 11:32

GoogleCodeExporter commented 9 years ago

CoE Lib has been open sourced. See http://code.google.com/p/cordyscoelib/

Original comment by mvdv...@cordys.com on 2 Aug 2011 at 8:43

Changed state: Accepted
Added labels: hasb

GoogleCodeExporter commented 9 years ago

CoE lib uses the platform NOM libraries to do some XML parsing.

Apparently there is a bug related to Node.getData(). It returns a UTF-8 string; 
it should return a UTF-16 based string. 

UTF-16 has a far wider character set, including most/all of the Japanese 
characters. 
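
For reference, UTF-8 and UTF-16 can both encode every Unicode code point, Japanese characters included; they differ in byte layout, and Java strings use UTF-16 code units internally. The sketch below compares the encoded sizes of one kanji and checks that it survives a round trip through both encodings. The class and method names are hypothetical, and nothing here touches the NOM libraries.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class comparing the two encodings for the same text.
final class EncodingSizeSketch {

    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes when encoded as big-endian UTF-16 without a byte order mark.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // True if encoding and decoding with either charset gives back the same text.
    static boolean roundTrips(String s) {
        String viaUtf8 = new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        String viaUtf16 = new String(s.getBytes(StandardCharsets.UTF_16BE), StandardCharsets.UTF_16BE);
        return s.equals(viaUtf8) && s.equals(viaUtf16);
    }

    public static void main(String[] args) {
        String kanji = "\u65E5"; // a single Japanese character
        System.out.println(utf8Length(kanji));   // 3
        System.out.println(utf16Length(kanji));  // 2
        System.out.println(roundTrips(kanji));   // true
    }
}
```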

It looks like this issue is solved in BOP-4 CU18. Phani is currently validating 
this.

Original comment by mvdv...@cordys.com on 2 Aug 2011 at 8:57

GoogleCodeExporter commented 9 years ago

The issue is solved in BOP-4 CU17, Anantha is validating this.

Original comment by mvdv...@cordys.com on 2 Aug 2011 at 10:40

GoogleCodeExporter commented 9 years ago

Original comment by mvdv...@cordys.com on 2 Aug 2011 at 12:59

Added labels: Type-Defect
Removed labels: Type-Enhancement

GoogleCodeExporter commented 9 years ago

I have the same kind of issue as issue 12 (my issue number is 30).

I use BOP-4 CU14.
If I set a UTF-16 based string in BOP-4 CU14, can I solve this problem?
If yes, in which item should I set UTF-16?

Or should I update to CU17 or a later CU version?

Please kindly advise me.

Original comment by komiy...@japacom.co.jp on 3 Aug 2011 at 8:20

GoogleCodeExporter commented 9 years ago

We received feedback from the platform team that the NOM issue (BOP's XML 
libraries) should be solved in CU17 and later.

Anantha is currently validating this statement with the file connector. Please 
wait for Anantha's results. 

As far as currently known, setting UTF-16 won't help.

Original comment by mvdv...@cordys.com on 3 Aug 2011 at 8:54

GoogleCodeExporter commented 9 years ago

Update: It's not a platform issue. Most of the connector is single-byte at the 
moment (except for the ReadLargeXMLFiles method).

Anantha is investigating what it takes to make the full connector multibyte.

Original comment by ma...@vandeveen.com on 8 Aug 2011 at 6:07

GoogleCodeExporter commented 9 years ago

Issue 30 has been merged into this issue.

Original comment by ma...@vandeveen.com on 9 Aug 2011 at 6:56

GoogleCodeExporter commented 9 years ago

Hi komiya_m,

We investigated: it's not an easy fix and will take quite some development to 
get multibyte support. 

Are you interested in implementing this yourself and sharing it back with the 
community?

See contribute section here: https://wiki.cordys.com/x/bwIZ

Regards,
Marco

Original comment by ma...@vandeveen.com on 9 Aug 2011 at 7:43

GoogleCodeExporter commented 9 years ago

What's the status of this defect? Will it be implemented?
Regards,
Ton...

Original comment by tdou...@gmail.com on 2 Sep 2014 at 7:55

MatthiasEberl / cordysfilecon

UTF-8 Multibyte support for Read/Write operations #12