utf-8 decode error.

hxrain commented 9 years ago

source at: restheart/src/main/java/org/restheart/utils/ChannelReader.java:43

ByteBuffer buf = ByteBuffer.allocate(128);
while (Channels.readBlocking(channel, buf) != -1) {
    buf.flip();
    content.append(charset.decode(buf));
    buf.clear();
}

"Channels.readBlocking(channel, buf)" does not check where a UTF-8 byte sequence is split.
If a multi-byte UTF-8 character is split across two reads, "charset.decode(buf)" produces wrong output.

ujibang commented 9 years ago

Hi hxrain,

If I understood correctly, you are worried that the buffer is read without checking the actual buffer limit.

However, this is what buf.flip() does:

public final Buffer flip()
Flips this buffer. The limit is set to the current position and then the position is set to zero. If the mark is defined then it is discarded.
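
As a small illustration of the quoted Javadoc, the sketch below fills part of a 128-byte buffer and flips it. The class and method names are made up for this example. It shows that flip() bounds the readable region to the bytes actually written, but it says nothing about whether those bytes end on a character boundary.

```java
import java.nio.ByteBuffer;

final class FlipDemo {
    /** Returns {position, limit} of a buffer after writing some bytes and flipping it. */
    static int[] positionAndLimitAfterFlip(int capacity, int bytesWritten) {
        ByteBuffer buf = ByteBuffer.allocate(capacity);
        buf.put(new byte[bytesWritten]); // position advances to bytesWritten
        buf.flip();                      // limit = old position, position = 0
        return new int[] { buf.position(), buf.limit() };
    }

    public static void main(String[] args) {
        int[] pl = positionAndLimitAfterFlip(128, 5);
        System.out.println("position=" + pl[0] + " limit=" + pl[1]); // position=0 limit=5
    }
}
```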

Is this what you mean?

If so, can you also explain how to reproduce the error?

Thanks

hxrain commented 9 years ago

Chinese/CJK characters use 3 bytes in UTF-8, but "Channels.readBlocking(channel, buf)" may read only 1 or 2 bytes of a character at the end of the buffer, so "charset.decode(buf)" produces a wrong result.
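
The effect described here can be reproduced without RESTHeart. The following is a minimal sketch, assuming the body arrives in fixed-size chunks; the class name, the helper method, and the chunk size of 4 are made up for illustration. It decodes each chunk on its own with Charset.decode, the same call the quoted loop uses, and compares that with decoding all bytes at once.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SplitDecodeDemo {
    /** Decodes each fixed-size chunk independently, like the per-read decode in the quoted loop. */
    static String decodeChunkByChunk(byte[] data, int chunkSize, Charset charset) {
        StringBuilder content = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            // Charset.decode replaces malformed input, including a character
            // cut off at the chunk boundary, with the replacement character U+FFFD.
            content.append(charset.decode(ByteBuffer.wrap(data, off, len)));
        }
        return content.toString();
    }

    public static void main(String[] args) {
        String text = "亜亜亜"; // 3 characters, 9 bytes in UTF-8
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);

        String perChunk = decodeChunkByChunk(bytes, 4, StandardCharsets.UTF_8);
        String atOnce = new String(bytes, StandardCharsets.UTF_8);

        System.out.println("per-chunk decode intact: " + perChunk.equals(text));
        System.out.println("single decode intact:    " + atOnce.equals(text));
    }
}
```

With a chunk size that is not a multiple of 3, some characters straddle a chunk boundary and come out as U+FFFD, which would match the symptom reported in this issue.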

ujibang commented 9 years ago

I have looked for this issue and found this article: http://www.oracle.com/technetwork/articles/java/supplementary-142654.html

Can you confirm that this is the issue you are facing, i.e. dealing with a charset that is not a fixed-width 16-bit character encoding?

If so, I'm wondering if and how MongoDB deals with it, because MongoDB stores data in BSON and BSON strings are UTF-8, thus it should not be able to handle your charset.

Have you tried to get the data with the mongo shell?

$ mongo YOURDB
MongoDB shell version: 3.0.1
connecting to: YOURDB
> db.YOURCOLLECTION.find({"_id": new ObjectId("YOUR_DOC_ID")})

You now need to check your string property. If it looks fine, it is an issue in RESTHeart; if it doesn't, the issue is in MongoDB BSON...

Let me know

mkjsix commented 9 years ago

Hi all,

I already have a Python test for Chinese language and it works with RESTHeart, please have a look at: https://github.com/SoftInstigate/restheart-python-test/blob/master/test_basic_integration.py

We might add bigger Chinese documents to extend our tests, @hxrain could you help us?

hxrain commented 9 years ago

Hi: I PUT a big UTF-8 JSON string to RESTHeart; after it is inserted into MongoDB, the content is wrong. The string is longer than 128 and contains Chinese.

"ByteBuffer buf = ByteBuffer.allocate(128);"
After changing 128 to 1024, it is OK.

ujibang commented 9 years ago

I double checked inserting a document with a 1000 chars long string, and it works as expected.

I suspect that you are not using a 16 bit charset encoding.

Can you please check what I suggested in the previous comment?

ujibang commented 9 years ago

I also tried this (with httpie client) using a 128 chars long string with full-width CJK chars.

All worked.

You should send us a similar example to reproduce the issue:

$ http -a a:a POST 127.0.0.1:8080/test/huge long="亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜"
HTTP/1.1 201 Created
...
Location: http://127.0.0.1:8080/test/huge/56264342b8e4f66c6c62e6f8

$ http -a a:a GET http://127.0.0.1:8080/test/huge/56264342b8e4f66c6c62e6f8\?hal=c
HTTP/1.1 200 OK
...

{
    "_created_on": "2015-10-20T13:36:02Z", 
    "_etag": {
        "$oid": "56264342b8e4f66c6c62e6f9"
    }, 
    "_id": {
        "$oid": "56264342b8e4f66c6c62e6f8"
    }, 
    "_lastupdated_on": "2015-10-20T13:36:02Z", 
    "_links": {
        "curies": [], 
        "self": {
            "href": "/test/huge/56264342b8e4f66c6c62e6f8"
        }
    }, 
    "_type": "DOCUMENT", 
    "long": "亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜"
}

mkjsix commented 9 years ago

@ujibang I tested with a long JSON string in Chinese. I have created a JSON file with the atom editor and inserted it into RESTHeart using the command line:

http -a admin:admin POST 192.168.99.100/test/chinese < chinese.json

It seems that some words, for example 鬵鵛嚪, are stored as ��鵛嚪, so the first character of the word is replaced by those question marks.

I can't see any error in my logs. But if I insert the same string directly with Robomongo then I don't see the wrong characters, so the issue seems to be related to RESTHeart during insertion.

$ http -a admin:admin http://192.168.99.100/test/chinese/56264dbae4b0912b466af982
HTTP/1.1 200 OK
Access-Control-Allow-Credentials: true
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Location, ETag, Auth-Token, Auth-Token-Valid-Until, Auth-Token-Location
Auth-Token: fc01492b-c1d0-4128-9549-780ced1b6992
Auth-Token-Location: /_authtokens/admin
Auth-Token-Valid-Until: 2015-10-20T14:35:53.739Z
Connection: keep-alive
Content-Encoding: gzip
Content-Length: 2731
Content-Type: application/hal+json
Date: Tue, 20 Oct 2015 14:20:53 GMT
ETag: 56264dbae4b0912b466af983

{ "_created_on": "2015-10-20T14:20:42Z", "_embedded": {}, "_etag": { "$oid": "56264dbae4b0912b466af983" }, "_id": { "$oid": "56264dbae4b0912b466af982" }, "_lastupdated_on": "2015-10-20T14:20:42Z", "_links": { "curies": [ { "href": "http://restheart.org/curies/1.0/{rel}.html", "name": "rh", "templated": true } ], "rh:coll": { "href": "/test/chinese" }, "rh:document": { "href": "/test/chinese/{docid}?id_type={type}", "templated": true }, "self": { "href": "/test/chinese/56264dbae4b0912b466af982" } }, "_type": "DOCUMENT", "text": "觢敔耜媓幁惁簻臗藱, 鳼鳹鴅殽毰毲撖豇貣, 渳湥牋漻漍犕橍殧澞祣筇摓蜭蜸覟 ��鵛嚪嶜憃撊馺垼娕, 墏瓥籪艭悊惀桷邥佹筡絼綒鍆錌雔潫漦澌, 櫞氌瀙桏毢涒揗斝湁泏�� 蜙, 篧糑縒浶洯浽鼢曘穊齹鑶鸓磩磟窱莃荶衒嵥臒薽, 毹莃荶銈銙鉾澂漀潫縢羱聬蟣襋謯鳻嶬幧媝寔嵒蘥蠩滈瘑睯碫峷敊浭蠛趯嵥, 緌翢榃痯痻泏狔狑鳱, 骱眊砎粁訰貥郪娞弳鉌姎岵帔碢禗禈蟼襛墆澂漀潫塥搒楦煘煓瑐甀瞂, 礔繠褅郺鋋錋薉蕺薂箖緌翢, 侺咥垵 ��槷殦倓剟唗諃笓粊墐墆墏礯籔羻諙趉軨, 摮鑕鬞鬠煘煓瑐豅鑢鑗嬦憼緦鷃黫鼱袀豇貣 ��褆諓懥斶潣憢憉摮汫汭沎氃濈瀄槷殦馦騧騜涬淠淉氠洷銪, 鈖嗋圔樏殣氀塝埱娵鸄齴幓鮛鮥鴮氃濈瀄猺矠筸柦柋牬嶝仉圠裍裚詷殠瀪璷, 瀗犡蝺螷蟞覮郙鬯偟惝掭掝妎岓岕藒��謥蒛蜙踆逯郹酟螏螉褩饓鶪齠鋡禫穛, 磝磢賗韣顪飋棦殔湝儇齹鑶鸓紵脭脧屼汆冹圢�� 痵噮噦噞樧槧樈羭聧蔩趛踠嫷嫶嶕忀瀸蘌烗猀珖, 壾嵷幓螭蟅謕翣聜蒢痵窱縓, 咍垀緦峬峿峹儋圚墝茇茺苶楟棰鋱譖貚趪溹溦滜, 燲獯璯撌斳暩笢笣魆, 鄨鎷痵忣抏旲嬏嶟樀觢耇胇赲嗼嗹墋鈖嗋圔鶀嚵, 蜪裺璸瓁穟鑤仜伒戫摴撦嬃, 禠轒醭鏹玾珆玸璻甔礔葎萻幨懅憴臡虈觿漊硻禂, 踆鞈頨頧剆坲姏烳牼翐璻甔嗢嗂塝倱哻圁嶝仉圠殟鶾鷃, 煔澂漀潫韰頯餩珝砯砨紒翀彃圞趲萷葋蒎圢帄氕鵵鵹鵿, 酳鶀嚵巆灉礭蘠郲郔, 廲籗糴祪笰笱脀蚅蚡箹糈魦擙樲渳湥牋毚丮厹絼綒虰豖阹瀤瀪璷籗糴鋱鬎鯪鯠鍹餳駷萇雊蜩韰頯, 跬榎榯槄 ��弝彶浞浧浵駺駹, 蓨蝪嵧粞絧絏蓂蓌蓖撱雈靮傿觾韄鷡臡虈觿觨誖, 萶葠籿紁羑輐銛靾濍燂犝馺, 婂崥禠垥娀庣艎艑蔉嬦憼檓餀粞絧絏鬖鰝鰨嵀惉, 翣聜匢奾灱梴棆棎鶊鵱鶆葝 �� 愮揫濷瓂癚荾莯袎鶷鷇嗢訬軗郲檌檒濦暕銈銙鉾譖貚趪葠蜄蛖滱漮, 甀磭篧釸釪傛鈊��閍蝑蝞蝢曒檃罞耖茭蜦賕踃璈皞緪歅, 輠潧潣瑽躨钀钁庣斪緳廞徲韣顪飋鑴鱱爧馺甀瞂, 跾毞泂泀蝑蝞蝢磎磃箄縴儳贄蹝轈徲籗糴, 隒雸鼥儴壛纑臞蘬稯馦騧騜誽賚賧跬跐鉠, ��蜸覟鞻饙騴姛帡慛, 溮煡煟觶譈譀蝯瀪璷霺顤幨懅憴絒翗腏滘, 轛轝酅驨訑紱蕇蕱跠嬦��檓鍌鍗鍷磏蚙迻, 粞絧鐩闤鞿螒螝螜蓪諙踣踙饓鶪齠諃賧趡, 踆跾跠畟痄笊葎萻萶蜙 ��鉌鳭哱哸娗蕇蕱咍垀坽輣鋄銶礌簨繖澉儴壛, 榾毄舝榶榩榿螾褾賹峷敊浭榯嶝仉圠箖緌�� 鬄鵊堔埧娾郺鋋錋礌簨繖骱瞵瞷, 殀毞瑽駽髾髽礌簨繖鄻鎟霣, 訬軗郲磃箹糈慔瀁瀎鸄齴鼏噳墺鷖鼳鼲垽娭屔磏, 笢笣紽撌斳暩踣踙葝鱙鷭驨訑紱贄蹝轈潣煔滈溔滆趍跠跬螷蟞吙仜褅褌諃藒襓謥賗譧躆竀篴臌緱翬膞慛, 韎鑤仜伒蜬蝁蜠噾噿珋疧眅驧鬤鸕僄塓塕 �� 闠鞻, 腠腶舝蜦賕踃滈齠齞跣襡襙濇燖燏礛簼繰藙藨蠈, 耖茭趍匢奾灱漻漍犕, 銈醑醏�� 耜僇鄗抩枎殀嫆嫊蕷薎彃慔慛櫞氌瀙毄滱漮嵧, 泏狔狑翍脝艴鬳鴙輠" }

mkjsix commented 9 years ago

Ok, not sure about this, as I made another test and it is working. I suspect in the test above some automatic conversion was happening before hitting the server, maybe when copying the text to my editor.

I actually inserted the JSON:

{
  "use": "restheart",
  "text": "觢 敔耜 媓幁惁 簻臗藱, 鳼鳹鴅 殽毰毲 撖 豇貣, 渳湥牋 漻漍犕 橍殧澞 祣筇 摓 蜭蜸覟 鬵鵛嚪 嶜憃撊 馺 垼娕å"
}

Then no issue at all:

$ http -a admin:admin http://192.168.99.100/test/chinese/56265490e4b09
HTTP/1.1 200 OK
Access-Control-Allow-Credentials: true
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Location, ETag, Auth-Token, Auth-Token-Valid-Until, Auth-Token-Location
Auth-Token: f10dba54-aad8-4cb3-8d08-f0e5d31dca66
Auth-Token-Location: /_authtokens/admin
Auth-Token-Valid-Until: 2015-10-20T15:08:40.610Z
Connection: keep-alive
Content-Encoding: gzip
Content-Length: 462
Content-Type: application/hal+json
Date: Tue, 20 Oct 2015 14:53:40 GMT
ETag: 56265490e4b0912b466af985

{
    "_created_on": "2015-10-20T14:49:52Z", 
    "_embedded": {}, 
    "_etag": {
        "$oid": "56265490e4b0912b466af985"
    }, 
    "_id": {
        "$oid": "56265490e4b0912b466af984"
    }, 
    "_lastupdated_on": "2015-10-20T14:49:52Z", 
    "_links": {
        "curies": [
            {
                "href": "http://restheart.org/curies/1.0/{rel}.html", 
                "name": "rh", 
                "templated": true
            }
        ], 
        "rh:coll": {
            "href": "/test/chinese"
        }, 
        "rh:document": {
            "href": "/test/chinese/{docid}?id_type={type}", 
            "templated": true
        }, 
        "self": {
            "href": "/test/chinese/56265490e4b0912b466af984"
        }
    }, 
    "_type": "DOCUMENT", 
    "text": "觢 敔耜 媓幁惁 簻臗藱, 鳼鳹鴅 殽毰毲 撖 豇貣, 渳湥牋 漻漍犕 橍殧澞 祣筇 摓 蜭蜸覟 鬵鵛嚪 嶜憃撊 馺 垼娕å", 
    "use": "restheart"
}

hxrain commented 9 years ago

Thanks all! Thanks @mkjsix! That is it.

mkjsix commented 9 years ago

Good @hxrain, so is it working now? If you could please leave some explanation here of what was going wrong, it might be helpful for other people dealing with the Chinese language in the future.

hxrain commented 9 years ago

Hi all, for now I have temporarily changed 128 to 2048. The source code needs to be modified so that it can correctly process 3-byte UTF-8 CJK strings.

hxrain commented 9 years ago

A fixed 128-byte buffer is too simplistic; the code needs to check the end position so that it does not break a sequence of CJK UTF-8 bytes.
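
One way to avoid breaking a byte sequence at the end of the buffer, sketched here under stated assumptions rather than taken from RESTHeart, is to let a CharsetDecoder tell us which bytes it could not yet consume and carry them over to the next read with compact(). The sketch uses the standard java.nio ReadableByteChannel and channel.read in place of the XNIO StreamSourceChannel and Channels.readBlocking used in the project; the class name and the roundTrip helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StreamingUtf8Reader {
    static String read(ReadableByteChannel channel, int capacity) throws IOException {
        if (capacity < 4) {
            // A UTF-8 sequence is at most 4 bytes; a smaller buffer could fill up
            // with an incomplete sequence and never make progress.
            throw new IllegalArgumentException("capacity must be >= 4");
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer in = ByteBuffer.allocate(capacity);
        CharBuffer out = CharBuffer.allocate(capacity);
        StringBuilder content = new StringBuilder();

        while (channel.read(in) != -1) {
            in.flip();
            decode(decoder, in, out, content, false);
            in.compact(); // keep the bytes of an incomplete trailing character
        }
        in.flip();
        decode(decoder, in, out, content, true); // end of input: leftovers are malformed
        decoder.flush(out);
        out.flip();
        content.append(out);
        return content.toString();
    }

    private static void decode(CharsetDecoder decoder, ByteBuffer in, CharBuffer out,
                               StringBuilder content, boolean endOfInput) {
        CoderResult result;
        do {
            result = decoder.decode(in, out, endOfInput);
            out.flip();
            content.append(out);
            out.clear();
        } while (result.isOverflow()); // repeat if the char buffer filled up
    }

    /** Hypothetical helper: encodes text as UTF-8 and reads it back through a channel. */
    static String roundTrip(String text, int capacity) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        try {
            return read(Channels.newChannel(new ByteArrayInputStream(bytes)), capacity);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "觢 敔耜 媓幁惁 簻臗藱, 鬵鵛嚪";
        System.out.println(roundTrip(text, 4).equals(text));
    }
}
```

The trade-off is more code than simply enlarging the buffer, but the result no longer depends on the body being shorter than the buffer.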

mkjsix commented 9 years ago

So are you suggesting to change the code from:

ByteBuffer buf = ByteBuffer.allocate(128)

to:

ByteBuffer buf = ByteBuffer.allocate(2048)

?

hxrain commented 9 years ago

@mkjsix, yes, I changed it temporarily, because my string length is < 2048.

mkjsix commented 9 years ago

Ok thanks, we'll do more tests later, we need to find a general solution.

hxrain commented 9 years ago

Yes,thanks!!

mkjsix commented 9 years ago

Here is some more interesting information: http://stackoverflow.com/questions/9860206/is-there-a-bug-while-encoding-utf-8-using-nio-buffers

mkjsix commented 9 years ago

This issue should be fixed in both master and 1.0.x branches, please verify.

hxrain commented 9 years ago

public static String read(StreamSourceChannel channel) throws IOException {
    final int capacity = 1024;

    ByteArrayOutputStream os = new ByteArrayOutputStream(capacity);
    ByteBuffer buf = ByteBuffer.allocate(capacity);

    while (Channels.readBlocking(channel, buf) != -1) {
        buf.flip();
        if (buf.remaining()==capacity)
            os.write(buf.array());
        else
            os.write(buf.array(), 0,buf.remaining());
        buf.clear();
    }

    return new String(os.toByteArray(), CHARSET);
}

hxrain commented 9 years ago

Should it check whether the data length != capacity?

hxrain commented 9 years ago

public static String read(StreamSourceChannel channel) throws IOException {
    final int capacity = 1024;

    ByteArrayOutputStream os = new ByteArrayOutputStream(capacity);
    ByteBuffer buf = ByteBuffer.allocate(capacity);

    while (Channels.readBlocking(channel, buf) != -1) {
        buf.flip();
        os.write(buf.array(), 0, buf.remaining());
        buf.clear();
    }

    return new String(os.toByteArray(), CHARSET);
}

mkjsix commented 9 years ago

No, we don't need to check the data length explicitly. The capacity passed to ByteArrayOutputStream is just the initial capacity; its internal buffer reallocates by itself if it needs additional memory. The ByteBuffer capacity, instead, is the maximum internal array length, and the Channels.readBlocking implementation takes care of not overflowing it. It's easy to verify this by looking at the source code of both classes.

I initially tested the new ChannelReader implementation with capacity=32 on a much bigger Chinese text and it worked perfectly. 1024 is just a safer assumption to ensure that most cases are resolved with fewer loop cycles without wasting too much memory.
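
To make the point about the two capacities concrete, here is a standalone sketch of the accumulate-then-decode approach. It is not the RESTHeart source: it assumes a standard java.nio ReadableByteChannel and channel.read instead of the XNIO StreamSourceChannel and Channels.readBlocking, and the class name and the roundTrip helper are hypothetical. All bytes are collected first and decoded once at the end, so the read buffer size cannot split a character.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class AccumulateThenDecode {
    static String read(ReadableByteChannel channel, Charset charset, int capacity)
            throws IOException {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        // Initial size only: the stream grows on its own as bytes are written.
        ByteArrayOutputStream os = new ByteArrayOutputStream(capacity);
        // Hard upper bound on how many bytes a single read can deliver.
        ByteBuffer buf = ByteBuffer.allocate(capacity);

        while (channel.read(buf) != -1) {
            buf.flip();
            os.write(buf.array(), 0, buf.remaining()); // copy only the bytes just read
            buf.clear();
        }
        // Decode once, over the complete byte sequence.
        return new String(os.toByteArray(), charset);
    }

    /** Hypothetical helper: encodes text as UTF-8 and reads it back through a channel. */
    static String roundTrip(String text, int capacity) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        try {
            return read(Channels.newChannel(new ByteArrayInputStream(bytes)),
                    StandardCharsets.UTF_8, capacity);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "渳湥牋 漻漍犕 橍殧澞 祣筇 摓 蜭蜸覟 鬵鵛嚪 嶜憃撊";
        System.out.println(roundTrip(text, 32).equals(text));
    }
}
```

The capacity only changes how many loop iterations are needed, not the decoded result, at the cost of holding the whole body in memory before decoding.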

Please feel free to reopen this issue in case your tests show any problem with Chinese characters, thank you.

hxrain commented 9 years ago

public static String read(StreamSourceChannel channel) throws IOException {
    final int capacity = 1024;

    ByteArrayOutputStream os = new ByteArrayOutputStream(capacity);

    ByteBuffer buf = ByteBuffer.allocate(capacity);

    int read = Channels.readBlocking(channel, buf);

    while (read != -1) {
        buf.flip();
        os.write(buf.array(), 0, read);
        buf.clear();

        read = Channels.readBlocking(channel, buf);
    }

    String ret = os.toString(CHARSET.name());

    return ret;
}

-------------------------------------::VS::-----------------------------------
public static String read(StreamSourceChannel channel) throws IOException {
    final int capacity = 1024;

    ByteArrayOutputStream os = new ByteArrayOutputStream(capacity);
    ByteBuffer buf = ByteBuffer.allocate(capacity);

    while (Channels.readBlocking(channel, buf) != -1) {
        buf.flip();
        os.write(buf.array(), 0, buf.remaining());
        buf.clear();
    }

    return new String(os.toByteArray(), CHARSET);
}
//----------------------------------------------------------------------
Which is better? :)

ujibang commented 9 years ago

Yours! Go for a pull request (in master).

hxrain commented 9 years ago

If you need it, use it directly.

SoftInstigate / restheart

utf-8 decode error. #58