SoftInstigate / restheart

Rapid API Development with MongoDB
https://restheart.org
GNU Affero General Public License v3.0

utf-8 decode error. #58

Closed hxrain closed 9 years ago

hxrain commented 9 years ago

Source: restheart/src/main/java/org/restheart/utils/ChannelReader.java:43

ByteBuffer buf = ByteBuffer.allocate(128);
while (Channels.readBlocking(channel, buf) != -1) {
    buf.flip();
    content.append(charset.decode(buf));
    buf.clear();
}

"Channels.readBlocking(channel, buf)" does not check where a UTF-8 byte sequence is cut off.
If a multi-byte UTF-8 character is split between two reads, "charset.decode(buf)" produces a wrong result.
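
The report can be reproduced without a server. The sketch below is an illustration, not RESTHeart code: the class and method names are hypothetical, and a plain byte array stands in for the channel. It decodes each 128-byte chunk on its own, as the quoted loop does, so a character that straddles a chunk boundary cannot be reassembled.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ChunkedDecodeDemo {
    private static final Charset CHARSET = StandardCharsets.UTF_8;

    // Mimics the quoted loop: every chunk is decoded independently.
    static String decodeChunkByChunk(byte[] data, int chunkSize) {
        StringBuilder content = new StringBuilder();
        for (int offset = 0; offset < data.length; offset += chunkSize) {
            int length = Math.min(chunkSize, data.length - offset);
            content.append(CHARSET.decode(ByteBuffer.wrap(data, offset, length)));
        }
        return content.toString();
    }

    // Builds a string of `count` copies of U+4E9C, a CJK character that takes 3 bytes in UTF-8.
    static String cjk(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append('\u4E9C');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = cjk(50);                        // 150 bytes in UTF-8
        byte[] bytes = original.getBytes(CHARSET);
        String decoded = decodeChunkByChunk(bytes, 128);  // 128 is not a multiple of 3
        System.out.println("round trip intact: " + original.equals(decoded));
        System.out.println("contains U+FFFD:   " + (decoded.indexOf('\uFFFD') >= 0));
    }
}
```

With a chunk size that happens to be a multiple of 3 (for example 129) the same input round-trips cleanly, which may explain why some test strings did not show the problem.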
ujibang commented 9 years ago

Hi hxrain,

If I understand correctly, you are worried that the buffer is read without checking the actual buffer limit.

However, this is what buf.flip() does:

public final Buffer flip()
Flips this buffer. The limit is set to the current position and then the position is set to zero. If the mark is defined then it is discarded.
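
A minimal sketch of that behaviour, using only java.nio.ByteBuffer (the class and helper names are made up for illustration). It shows that flip() bounds the readable region to the bytes actually written; it says nothing about whether that region ends on a character boundary.

```java
import java.nio.ByteBuffer;

public class FlipDemo {
    // Returns {position, limit} after writing `written` bytes and flipping.
    static int[] stateAfterFlip(int capacity, int written) {
        ByteBuffer buf = ByteBuffer.allocate(capacity);
        buf.put(new byte[written]);   // position advances to `written`
        buf.flip();                   // limit = old position, position = 0
        return new int[] { buf.position(), buf.limit() };
    }

    public static void main(String[] args) {
        int[] state = stateAfterFlip(128, 5);
        System.out.println("position=" + state[0] + " limit=" + state[1]);
    }
}
```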

Is this what you mean?

If so, can you also explain how to reproduce the error?

Thanks

hxrain commented 9 years ago

Chinese/CJK characters use 3 bytes in UTF-8, but "Channels.readBlocking(channel, buf)" may read only 1 or 2 bytes of a character at the end of the buffer, so "charset.decode(buf)" produces an error.
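
To make the arithmetic concrete: 128 is not a multiple of 3, so in a run of 3-byte characters a cut after 128 bytes lands inside a character. The sketch below (hypothetical names, standard library only) checks this by testing whether the byte at the cut position is a UTF-8 continuation byte of the form 10xxxxxx.

```java
import java.nio.charset.StandardCharsets;

public class Utf8BoundaryDemo {
    // True when `index` points at a continuation byte (10xxxxxx),
    // i.e. cutting the array at `index` would split a character.
    static boolean splitsCharacter(byte[] utf8, int index) {
        return index > 0 && index < utf8.length && (utf8[index] & 0xC0) == 0x80;
    }

    // UTF-8 bytes of `count` copies of U+4E9C (3 bytes each).
    static byte[] cjkBytes(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append('\u4E9C');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8 = cjkBytes(50);
        System.out.println("total bytes: " + utf8.length);
        System.out.println("cut at 128 splits a character: " + splitsCharacter(utf8, 128));
        System.out.println("cut at 129 splits a character: " + splitsCharacter(utf8, 129));
    }
}
```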

ujibang commented 9 years ago

I have looked for this issue and found this article: http://www.oracle.com/technetwork/articles/java/supplementary-142654.html

Can you confirm that this is the issue you are facing, i.e. dealing with a charset that is not a fixed-width 16-bit character encoding?

If so, I'm wondering if and how MongoDB deals with it: MongoDB stores data in BSON, and BSON strings are UTF-8, so it should not be able to handle your charset.

Have you tried to get the data with the mongo shell?

$ mongo YOURDB
MongoDB shell version: 3.0.1
connecting to: YOURDB
> db.YOURCOLLECTION.find({"_id": new ObjectId("YOUR_DOC_ID")})

You now need to check your string property. If it looks fine, it is a RESTHeart issue; if it doesn't, the issue is in MongoDB BSON...

Let me know

mkjsix commented 9 years ago

Hi all,

I already have a Python test for Chinese language and it works with RESTHeart, please have a look at: https://github.com/SoftInstigate/restheart-python-test/blob/master/test_basic_integration.py

We might add bigger Chinese documents to extend our tests, @hxrain could you help us?

hxrain commented 9 years ago

Hi: I sent a big UTF-8 JSON string to RESTHeart, and after it is inserted into MongoDB the content is wrong. The string is longer than 128 and contains Chinese.

"ByteBuffer buf = ByteBuffer.allocate(128);"
After changing 128 to 1024, it is OK.
ujibang commented 9 years ago

I double checked inserting a document with a 1000 chars long string, and it works as expected.

I suspect that you are not using a 16 bit charset encoding.

Can you please check what I suggested you in the previous comment?

ujibang commented 9 years ago

I also tried this (with httpie client) using a 128 chars long string with full-width CJK chars.

All worked.

You should send us a similar example to reproduce the issue:

$ http -a a:a POST 127.0.0.1:8080/test/huge long="亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜"
HTTP/1.1 201 Created
...
Location: http://127.0.0.1:8080/test/huge/56264342b8e4f66c6c62e6f8

$ http -a a:a GET http://127.0.0.1:8080/test/huge/56264342b8e4f66c6c62e6f8\?hal=c
HTTP/1.1 200 OK
...

{
    "_created_on": "2015-10-20T13:36:02Z", 
    "_etag": {
        "$oid": "56264342b8e4f66c6c62e6f9"
    }, 
    "_id": {
        "$oid": "56264342b8e4f66c6c62e6f8"
    }, 
    "_lastupdated_on": "2015-10-20T13:36:02Z", 
    "_links": {
        "curies": [], 
        "self": {
            "href": "/test/huge/56264342b8e4f66c6c62e6f8"
        }
    }, 
    "_type": "DOCUMENT", 
    "long": "亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜亜"
}
mkjsix commented 9 years ago

@ujibang I tested with a long JSON string in Chinese. I have created a JSON file with the atom editor and inserted it into RESTHeart using the command line:

http -a admin:admin POST 192.168.99.100/test/chinese < chinese.json

It seems that some words are corrupted: for example, 鬵鵛嚪 is stored as ���鵛嚪, so the start of the word is replaced by those question marks.

I can't see any error in my logs. But if I insert the same string directly with Robomongo then I don't see the wrong characters, so the issue seems to be related to RESTHeart during insertion.

$ http -a admin:admin http://192.168.99.100/test/chinese/56264dbae4b0912b466af982
HTTP/1.1 200 OK
Access-Control-Allow-Credentials: true
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Location, ETag, Auth-Token, Auth-Token-Valid-Until, Auth-Token-Location
Auth-Token: fc01492b-c1d0-4128-9549-780ced1b6992
Auth-Token-Location: /_authtokens/admin
Auth-Token-Valid-Until: 2015-10-20T14:35:53.739Z
Connection: keep-alive
Content-Encoding: gzip
Content-Length: 2731
Content-Type: application/hal+json
Date: Tue, 20 Oct 2015 14:20:53 GMT
ETag: 56264dbae4b0912b466af983

{ "_created_on": "2015-10-20T14:20:42Z", "_embedded": {}, "_etag": { "$oid": "56264dbae4b0912b466af983" }, "_id": { "$oid": "56264dbae4b0912b466af982" }, "_lastupdated_on": "2015-10-20T14:20:42Z", "_links": { "curies": [ { "href": "http://restheart.org/curies/1.0/{rel}.html", "name": "rh", "templated": true } ], "rh:coll": { "href": "/test/chinese" }, "rh:document": { "href": "/test/chinese/{docid}?id_type={type}", "templated": true }, "self": { "href": "/test/chinese/56264dbae4b0912b466af982" } }, "_type": "DOCUMENT", "text": "觢 敔耜 媓幁惁 簻臗藱, 鳼鳹鴅 殽毰毲 撖 豇貣, 渳湥牋 漻漍犕 橍殧澞 祣筇 摓 蜭蜸覟 ���鵛嚪 嶜憃撊 馺 垼娕, 墏 瓥籪艭 悊惀桷 邥佹 筡絼綒 鍆錌雔 潫 漦澌, 櫞氌瀙 桏毢涒 揗斝湁 泏��� 蜙, 篧糑縒 浶洯浽 鼢曘 穊 齹鑶鸓 磩磟窱 莃荶衒 嵥 臒薽, 毹 莃荶 銈銙鉾 澂漀潫 縢羱聬 蟣襋謯 鳻嶬幧 媝寔嵒 蘥蠩 滈瘑睯碫 峷敊浭 蠛趯 嵥, 緌翢 榃痯痻 泏狔狑 鳱, 骱 眊砎粁 訰貥郪 娞弳 鉌 姎岵帔 碢禗禈 蟼襛 墆 澂漀潫 塥搒楦 煘煓瑐 甀瞂, 礔繠 褅 郺鋋錋 薉蕺薂 箖緌翢, 侺咥垵 ��槷殦 倓剟唗 諃 笓粊 墐墆墏 礯籔羻 諙 趉軨, 摮 鑕鬞鬠 煘煓瑐 豅鑢鑗 嬦憼 緦 鷃黫鼱 袀豇貣 ��褆諓 懥斶 潣 憢憉摮 汫汭沎 氃濈瀄 槷殦馦騧騜 涬淠淉 氠洷 銪, 鈖嗋圔 樏殣氀 塝 埱娵 鸄齴 幓 鮛鮥鴮 氃濈瀄 猺矠筸 柦柋牬 嶝仉圠 裍裚詷 殠 瀪璷, 瀗犡 蝺 螷蟞覮 郙鬯偟 惝掭掝 妎岓岕 藒���謥 蒛蜙 踆 逯郹酟 螏螉褩 饓鶪齠 鋡 禫穛, 磝磢 賗 韣顪飋 棦殔湝儇 齹鑶鸓 紵脭脧 屼汆冹 圢�� 痵 噮噦噞 樧槧樈 羭聧蔩 趛踠 嫷 嫶嶕 忀瀸蘌 烗猀珖, 壾嵷幓 螭蟅謕 翣聜蒢 痵 窱縓, 咍垀 緦 峬峿峹 儋圚墝 茇茺苶 楟棰 鋱 譖貚趪 溹溦滜, 燲獯璯 撌斳暩 笢笣 魆, 鄨鎷 痵 忣抏旲 嬏嶟樀 觢 耇胇赲 嗼嗹墋 鈖嗋圔 鶀嚵, 蜪裺 璸瓁穟 鑤仜伒 戫摴撦 嬃, 禠 轒醭鏹 玾珆玸 璻甔礔 葎萻 幨懅憴 臡虈觿 漊 硻禂, 踆 鞈頨頧 剆坲姏 烳牼翐 璻甔嗢嗂塝 倱哻圁 嶝仉圠 殟 鶾鷃, 煔 澂漀潫 韰頯餩 珝砯砨 紒翀 彃 圞趲 萷葋蒎 圢帄氕 鵵鵹鵿, 酳 鶀嚵巆 灉礭蘠 郲郔, 廲籗糴 祪笰笱 脀蚅蚡 箹糈 魦 擙樲 渳湥牋 毚丮厹 絼 綒 虰豖阹 瀤瀪璷 籗糴 鋱 鬎鯪鯠 鍹餳駷 萇雊蜩 韰頯, 跬 榎榯槄 ��弝彶 浞浧浵 駺駹, 蓨蝪 嵧 粞絧絏 蓂蓌蓖撱 雈靮傿 觾韄鷡 臡虈觿 觨誖, 萶葠 籿紁羑 輐銛靾 濍燂犝 馺, 婂崥 禠 垥娀庣 艎艑蔉 嬦憼檓 餀 粞絧絏 鬖鰝鰨 嵀惉, 翣聜 匢奾灱 梴棆棎 鶊鵱鶆 葝 �� 愮揫 濷瓂癚 荾莯袎 鶷鷇 嗢 訬軗郲 檌檒濦 暕 銈銙鉾 譖貚趪 葠蜄蛖 滱漮, 甀 磭篧 釸釪傛 鈊��閍 蝑蝞蝢曒檃 罞耖茭 蜦賕踃 璈皞緪 歅, 輠 潧潣瑽 躨钀钁 庣斪 緳廞徲 韣顪飋 鑴鱱爧 馺 甀瞂, 跾 毞泂泀 蝑蝞蝢 磎磃 箄縴儳 贄蹝轈 徲 籗糴, 隒雸 鼥儴壛 纑臞蘬 稯 馦騧騜 誽賚賧 跬 跐鉠, ���蜸覟 鞻饙騴 姛帡 慛, 溮煡煟 觶譈譀 蝯 瀪璷霺顤 幨懅憴 絒翗腏 滘, 轛轝酅 驨訑紱 蕇蕱 跠 嬦���檓 鍌鍗鍷 磏 蚙迻, 粞絧 鐩闤鞿 螒螝螜 蓪 諙踣踙 饓鶪齠 諃 賧趡, 踆跾 跠 畟痄笊 葎萻萶 蜙 ��鉌鳭 哱哸娗 蕇蕱 咍垀坽 輣鋄銶 礌簨繖 澉 儴壛, 榾毄 舝 榶榩榿 螾褾賹 峷敊浭榯 嶝仉圠 箖緌��� 鬄鵊 堔埧娾 郺鋋錋 礌簨繖 骱 瞵瞷, 殀毞 瑽 駽髾髽 礌簨繖 鄻鎟霣, 訬軗郲 磃箹糈 慔 瀁瀎 鸄齴 鼏噳墺 鷖鼳鼲 垽娭屔 磏, 笢笣紽 撌斳暩 踣踙 葝 鱙鷭 驨訑紱 贄蹝轈 潣 煔 滈溔滆 趍跠跬 螷蟞吙仜 褅褌諃 藒襓謥 賗 譧躆 竀篴臌 緱翬膞 慛, 韎 鑤仜伒 蜬蝁蜠 噾噿 珋疧眅 驧鬤鸕 僄塓塕 ��� 闠鞻, 腠腶舝 蜦賕踃 滈 齠齞 跣 襡襙 濇燖燏 礛簼繰 藙藨蠈, 耖茭 趍 
匢奾灱 漻漍犕, 銈 醑醏�� 耜僇鄗 抩枎殀 嫆嫊 蕷薎 彃慔慛 櫞氌瀙 毄滱漮 嵧, 泏狔狑 翍脝艴 鬳鴙 輠" }

mkjsix commented 9 years ago

Ok, not sure about this, as I made another test and it is working. I suspect in the test above some automatic conversion was happening before hitting the server, maybe when copying the text to my editor.

I actually inserted the JSON:

{
  "use": "restheart",
  "text": "觢 敔耜 媓幁惁 簻臗藱, 鳼鳹鴅 殽毰毲 撖 豇貣, 渳湥牋 漻漍犕 橍殧澞 祣筇 摓 蜭蜸覟 鬵鵛嚪 嶜憃撊 馺 垼娕å"
}

Then no issue at all:

$ http -a admin:admin http://192.168.99.100/test/chinese/56265490e4b09
HTTP/1.1 200 OK
Access-Control-Allow-Credentials: true
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Location, ETag, Auth-Token, Auth-Token-Valid-Until, Auth-Token-Location
Auth-Token: f10dba54-aad8-4cb3-8d08-f0e5d31dca66
Auth-Token-Location: /_authtokens/admin
Auth-Token-Valid-Until: 2015-10-20T15:08:40.610Z
Connection: keep-alive
Content-Encoding: gzip
Content-Length: 462
Content-Type: application/hal+json
Date: Tue, 20 Oct 2015 14:53:40 GMT
ETag: 56265490e4b0912b466af985

{
    "_created_on": "2015-10-20T14:49:52Z", 
    "_embedded": {}, 
    "_etag": {
        "$oid": "56265490e4b0912b466af985"
    }, 
    "_id": {
        "$oid": "56265490e4b0912b466af984"
    }, 
    "_lastupdated_on": "2015-10-20T14:49:52Z", 
    "_links": {
        "curies": [
            {
                "href": "http://restheart.org/curies/1.0/{rel}.html", 
                "name": "rh", 
                "templated": true
            }
        ], 
        "rh:coll": {
            "href": "/test/chinese"
        }, 
        "rh:document": {
            "href": "/test/chinese/{docid}?id_type={type}", 
            "templated": true
        }, 
        "self": {
            "href": "/test/chinese/56265490e4b0912b466af984"
        }
    }, 
    "_type": "DOCUMENT", 
    "text": "觢 敔耜 媓幁惁 簻臗藱, 鳼鳹鴅 殽毰毲 撖 豇貣, 渳湥牋 漻漍犕 橍殧澞 祣筇 摓 蜭蜸覟 鬵鵛嚪 嶜憃撊 馺 垼娕å", 
    "use": "restheart"
}
hxrain commented 9 years ago

Thanks all! Thanks @mkjsix! Yes, that is it.

mkjsix commented 9 years ago

Good @hxrain, so is it working now? If you could leave some explanation here of what was going wrong, it might be helpful for other people dealing with the Chinese language in the future.

hxrain commented 9 years ago

Hi all, for now I have temporarily changed 128 to 2048. The source code needs to be modified so that it correctly processes 3-byte UTF-8 CJK strings.

hxrain commented 9 years ago

A fixed 128-byte buffer is too simplistic; the code needs to check the end position so that it does not split a run of CJK UTF-8 bytes.
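
One standard way to respect the end position is incremental decoding with java.nio.charset.CharsetDecoder, which leaves the bytes of an incomplete trailing character in the buffer until the rest arrives. The sketch below is an assumption-laden illustration, not the fix adopted in this thread: the class and method names are hypothetical, and a plain ReadableByteChannel stands in for the StreamSourceChannel and Channels.readBlocking used by RESTHeart.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class IncrementalDecodeDemo {
    // Decodes what it can and moves the produced chars into `content`,
    // repeating while the char buffer overflows.
    private static void decode(CharsetDecoder decoder, ByteBuffer in, CharBuffer out,
                               StringBuilder content, boolean endOfInput) {
        CoderResult result;
        do {
            result = decoder.decode(in, out, endOfInput);
            out.flip();
            content.append(out);
            out.clear();
        } while (result.isOverflow());
    }

    static String read(ReadableByteChannel channel, int capacity) throws IOException {
        if (capacity < 4) {
            // A UTF-8 character needs up to 4 bytes; smaller buffers could stall.
            throw new IllegalArgumentException("capacity must be at least 4");
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer bytes = ByteBuffer.allocate(capacity);
        CharBuffer chars = CharBuffer.allocate(capacity);
        StringBuilder content = new StringBuilder();

        while (channel.read(bytes) != -1) {
            bytes.flip();
            decode(decoder, bytes, chars, content, false);
            bytes.compact();  // keep the bytes of an incomplete trailing character
        }
        bytes.flip();
        decode(decoder, bytes, chars, content, true);  // signal end of input
        decoder.flush(chars);
        chars.flip();
        content.append(chars);
        return content.toString();
    }

    // Convenience wrapper for in-memory data (a stand-in for a real request body).
    static String readBytes(byte[] data, int capacity) {
        try {
            return read(Channels.newChannel(new ByteArrayInputStream(data)), capacity);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50; i++) {
            sb.append('\u4E9C');  // 3 bytes each in UTF-8
        }
        String original = sb.toString();
        String decoded = readBytes(original.getBytes(StandardCharsets.UTF_8), 128);
        System.out.println("round trip intact: " + original.equals(decoded));
    }
}
```

The alternative is to collect all bytes first and decode once at the end, which is simpler but holds the whole body in memory as bytes before building the string.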

mkjsix commented 9 years ago

So are you suggesting that we change the code from: ByteBuffer buf = ByteBuffer.allocate(128) to: ByteBuffer buf = ByteBuffer.allocate(2048)?

hxrain commented 9 years ago

@mkjsix, yes, I changed it temporarily, because my strings are shorter than 2048.

mkjsix commented 9 years ago

Ok thanks, we'll do more tests later, we need to find a general solution.

hxrain commented 9 years ago

Yes,thanks!!

mkjsix commented 9 years ago

Here is some more interesting information: http://stackoverflow.com/questions/9860206/is-there-a-bug-while-encoding-utf-8-using-nio-buffers

mkjsix commented 9 years ago

This issue should be fixed in both master and 1.0.x branches, please verify.

hxrain commented 9 years ago
public static String read(StreamSourceChannel channel) throws IOException {
    final int capacity = 1024;

    ByteArrayOutputStream os = new ByteArrayOutputStream(capacity);
    ByteBuffer buf = ByteBuffer.allocate(capacity);

    while (Channels.readBlocking(channel, buf) != -1) {
        buf.flip();
        if (buf.remaining()==capacity)
            os.write(buf.array());
        else
            os.write(buf.array(), 0,buf.remaining());
        buf.clear();
    }

    return new String(os.toByteArray(), CHARSET);
}
hxrain commented 9 years ago

should check data length != capacity ?

hxrain commented 9 years ago

public static String read(StreamSourceChannel channel) throws IOException {
    final int capacity = 1024;

    ByteArrayOutputStream os = new ByteArrayOutputStream(capacity);
    ByteBuffer buf = ByteBuffer.allocate(capacity);

    while (Channels.readBlocking(channel, buf) != -1) {
        buf.flip();
        os.write(buf.array(), 0, buf.remaining());
        buf.clear();
    }

    return new String(os.toByteArray(), CHARSET);
}

mkjsix commented 9 years ago

No, we don't need to check the data length explicitly. The capacity passed to ByteArrayOutputStream is just the initial capacity; its internal buffer reallocates by itself if it needs additional memory. The ByteBuffer capacity, instead, is the maximum length of the internal array, and the Channels.readBlocking implementation takes care of not overflowing it. It's easy to verify this by looking at the source code of both classes.

I initially tested the new ChannelReader implementation with capacity=32 on a much bigger Chinese text and it worked perfectly. 1024 is just a safer assumption, so that most cases are resolved with fewer loop cycles without wasting too much memory.
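
The behaviour described here can be checked in isolation. The following is a minimal sketch, assuming a plain ReadableByteChannel in place of the StreamSourceChannel and Channels.readBlocking used in the real ChannelReader; the class name and the readBytes helper are hypothetical. Raw bytes are accumulated across reads and decoded only once, so the buffer capacity affects the number of loop cycles but not the result.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AccumulateThenDecodeDemo {
    private static final Charset CHARSET = StandardCharsets.UTF_8;

    // `capacity` must be positive, otherwise read() could never make progress.
    static String read(ReadableByteChannel channel, int capacity) throws IOException {
        ByteArrayOutputStream os = new ByteArrayOutputStream(capacity); // initial size, grows on demand
        ByteBuffer buf = ByteBuffer.allocate(capacity);                 // upper bound for a single read

        while (channel.read(buf) != -1) {
            buf.flip();
            os.write(buf.array(), 0, buf.remaining()); // copy only the bytes just read
            buf.clear();
        }
        // Decode once, after all bytes are available, so no character is split.
        return new String(os.toByteArray(), CHARSET);
    }

    // Convenience wrapper for in-memory data (a stand-in for a real request body).
    static String readBytes(byte[] data, int capacity) {
        try {
            return read(Channels.newChannel(new ByteArrayInputStream(data)), capacity);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append('\u4E9C');  // 3 bytes each in UTF-8
        }
        String original = sb.toString();
        String decoded = readBytes(original.getBytes(CHARSET), 32);
        System.out.println("round trip intact with capacity 32: " + original.equals(decoded));
    }
}
```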

Please feel free to reopen this issue in case your tests show any problem with Chinese characters, thank you.

hxrain commented 9 years ago
public static String read(StreamSourceChannel channel) throws IOException {
    final int capacity = 1024;

    ByteArrayOutputStream os = new ByteArrayOutputStream(capacity);

    ByteBuffer buf = ByteBuffer.allocate(capacity);

    int read = Channels.readBlocking(channel, buf);

    while (read != -1) {
        buf.flip();
        os.write(buf.array(), 0, read);
        buf.clear();

        read = Channels.readBlocking(channel, buf);
    }

    String ret = os.toString(CHARSET.name());

    return ret;
}

-------------------------------------::VS::-----------------------------------
public static String read(StreamSourceChannel channel) throws IOException {
    final int capacity = 1024;

    ByteArrayOutputStream os = new ByteArrayOutputStream(capacity);
    ByteBuffer buf = ByteBuffer.allocate(capacity);

    while (Channels.readBlocking(channel, buf) != -1) {
        buf.flip();
        os.write(buf.array(), 0, buf.remaining());
        buf.clear();
    }

    return new String(os.toByteArray(), CHARSET);
}
//----------------------------------------------------------------------
Which is better? :)

ujibang commented 9 years ago

Yours! Go for a pull request (in master).

hxrain commented 9 years ago

If you need it, use it directly.