agirbal / umongo

Desktop app to browse and administer your MongoDB cluster
http://www.edgytech.com/umongo/

Wrong export collection format in multi-languages (with arabic) #221

Closed SONEINT closed 10 years ago

SONEINT commented 10 years ago

  • MongoDB: version 4.2.2
  • Data mining program: Java program using the Twitter4J library, developed in IntelliJ IDEA
  • Aim: mining Twitter data, storing it in MongoDB, analysing it in R
  • Why UMongo: the rmongodb library in R has limitations with large data sets

I have a problem with the export function in UMongo for tweets in Arabic:

1/ The BSON export does not appear to be readable as text on Windows or Mac OS X; Arabic terms could be detected in R with text-mining clustering algorithms; BSON files are not readable in R (data and metadata).

Ex: One tweet in English (OK) and one tweet in Arabic (KO here, but the text of the tweet is readable in R)

_id RÉÎM6w¡ÿZ‡  sleymen71 retweet_count  tweet_followers_count ~ source P TweetCaster for Android tweet_mentioned_count  tweet_ID H§ðÓtweet_text ? RT @FiratGunay: Do we all agree? #syria http://t.co/SYEhIb8lo4 O 

_id RÉÎM6w¡ÿZ‡¡user_name InnssaannI retweet_count  tweet_followers_count ñ source  web tweet_mentioned_count  tweet_ID `ˆâðÓtweet_text ¤ RT @fahadjabbar1: داعش وبيان حقيقتهم Ø› كلمه موجزه ورائعه للشيخ عبدالعزيز الفوزان عن http://t.co/DFFJktaVmw 

2/ The JSON export appears to be readable, but the Arabic terms are not understandable as text on Windows or Mac OS X; Arabic terms could not be detected in R with text-mining clustering algorithms; JSON files are readable in R (except for the Arabic text of the tweets).

Ex: One tweet in english (OK) and one tweet in arabic (KO with ????)

{ "_id" : { "$oid" : "52c9ce4d36771ea1ff5a87a0"} , "user_name" : "sleymen71" , "retweet_count" : 5 , "tweet_followers_count" : 126 , "source" : "<a href=\"http://www.tweetcaster.com\" rel=\"nofollow\">TweetCaster for Android" , "tweet_mentioned_count" : 1 , "tweet_ID" : 419943196131856384 , "tweet_text" : "RT @FiratGunay: Do we all agree? #syria http://t.co/SYEhIb8lo4"}

{ "_id" : { "$oid" : "52c9ce4d36771ea1ff5a87a1"} , "user_name" : "InnssaannI" , "retweet_count" : 7 , "tweet_followers_count" : 753 , "source" : "web" , "tweet_mentioned_count" : 1 , "tweet_ID" : 419943192830959616 , "tweet_text" : "RT @fahadjabbar1: ???? ????? ??????? ? ???? ????? ?????? ????? ????????? ??????? ?? http://t.co/DFFJktaVmw"}

Ex (CSV export): One tweet in English (OK) and one tweet in Arabic (KO with ????)

{ "$oid" : "52c9ce4d36771ea1ff5a87a0"},"sleymen71",5,126,"<a href=\"http://www.tweetcaster.com\" rel=\"nofollow\">TweetCaster for Android",1, null ,"RT @FiratGunay: Do we all agree? #syria http://t.co/SYEhIb8lo4"

{ "$oid" : "52c9ce4d36771ea1ff5a87a1"},"InnssaannI",7,753,"web",1, null ,"RT @fahadjabbar1: ???? ????? ??????? ? ???? ????? ?????? ????? ????????? ??????? ?? http://t.co/DFFJktaVmw"

Conclusion: the export is not consistent here, and I cannot analyse my data sets in R for data mining and machine learning.

Cyrille.

agirbal commented 10 years ago

Sorry for the delay on this, looking into it

agirbal commented 10 years ago

It looks like the shell doesn't like the Arabic UTF-8, it screws up the prompt... I'll try to insert from UMongo. Otherwise, any chance you could send me the BSON docs that fail?

agirbal commented 10 years ago

It looks like I am able to insert and export the following documents fine. The text should really be specified in UTF8, otherwise bad things will happen, both in mongo drivers and the shell. UMongo mostly deals with Java strings (~UTF16) as passed by the driver, so there is not much to fix there, as long as the driver can (de)serialize them. On import / export to JSON / BSON it is assumed that the strings are UTF8.

{ "_id" : { "$oid" : "538ba9a9add22126d4c5bc50"} , "str" : "تشكيل نهاية حدة إذ. قد وإيطالي المتساقطة، دون, مع بزوال بينما وفي. مع دحر التخطيط والروسية, للجزر أوكيناوا وبريطانيا من قصف. جوي عل وأكثرها بريطانيا-فرنسا."} { "_id" : { "$oid" : "538bab2b3004255058e8d6f5"} , "str" : "أنا قادر على أكل الزجاج و هذا لا يؤلمني."}

Maybe you could send me documents that are broken? Thanks.
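The UTF-8 assumption above can be illustrated outside UMongo. The following is a minimal sketch in Python, not UMongo's export code. The sample string is taken from the tweets in this thread, and the use of Latin-1 as the "wrong" charset is an assumption chosen only for illustration. Encoding Arabic text with a charset that has no Arabic letters substitutes `?` for each letter, which resembles the `????` seen in the JSON and CSV exports, while a UTF-8 round trip keeps the text intact.

```python
# Illustration only: this is not UMongo's export code.
# Assumption: the text was written with a legacy single-byte charset
# (Latin-1 is used here as a stand-in) instead of UTF-8.

text = "RT @abohasan_1: جبهة النصرة"

# Latin-1 has no Arabic letters, so each one is replaced by "?".
lossy = text.encode("latin-1", errors="replace").decode("latin-1")

# UTF-8 can represent every Unicode character, so the round trip is lossless.
round_trip = text.encode("utf-8").decode("utf-8")

print(lossy)               # ASCII part survives, Arabic letters become "?"
print(round_trip == text)  # the UTF-8 round trip preserves the text
```

If an export shows one `?` per Arabic letter with spaces and ASCII text preserved, a lossy charset conversion of this kind is a plausible cause, and the original characters cannot be recovered from that file.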

SONEINT commented 10 years ago

Hi,

I have tested with several data sets, with the same results. I chose UMongo for big data sets, for which rmongodb did not give me very good results (they are probably too big for my PC, which has only 16 GB of RAM: extraction in R stopped after about 40000 rows, though not always at the same point). But rmongodb worked properly with relatively small data sets containing Arabic UTF-8 text, and let me analyse the data in R.

I am away from home for a couple of days. I will have a look on Tuesday evening to see if I can share a representative data set with you, perhaps through my GitHub account.

Cyrille

PS: I am working towards a PhD in social network analysis and terrorism studies.


SONEINT commented 10 years ago

OK. I use a Java program to collect data from the Twitter API. I assume that if rmongodb handles it properly on small data sets, it should be fine in UMongo? Where can I check the format of the tweets in UMongo? I can read the Arabic terms in all the UMongo windows.

Cyrille


agirbal commented 10 years ago

Cyrille, I'm a bit confused about what is broken. Let's forget rmongodb for a minute and only consider MongoDB and UMongo. From what you are saying:

  • You collect data from the Twitter API with a Java app and write it to MongoDB. I'm assuming all strings are handled as UTF.
  • From UMongo, you do a find(): can you see the characters properly in the window?
  • From UMongo, you do exports: do those look correct? If not, which one is broken?

Thanks. PS: sounds interesting :)

SONEINT commented 10 years ago

You are right. Sorry for my poor English, I need to improve :-). Anyway:

  • The data is readable in UMongo and the find command works properly, finding Arabic terms in UTF-8.
  • I'll check one more time today with the exports I did a few weeks ago. I have changed R version and learned new R tricks for cleaning data sets with different packages.
  • If unsuccessful, I will send you a sample tomorrow night.

Thanks for everything.

Cyrille.

SONEINT commented 10 years ago

Hi,

Sorry, I have just come back home. I can send you a small data set to test on your side. How can I do that? (I have exported small data sets of around 75 MB each, which is too large for email.) On my side, Arabic terms are displayed correctly in UMongo in the Result panel after a find. When you select the text view in UMongo, the Arabic characters are replaced by ????. When I export to any format, on Windows or Mac OS X, analysis in R with different functions (data.table, etc.) gives me ??? in place of the Arabic terms.

Cyrille.

agirbal commented 10 years ago

Could you just extract a few documents that don't work, instead of the full dump? That way it will be smaller. Or at least you can gzip it. Then please send it by email to antoine@mongodb.com

SONEINT commented 10 years ago

Antoine,

I have checked several export file types. Below I have copied 5 tweets in different file types (BSON, JSON, CSV), collected with 2 kinds of search parameter on the Twitter API: an English "#Aleppo" search and an Arabic "#Aleppo" search. I have also added the output obtained directly from MongoDB with the rmongodb package in the R environment (I call it the RMONGODB type).

###################################

Tweet 1 : english text

BSON

1_idR…°∆6w@ñE°U∂user_nameced_labretweet_counttweet_followers_countksourcewebtweet_mentioned_counttweet_IDqôƒ”tweet_textâThe CHRONICLE EYE : Ahrar al-#Sham is clearly fighting #ISIS where its men storm some #Manbij buildings (#Aleppo) http://t.co/TLiQPIT4V6L

JSON

{ "_id" : { "$oid" : "52c9a1c63677409645a155b6"} , "user_name" : "ced_lab" , "retweet_count" : 0 , "tweet_followers_count" : 875 , "source" : "web" , "tweet_mentioned_count" : 0 , "tweet_ID" : 419895353580589056 , "tweet_text" : "The CHRONICLE EYE : Ahrar al-#Sham is clearly fighting #ISIS where its men storm some #Manbij buildings (#Aleppo) http://t.co/TLiQPIT4V6"}

RMONGODB

_id user_name retweet_count tweet_followers_count source tweet_mentioned_count tweet_ID 1 1 ced_lab 0 875 web 0 419895353580589056 tweet_text 1 The CHRONICLE EYE : Ahrar al-#Sham is clearly fighting #ISIS where its men storm some #Manbij buildings (#Aleppo) http://t.co/TLiQPIT4V6

Tweet 2 : arabic text

BSON

_idR…°∆6w@ñE°U∑user_name aleppo_mediaretweet_counttweet_followers_countsourcewebtweet_mentioned_counttweet_IDpàSòƒ”tweet_textüRT @abohasan_1: ÿ¨ÿ®Ÿáÿ© ÿߟџÜÿµÿ±ÿ© (ŸÖŸáÿßÿ¨ÿ±ŸäŸÜŸáÿß Ÿàÿ£ŸÜÿµÿßÿ±Ÿáÿß ) ŸÖŸÇÿ±ÿßÿ™Ÿáÿß ŸÖŸÉÿßŸÜ ÿ¢ŸÖŸÜ ŸÑŸÉŸÑ ŸÖŸÜ ŸäÿÆÿ¥Ÿâ ÿπŸÑŸâ ŸÜŸÅÿ≥Ÿá ÿߟÑÿ¢ÿ∞Ÿâ .V

JSON

{ "_id" : { "$oid" : "52c9a1c63677409645a155b7"} , "user_name" : "aleppo_media" , "retweet_count" : 24 , "tweet_followers_count" : 6 , "source" : "web" , "tweet_mentioned_count" : 1 , "tweet_ID" : 419895348791111680 , "tweet_text" : "RT @abohasan_1: ???? ?????? (????????? ???????? )\n ??????? ???? ??? ??? ?? ???? ??? ???? ????? ."}

RMONGODB

_id user_name retweet_count tweet_followers_count source tweet_mentioned_count tweet_ID 2 1 aleppo_media 24 6 web 1 419895348791111680 tweet_text User_id 2 RT @abohasan_1: جبهة النصرة (مهاجرينها وأنصارها )\n مقراتها مكان آمن لكل من يخشى على نفسه الآذى .

Tweet 3 : arabic text

BSON

_idR…°∆6w@ñE°U∏user_name Dream_alepporetweet_counttweet_followers_countsourceAGharedlyComtweet_mentioned_counttweet_ID0Hflàƒ”tweet_textlÿߟџџáŸÖ ÿßÿ±ÿ≤ŸÇŸÜÿß ÿߟÑÿπŸÅŸà Ÿà ÿߟÑÿπÿߟşäÿ© ÿ®ÿߟÑÿØŸÜŸäÿß Ÿà ÿߟÑÿ¢ÿÆÿ±ÿ© http://t.co/nz5cBxBfvip

JSON

{ "_id" : { "$oid" : "52c9a1c63677409645a155b8"} , "user_name" : "Dream_aleppo" , "retweet_count" : 0 , "tweet_followers_count" : 29 , "source" : "<a href=\"http://www.Gharedly.com\" rel=\"nofollow\">GharedlyCom" , "tweet_mentioned_count" : 0 , "tweet_ID" : 419895282416234497 , "tweet_text" : "????? ?????? ????? ? ??????? ??????? ? ?????? http://t.co/nz5cBxBfvi"}

RMONGODB

  _id    user_name retweet_count tweet_followers_count

3 97216448 Dream_aleppo 0 29 source tweet_mentioned_count tweet_ID 3 GharedlyCom 0 419895282416234496 tweet_text User_id User_Name Creation_date 3 اللهم ارزقنا العفو و العافية بالدنيا و الآخرة http://t.co/nz5cBxBfvi

Tweet 4 : arabic text

BSON

_idR…°∆6w@ñE°Uπuser_namevoice_of_alepporetweet_counttweet_followers_countrsourceFFacebooktweet_mentioned_counttweet_IDà∞wƒ”tweet_text~ÿπŸàÿØÿ© ÿߟџÖÿßÿ° ÿߟџâ ÿ®ÿπÿ∂ ŸÖŸÜÿßÿ∑ŸÇ ŸÖÿØŸäŸÜÿ© ÿ≠ŸÑÿ®

الي اجته المي يكتبلنا اسم المنطقةI

JSON

{ "_id" : { "$oid" : "52c9a1c63677409645a155b9"} , "user_name" : "voice_of_aleppo" , "retweet_count" : 0 , "tweet_followers_count" : 114 , "source" : "<a href=\"http://www.facebook.com/twitter\" rel=\"nofollow\">Facebook" , "tweet_mentioned_count" : 0 , "tweet_ID" : 419895208617443328 , "tweet_text" : "???? ????? ??? ??? ????? ????? ???\n\n??? ???? ???? ??????? ??? ???????"}

RMONGODB

_id user_name retweet_count tweet_followers_count 4 0 voice_of_aleppo 0 114 source tweet_mentioned_count tweet_ID 4 Facebook 0 419895208617443328 tweet_text User_id User_Name Creation_date 4 عودة الماء الى بعض مناطق مدينة حلب\n\nالي اجته المي يكتبلنا اسم المنطقة

Tweet 5 : english text

BSON

_idR…°∆6w@ñE°U∫user_nameMyNetLikeMUSICretweet_counttweet_followers_countQsourceIWin the Customertweet_mentioned_counttweet_IDPBeƒ”tweet_textUSyria: Helicopters Drop 'Barrel Bombs' On Aleppo... http://t.co/g5zoItTM5uá

JSON

{ "_id" : { "$oid" : "52c9a1c63677409645a155ba"} , "user_name" : "MyNetLikeMUSIC" , "retweet_count" : 0 , "tweet_followers_count" : 1361 , "source" : "<a href=\"http://winthecustomer.com/\" rel=\"nofollow\">Win the Customer" , "tweet_mentioned_count" : 0 , "tweet_ID" : 419895130481381377 , "tweet_text" : "Syria: Helicopters Drop 'Barrel Bombs' On Aleppo... http://t.co/g5zoItTM5u"}

RMONGODB

_id user_name retweet_count tweet_followers_count 5 0 MyNetLikeMUSIC 0 1361 source tweet_mentioned_count 5 Win the Customer 0 tweet_ID tweet_text User_id 5 419895130481381376 Syria: Helicopters Drop 'Barrel Bombs' On Aleppo... http://t.co/g5zoItTM5u

###################################
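A note on the BSON samples above: runs such as `ÿ¨ÿ®Ÿáÿ©` are what UTF-8 encoded Arabic looks like when its bytes are displayed as Mac OS Roman, so the BSON export may be intact and only the viewer's charset wrong. The sketch below is a hedged illustration in Python. The Mac OS Roman assumption is inferred from the sample text and is not confirmed anywhere in this thread. It shows the mapping in both directions for the first word of tweet 2.

```python
# Assumption: the BSON file was opened in a text viewer that decoded the
# bytes as Mac OS Roman. BSON strings are UTF-8 by specification.

arabic = "جبهة"  # first word of tweet 2, as shown in the RMONGODB output

# What a Mac OS Roman viewer displays for the UTF-8 bytes of that word.
mojibake = arabic.encode("utf-8").decode("mac_roman")

# Reversing the mistake recovers the original text.
recovered = mojibake.encode("mac_roman").decode("utf-8")

print(mojibake == "ÿ¨ÿ®Ÿáÿ©")  # compare with the start of tweet 2's BSON text
print(recovered == arabic)
```

Unlike the `????` in the JSON samples, this kind of garbling is reversible, because no bytes were lost. Reading the BSON file with a BSON-aware tool rather than a text editor avoids the problem.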

Kind regards,

Cyrille.


SONEINT commented 10 years ago

Antoine,

I have just sent a gzipped extract of a JSON export of a small data set to your email, antoine@mongodb.com. Later in the day I will send you gzipped extracts of the other export formats (BSON might be interesting) for the same data set.

Thanks for everything.

Cyrille.

SONEINT commented 10 years ago

Hmm, the files are too big for email. I have just sent you an invitation to a shared Dropbox folder so that you can download them.

Cyrille.

agirbal commented 10 years ago

thanks for the upload, I will check it out soon

agirbal commented 10 years ago

I am not able to reproduce :( What I did:

Do you have an id of a document that is getting messed up? Also can you please make sure you're using umongo 1.6.2

agirbal commented 10 years ago

OK, I tried with UMongo 1.4.3 and the output was messed up. It looks like a bug in either UMongo or the Java driver that has since been fixed. Can you confirm?

agirbal commented 10 years ago

Closing unless you say the problem is still there, thx!