Sorry for the delay on this; I'm looking into it.
It looks like the shell doesn't like the Arabic UTF-8 text; it screws up the prompt... I'll try to insert from UMongo. Otherwise, any chance you could send me the BSON docs that fail?
It looks like I am able to insert and export the following documents fine. The text should really be specified in UTF-8, otherwise bad things will happen, both in the MongoDB drivers and in the shell. UMongo mostly deals with Java strings (~UTF-16) as passed by the driver, so there is not much to fix there, as long as the driver can (de)serialize them. On import / export to JSON / BSON, it is assumed that the strings are UTF-8.
{ "_id" : { "$oid" : "538ba9a9add22126d4c5bc50"} , "str" : "تشكيل نهاية حدة إذ. قد وإيطالي المتساقطة، دون, مع بزوال بينما وفي. مع دحر التخطيط والروسية, للجزر أوكيناوا وبريطانيا من قصف. جوي عل وأكثرها بريطانيا-فرنسا."}
{ "_id" : { "$oid" : "538bab2b3004255058e8d6f5"} , "str" : "أنا قادر على أكل الزجاج و هذا لا يؤلمني."}
Maybe you could send me documents that are broken? Thanks.
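To illustrate the point above about encodings, here is a minimal Python sketch (not UMongo or driver code). It shows that the same text is stored as different bytes in UTF-8 and UTF-16, so bytes written in one encoding and read as the other do not give the original text back. The choice of `utf-16-le` as a stand-in for Java's in-memory strings is an assumption made for illustration.

```python
# One of the Arabic test strings from the documents above.
text = "أنا قادر على أكل الزجاج و هذا لا يؤلمني."

utf8_bytes = text.encode("utf-8")       # what JSON / BSON import and export expect
utf16_bytes = text.encode("utf-16-le")  # stand-in for a Java string in memory (assumption)

# Decoding with the same encoding that produced the bytes round-trips.
same_utf8 = utf8_bytes.decode("utf-8") == text
same_utf16 = utf16_bytes.decode("utf-16-le") == text

# Reading UTF-16 bytes as if they were UTF-8 does not give the text back.
misread = utf16_bytes.decode("utf-8", errors="replace")

print(same_utf8, same_utf16, misread == text)
```

The practical consequence is that every step in the chain (collector, driver, export, reader) has to agree on UTF-8 for the files on disk.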
Hi,
I have tested with several data sets, with the same results. I chose UMongo for big data sets, on which rmongodb did not give me very good results (probably because they are too big for my PC, which has only 16 GB of RAM: extraction in R stopped after about 40000 rows, and not always at the same point). But rmongodb worked properly with relatively small data sets containing Arabic UTF-8 text, and gave me the opportunity to analyze the data in R.
I am away from home for a couple of days. I will have a look on Tuesday evening to see if I can share a representative data set with you, perhaps through my GitHub account.
Cyrille
Skype: clocloauboulot
PS: I am working towards a PhD in social network analysis and terrorism studies.
OK. I use a Java program to collect data from the Twitter API. I assume that if rmongodb handles it properly on small data sets, it should be fine in UMongo? Where can I check the format of the tweets in UMongo? I can read the Arabic terms in all the UMongo windows.
Cyrille
Cyrille, I'm a bit confused about what is broken. Let's forget rmongo for a minute and only consider MongoDB and UMongo. From what you are saying:
- you collect data from the Twitter API with a Java app and write it to MongoDB. I'm assuming all strings are handled as UTF
- from UMongo, you do a find(); can you see the characters properly in the window?
- from UMongo, you do exports; do those look correct? If not, which one is broken?
Thanks.
PS: sounds interesting :)
You are right. Sorry for my poor English, I need to improve :-). Anyway:
- The data is readable in UMongo and the find command works properly, finding Arabic terms in UTF-8 format.
- I'll check one more time today with the exports I made a few weeks ago. I have changed R version, and learned new R tricks to clean data sets in R with different packages.
- If unsuccessful, I will send you a sample tomorrow night.
Thanks for everything.
Cyrille.
Hi,
Sorry, I have just come back home. I can send you a small data set to test on your side. How can I do that? (I have exported small data sets of around 75 MB each, which is too large for email.) On my side, Arabic terms seem to be displayed correctly in UMongo in the Result panel after a find. When you select the text version in UMongo, the Arabic characters are replaced by ????. When I export to any format, on Windows or Mac OS X, analysis in R with different functions (data.table, etc.) gives me ??? in place of the Arabic terms.
Cyrille.
M. Cyrille Papon Thesis candidate - University of Toulon - thesis school n°509 - I3M & IRENav laboratories - FMES Projet MAR2SI - Modèle d'analyse du renseignement des réseaux sociaux de l'Internet AMISNI Project - Analysis model of Internet social networks intelligence Domain: Social Network Analysis for Intelligence in support of crisis management email: social.network.intel@me.com Skype: clocloauboulot Iphone: +33 (0)682 469 629
Could you just extract a few documents that don't work, instead of the full dump? This way it will be smaller. Or at least you can gzip it. Then please send it by email to antoine@mongodb.com
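One way to produce such a small, compressed sample is sketched below in Python. It assumes the export is a JSON file with one document per line; the file names and the `gzip_sample` helper are hypothetical. The file is copied in binary mode so that the bytes are not re-encoded, which matters when the encoding itself is what is being investigated.

```python
import gzip
import os
import tempfile


def gzip_sample(src_path: str, dst_path: str, max_docs: int = 5) -> int:
    """Copy the first max_docs lines of src_path, byte for byte, into a gzip file."""
    count = 0
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        for line in src:
            if count >= max_docs:
                break
            dst.write(line)
            count += 1
    return count


# Demo on a throwaway file standing in for a real export.
workdir = tempfile.mkdtemp()
export_path = os.path.join(workdir, "tweets.json")     # hypothetical name
sample_path = os.path.join(workdir, "sample.json.gz")  # hypothetical name
with open(export_path, "w", encoding="utf-8") as f:
    for i in range(20):
        f.write('{"n": %d, "str": "داعش"}\n' % i)

written = gzip_sample(export_path, sample_path, max_docs=5)
with gzip.open(sample_path, "rb") as f:
    sample_lines = f.read().splitlines()
print(written, len(sample_lines))
```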
Antoine,
I have checked several types of export files. Here I copy 5 tweets in different file types (BSON, JSON, CSV), with 2 kinds of search parameter on the Twitter API: an English "#Aleppo" search and an Arabic "#Aleppo" search. I have also added the format exported directly from MongoDB with the rmongodb package in the R software environment (I called it the RMONGODB type).
###################################
1 _id R…°∆6w@ñE°U∂user_name ced_lab retweet_count tweet_followers_count k source web tweet_mentioned_count tweet_ID qôƒ”tweet_text â The CHRONICLE EYE : Ahrar al-#Sham is clearly fighting #ISIS where its men storm some #Manbij buildings (#Aleppo)
MongoDB: version 4.2.2
Data mining program: Java program written in IntelliJ IDEA with the Twitter4J library
Aim: data mining Twitter, storage in MongoDB, analysis in R
Utility of UMongo: limitations of the rmongodb library in R with large data sets
I have a problem with the export function in UMongo for tweets in Arabic:
1/ The BSON format does not appear to be readable as text on Windows or Mac OS X; Arabic terms can be detected in R with text mining clustering algorithms; BSON files are not readable in R (data and metadata);
Ex: One tweet in English (OK) and one tweet in Arabic (KO here, but the text of the tweet is readable in R)
_id RÉÎM6w¡ÿZ‡ sleymen71 retweet_count tweet_followers_count ~ source P TweetCaster for Android tweet_mentioned_count tweet_ID H§ðÓtweet_text ? RT @FiratGunay: Do we all agree? #syria http://t.co/SYEhIb8lo4 O
_id RÉÎM6w¡ÿZ‡¡user_name InnssaannI retweet_count tweet_followers_count ñ source web tweet_mentioned_count tweet_ID `ˆâðÓtweet_text ¤ RT @fahadjabbar1: داعش وبيان Øقيقتهم Ø› كلمه موجزه ورائعه للشيخ عبدالعزيز الÙوزان عن http://t.co/DFFJktaVmw
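The garbled Arabic in the BSON dump above (for example `داعش`) has the shape you get when UTF-8 bytes are displayed as Windows-1252. That would suggest the bytes in the file are intact and only the viewer is decoding them wrongly. A minimal Python sketch of this, assuming Windows-1252 (`cp1252`) is the encoding the viewer used, which is not confirmed anywhere in this thread:

```python
original = "داعش"

# Simulate a viewer that shows UTF-8 bytes as Windows-1252.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# Reverse the mistake: go back to the raw bytes, then decode them as UTF-8.
recovered = garbled.encode("cp1252").decode("utf-8")
print(recovered == original)
```

This reversal does not work for every letter: a few byte values have no Windows-1252 character, which may be why some characters seem to be missing in words such as `الÙوزان` in the dump.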
2/ The JSON format appears to be readable, but Arabic terms are not intelligible as text on Windows or Mac OS X; Arabic terms cannot be detected in R with text mining clustering algorithms; JSON files are readable in R (except the Arabic text of the tweets);
Ex: One tweet in English (OK) and one tweet in Arabic (KO with ????)
{ "_id" : { "$oid" : "52c9ce4d36771ea1ff5a87a0"} , "user_name" : "sleymen71" , "retweet_count" : 5 , "tweet_followers_count" : 126 , "source" : "<a href=\"http://www.tweetcaster.com\" rel=\"nofollow\">TweetCaster for Android" , "tweet_mentioned_count" : 1 , "tweet_ID" : 419943196131856384 , "tweet_text" : "RT @FiratGunay: Do we all agree? #syria http://t.co/SYEhIb8lo4"}
{ "_id" : { "$oid" : "52c9ce4d36771ea1ff5a87a1"} , "user_name" : "InnssaannI" , "retweet_count" : 7 , "tweet_followers_count" : 753 , "source" : "web" , "tweet_mentioned_count" : 1 , "tweet_ID" : 419943192830959616 , "tweet_text" : "RT @fahadjabbar1: ???? ????? ??????? ? ???? ????? ?????? ????? ????????? ??????? ?? http://t.co/DFFJktaVmw"}
Ex (CSV): One tweet in English (OK) and one tweet in Arabic (KO with ????)
{ "$oid" : "52c9ce4d36771ea1ff5a87a0"},"sleymen71",5,126,"<a href=\"http://www.tweetcaster.com\" rel=\"nofollow\">TweetCaster for Android",1, null ,"RT @FiratGunay: Do we all agree? #syria http://t.co/SYEhIb8lo4"
{ "$oid" : "52c9ce4d36771ea1ff5a87a1"},"InnssaannI",7,753,"web",1, null ,"RT @fahadjabbar1: ???? ????? ??????? ? ???? ????? ?????? ????? ????????? ??????? ?? http://t.co/DFFJktaVmw"
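The `????` pattern in the JSON and CSV exports above is what a lossy conversion produces when text is encoded into a character set that has no Arabic letters: each unsupported character becomes `?`, while spaces and ASCII text survive. A minimal Python sketch of that effect follows. ASCII is used as a stand-in for whatever encoding the export actually wrote with (an assumption; the thread does not say which one), and the Arabic words are the first two of the tweet shown above.

```python
import json

doc = {"tweet_text": "RT @fahadjabbar1: داعش وبيان"}

# Lossy: ASCII has no Arabic letters, so each one is replaced by "?".
lossy = doc["tweet_text"].encode("ascii", errors="replace").decode("ascii")
print(lossy)  # RT @fahadjabbar1: ???? ?????

# Lossless: serialise as JSON and encode explicitly as UTF-8.
utf8_json = json.dumps(doc, ensure_ascii=False).encode("utf-8")
restored = json.loads(utf8_json.decode("utf-8"))
print(restored == doc)  # True
```

Unlike the garbled BSON text, the `?` characters carry no information about the original letters, so a file that already contains them cannot be repaired afterwards; the export has to be redone with a UTF-8 writer.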
Conclusion: the export is not consistent here, and I cannot analyse my data sets in R for data mining and machine learning.
Cyrille.