dselivanov / rmongodb

R driver for MongoDB

rmongodb : Data types not preserved in R dataframe when MongoDB documents' fields take character values #76

Closed: ajayram198 closed this issue 9 years ago

ajayram198 commented 9 years ago

When we import data into R from MongoDB using the mongo.find.all or mongo.find.batch function of the rmongodb package, the original data types of fields defined in MongoDB are not preserved if a field takes a character value (e.g. "null") in one or more documents. After importing a MongoDB collection containing such fields and converting it into an R data frame, R treats these fields as character variables instead of keeping their original MongoDB data types. To preserve the original data types, we first have to replace the "null" values with NAs. How can we replace these "null" values with NAs, so that the original data types from MongoDB are preserved, while importing the MongoDB collection itself?

This type of feature is already available in R when we import data from CSV files. We just need to pass the na.strings = "null" argument to the read.csv function, as follows.

 data <- read.csv(file.choose(), sep = ",", header = TRUE, na.strings = "null")

Even though particular variables contain "null" values in the Excel sheet, the call above replaces them with NA and infers the appropriate data type for each column.
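To make the read.csv behaviour concrete, here is a minimal sketch that reads the same data with and without na.strings. The CSV content and the column names (id, price, qty) are made up for illustration, and text = is used instead of file.choose() so the example runs without a file.

```r
# Hypothetical CSV content; the literal string "null" marks missing values.
csv_text <- "id,price,qty
1,null,3
2,10.5,null
3,7.25,4"

# Without na.strings, any column containing "null" is read as character.
raw <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# With na.strings = "null", those cells become NA and the columns keep a numeric type.
typed <- read.csv(text = csv_text, header = TRUE, na.strings = "null")

sapply(raw, class)
sapply(typed, class)
```

Comparing the two sapply() results shows which columns changed class once "null" was treated as missing.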

To illustrate this problem, we will consider a sample collection with 5 fields which take null values in one or more documents. Let's import the data into R using the mongo.find.all function and convert it into an R data frame. Following is the screenshot of the R data frame for the sample collection.

[Screenshot: sampledocfieldsnull]

Looking at the values in this data frame, all columns appear to be numeric, with a null value in the first document. But if we check the classes of the individual fields, they are reported as character, even though the data type was defined as Double in MongoDB. Following is the screenshot for the same:

[Screenshot: sampledocfieldsnull]

Ideally, this field should have been numeric after importing. So we see that when a field has null values, its original data type is lost and the whole column is converted to character.
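The same effect can be reproduced in plain R without MongoDB. This is only a sketch with made-up values, not the actual collection from the screenshots: a field that is a double in most documents but the string "null" in one becomes character as soon as the values are flattened into a single vector.

```r
# Hypothetical values of one field across three documents.
values <- list("null", 12.5, 7)

# Flattening mixed types coerces every element to character.
flat <- unlist(values)
class(flat)

# Mapping "null" to NA_real_ before flattening keeps the vector numeric.
fixed <- vapply(
  values,
  function(v) if (identical(v, "null")) NA_real_ else as.numeric(v),
  numeric(1)
)
class(fixed)
```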

Looking forward to an early response.

dselivanov commented 9 years ago

Hi, Ajay. Of course, when you construct a data.frame and you have different types in the same field, mongo.find.all coerces them to the highest type. This is correct behaviour, and mongo.find.all does the best it can.
This is definitely not an rmongodb bug. This is a problem of how you store your data. MongoDB provides great flexibility, so you can keep different objects (with different types!) in a field with the same name. But this flexibility also requires correct data handling. So if you keep data in such a messy way, you should:

  1. get list using mongo.cursor.to.list
  2. replace your "null"s with NA_real_ (for example using lapply)
  3. construct data.frame from list manually.
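The three steps above could look roughly like the sketch below. It does not connect to MongoDB: docs is a hand-written stand-in for what mongo.cursor.to.list would return, and the field names (a, b, c) and the helper null_to_na are hypothetical. It also assumes every document has the same scalar fields; nested or missing fields would need extra handling.

```r
# Step 1 (stand-in): one named list per document, as a cursor-to-list call might give.
docs <- list(
  list(a = "null", b = "null", c = 3.5),
  list(a = 1.2,    b = 4,      c = "null"),
  list(a = 2.4,    b = 5.5,    c = 6)
)

# Step 2: replace the string "null" with NA_real_ in every field of a document.
null_to_na <- function(doc) {
  lapply(doc, function(v) if (identical(v, "null")) NA_real_ else v)
}
clean <- lapply(docs, null_to_na)

# Step 3: build the data frame manually, one row per document.
rows <- lapply(clean, function(doc) as.data.frame(doc, stringsAsFactors = FALSE))
df <- do.call(rbind, rows)

sapply(df, class)
```

Because the "null" strings are gone before the rows are combined, nothing forces the columns to character.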

Please see the source code of mongo.find.all.