RevolutionAnalytics / rhdfs

A package that allows R developers to use Hadoop HDFS
cannot load all data from huge csv file on hdfs #8

Open strategist922 opened 10 years ago

strategist922 commented 10 years ago

Hi, I have many huge CSV files (more than 20 GB) on my Hortonworks HDP GA cluster. I use the following code to read a file from HDFS:

    Sys.setenv(HADOOP_CMD="/usr/lib/hadoop/bin/hadoop")
    Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming-")
    Sys.setenv(HADOOP_COMMON_LIB_NATIVE_DIR="/usr/lib/hadoop/lib/native/")
    library(rmr2);
    library(rhdfs);
    library(lubridate);
    hdfs.init();
    f = hdfs.file("/etl/rawdata/201202.csv","r",buffersize=104857600);
    m =;
    c = rawToChar(m);
    data = read.table(textConnection(c), sep = ",");

When I use dim(data) to verify, it shows the following: [1] 1523 7

But the row count should actually be "134279407" instead of "1523".
I found that the value of m shown in RStudio is "raw [1:131072] 50 72 69 49 ...", so only 131072 (2^17) bytes were read, even though buffersize was set to 104857600. There is a related thread on the hadoop-hdfs-user mailing list ("why can only read 2^17 bytes in hadoop2.0?").
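If each read returns at most 2^17 bytes, one possible workaround might be to keep reading until the stream is exhausted and concatenate the pieces. The sketch below only illustrates that loop. `read_all` and `short_read` are hypothetical names, and a base R raw connection stands in for the HDFS file so the example runs without a cluster. With rhdfs, `read_chunk` could presumably wrap `hdfs.read(f)`, assuming it returns NULL or an empty raw vector at end of file. That is an assumption, not something verified here.

```r
# Hypothetical helper: call read_chunk() until it signals end of data.
# read_chunk must return a raw vector, or NULL / a zero-length raw vector at EOF.
read_all <- function(read_chunk) {
  chunks <- list()
  repeat {
    m <- read_chunk()
    if (is.null(m) || length(m) == 0L) break
    chunks[[length(chunks) + 1L]] <- m
  }
  # Concatenate all raw chunks into a single raw vector
  do.call(c, chunks)
}

# Stand-in for the HDFS stream: returns at most 8 bytes per call,
# mimicking a reader that hands back less than the requested buffer size.
payload <- charToRaw("1,2,3\n4,5,6\n")
con <- rawConnection(payload)
short_read <- function() readBin(con, what = "raw", n = 8L)

bytes <- read_all(short_read)
close(con)

# Parse the reassembled bytes as CSV, as in the original snippet
data <- read.table(text = rawToChar(bytes), sep = ",")
dim(data)
```

For a file of more than 20 GB, collecting everything into one raw vector and one string is unlikely to work, since a single R string is limited to 2^31 - 1 bytes. Processing the data chunk by chunk, or in a MapReduce job, would probably be needed instead.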

Is this a bug in rhdfs-1.0.8?

Best Regards, James Chang