RevolutionAnalytics / rhdfs

A package that allows R developers to use Hadoop HDFS

hdfs.read() cannot load all data from huge csv file on hdfs #8

Open strategist922 opened 10 years ago

strategist922 commented 10 years ago

Hi, I have many huge CSV files (more than 20 GB each) on my Hortonworks HDP 2.0.6.0 GA cluster, and I use the following code to read a file from HDFS:


Sys.setenv(HADOOP_CMD = "/usr/lib/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar")
Sys.setenv(HADOOP_COMMON_LIB_NATIVE_DIR = "/usr/lib/hadoop/lib/native/")
library(rmr2)
library(rhdfs)
library(lubridate)
hdfs.init()
f = hdfs.file("/etl/rawdata/201202.csv", "r", buffersize = 104857600)
m = hdfs.read(f)
c = rawToChar(m)
data = read.table(textConnection(c), sep = ",")


When I use dim(data) to verify the result, it shows the following:

[1] 1523 7

But the row count should actually be 134279407, not 1523.
I found that the value of m shown in RStudio is "raw [1:131072] 50 72 69 49 ...", and there is a thread in the hadoop-hdfs-user mailing list ("why can FSDataInputStream.read() only read 2^17 bytes in hadoop2.0?"). Ref. http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201403.mbox/%3CCAGkDawm2ivCB+rNaMi1CvqpuWbQ6hWeb06YAkPmnOx=8PqbNGQ@mail.gmail.com%3E

Is this a bug in hdfs.read() in rhdfs 1.0.8?
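
For reference, a minimal sketch of a possible workaround would be to call hdfs.read() in a loop and concatenate the chunks it returns. This is only a sketch, and it assumes hdfs.read() returns NULL (or a zero-length raw vector) once the end of the file is reached:

# Workaround sketch: hdfs.read() seems to return at most 131072 bytes per call,
# so keep calling it and collect the chunks until it stops returning data.
Sys.setenv(HADOOP_CMD = "/usr/lib/hadoop/bin/hadoop")
library(rhdfs)
hdfs.init()

f <- hdfs.file("/etl/rawdata/201202.csv", "r", buffersize = 104857600)
chunks <- list()
repeat {
  m <- hdfs.read(f)                        # one ~2^17-byte chunk per call
  if (is.null(m) || length(m) == 0) break  # assumed end-of-file condition
  chunks[[length(chunks) + 1]] <- m
}
hdfs.close(f)

# Concatenate the raw chunks and parse them as before
data <- read.table(textConnection(rawToChar(do.call(c, chunks))), sep = ",")
dim(data)

Of course, rawToChar() on 20 GB of raw data is not practical in memory, so the chunks would likely need to be parsed or written out incrementally, but the loop at least shows that repeated reads can get past the first 131072 bytes.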

Best Regards, James Chang