dawncc / TensorFlowTest


Spark+Hadoop #7

Open dawncc opened 7 years ago

dawncc commented 7 years ago

Install Hadoop

http://www.powerxing.com/install-hadoop/
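
The tutorial above walks through a single-node, pseudo-distributed setup. One step worth calling out: after editing core-site.xml and hdfs-site.xml, the NameNode has to be formatted once before the first start (a sketch assuming the /usr/local/hadoop install path used elsewhere in this thread):

    cd /usr/local/hadoop
    ./bin/hdfs namenode -format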

dawncc commented 7 years ago

Start Hadoop

/usr/local/hadoop# ./sbin/start-dfs.sh

Web UI:

http://120.24.38.209:50070/
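
To confirm the daemons actually came up, jps should list the HDFS processes (a quick sanity check; PIDs and the exact process set depend on the configuration):

    jps
    # e.g. (PIDs will differ):
    # 12321 NameNode
    # 12453 DataNode
    # 12655 SecondaryNameNode
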
dawncc commented 7 years ago

Start Spark

./sbin/start-master.sh

http://120.24.38.209:8080/
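
Port 8080 is the standalone master's web UI. For jobs to run on this standalone cluster, at least one worker also has to be attached to the master (a sketch; in Spark releases of that era the helper script was start-slave.sh, and 7077 is the master's default RPC port, shown on the 8080 page):

    ./sbin/start-slave.sh spark://120.24.38.209:7077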

dawncc commented 7 years ago

Reading HDFS data from Spark

1. Upload files to HDFS (command reference below; a concrete example follows the list)

         hadoop fs -ls /                                      list the HDFS root directory
         hadoop fs -ls -R /                                   list the HDFS root recursively (-lsr is the old, deprecated spelling)
         hadoop fs -mkdir /d1                                 create directory d1 on HDFS
         hadoop fs -put <linux source> <hdfs destination>     upload data from the local filesystem to an HDFS path
         hadoop fs -get <hdfs source> <linux destination>     download data from HDFS to a local path
         hadoop fs -text <hdfs file>                          print an HDFS file to stdout
         hadoop fs -rm <hdfs file>                            delete a file on HDFS
         hadoop fs -rm -r <hdfs dir>                          delete a directory on HDFS (-rmr is the old, deprecated spelling)
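
For example, to create input matching the test* glob the script in step 2 reads (the exact file is an assumption, but the word counts in step 4 clearly come from Hadoop's own XML config files):

    hadoop fs -mkdir -p /user/hadoop
    hadoop fs -put /usr/local/hadoop/etc/hadoop/capacity-scheduler.xml /user/hadoop/test1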

2. Write the PySpark script (saved as wordcount.py; step 3 submits it with spark-submit)

from pyspark import SparkContext

inputFile = 'hdfs://localhost:9000/user/hadoop/test*'        # input files (glob)
outputFile = 'hdfs://localhost:9000/user/hadoop/spark-out'   # output directory (must not exist yet)

# First argument is the master URL, second the application name.
sc = SparkContext('local', 'dfs[a-z.]+')
text_file = sc.textFile(inputFile)

# Classic word count: split each line on spaces, emit (word, 1), then sum per word.
counts = text_file.flatMap(lambda line: line.split(' ')) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile(outputFile)

3. Run the script

$SPARK_HOME/bin/spark-submit wordcount.py  
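
Note that because the script hardcodes 'local' as the master, this submission runs locally and ignores the standalone master started above (a master set in code takes precedence over the command line). To run on the standalone cluster instead, one option is to drop the master argument from SparkContext and pass it to spark-submit:

    $SPARK_HOME/bin/spark-submit --master spark://120.24.38.209:7077 wordcount.py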

4. Check the results

bin/hdfs dfs -cat /user/hadoop/spark-out/*  
('', 2505)
('of', 66)
('an', 8)
('CONDITIONS', 8)
('limitations', 8)
('accompanying', 4)
('file.', 7)
('<name>yarn.scheduler.capacity.maximum-applications</name>', 1)
('ResourceCalculator', 1)
...
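
The ('', 2505) entry at the top is an artifact of splitting on single spaces: runs of consecutive spaces produce empty tokens. A small tweak to the script drops them (a sketch, not something the thread itself does):

    counts = text_file.flatMap(lambda line: line.split(' ')) \
                      .filter(lambda word: word) \
                      .map(lambda word: (word, 1)) \
                      .reduceByKey(lambda a, b: a + b)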