amplab / training

Training materials for Strata, AMP Camp, etc

"Data exploration with Spark" refers to console log output which I don't see #195

Open gostevehoward opened 8 years ago

gostevehoward commented 8 years ago

http://www.cs.berkeley.edu/~jey/ampcamp6/training/data-exploration-using-spark.html

e.g.,

"If you look closely at the terminal, the console log is pretty chatty and tells you the progress of the tasks."

"If you examine the console log closely, you will see lines like this, indicating some data was added to the cache"

But my console appears to be suppressing the detailed log output:

17:46 steve@fisher:~/work/ampcamp6/ampcamp6$ spark/bin/pyspark 
Python 2.7.9 (default, Apr  2 2015, 15:33:21) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/steve/work/ampcamp6/ampcamp6/spark/lib/ampcamp-keystoneml.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/steve/work/ampcamp6/ampcamp6/spark/lib/spark-assembly-1.5.1-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Python version 2.7.9 (default, Apr  2 2015 15:33:21)
SparkContext available as sc, HiveContext available as sqlContext.
>>> sc
<pyspark.context.SparkContext object at 0x7efd7d5eac10>
>>> pagecounts = sc.textFile('data/pagecounts')
>>> pagecounts
MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2
>>> pagecounts.take(10)
[u'20090507-040000 aa Main_Page 7 51309', u'20090507-040000 ab %D0%90%D0%B8%D0%BD%D1%82%D0%B5%D1%80%D0%BD%D0%B5%D1%82 1 34069', u'20090507-040000 ab %D0%98%D1%85%D0%B0%D0%B4%D0%BE%D1%83_%D0%B0%D0%B4%D0%B0%D2%9F%D1%8C%D0%B0 3 65763', u'20090507-040000 af.b Tuisblad 1 36231', u'20090507-040000 af.d Tuisblad 1 58960', u'20090507-040000 af.q Tuisblad 1 44265', u'20090507-040000 af Afrikaans 3 80838', u'20090507-040000 af Australi%C3%AB 1 132433', u'20090507-040000 af Ensiklopedie 2 60584', u'20090507-040000 af Internet 1 48816']
>>> print '\n'.join(pagecounts.take(10))
20090507-040000 aa Main_Page 7 51309
20090507-040000 ab %D0%90%D0%B8%D0%BD%D1%82%D0%B5%D1%80%D0%BD%D0%B5%D1%82 1 34069
20090507-040000 ab %D0%98%D1%85%D0%B0%D0%B4%D0%BE%D1%83_%D0%B0%D0%B4%D0%B0%D2%9F%D1%8C%D0%B0 3 65763
20090507-040000 af.b Tuisblad 1 36231
20090507-040000 af.d Tuisblad 1 58960
20090507-040000 af.q Tuisblad 1 44265
20090507-040000 af Afrikaans 3 80838
20090507-040000 af Australi%C3%AB 1 132433
20090507-040000 af Ensiklopedie 2 60584
20090507-040000 af Internet 1 48816
>>> pagecounts.count()
1398882                                                                         
>>> enPages = pagecounts.filter(lambda x: x.split(' ')[1] == 'en').cache()
>>> enPages.count()
970545                                                                          
>>> enPages.count()
970545
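For anyone hitting the same thing: in Spark 1.5.x the shell's console verbosity is driven by `spark/conf/log4j.properties` (usually copied from `log4j.properties.template`). If the AMP Camp bundle ships one with the root logger set to WARN, the task-progress and "added to cache" lines the tutorial quotes won't be printed. A sketch of the relevant setting, assuming the stock template layout:

```properties
# spark/conf/log4j.properties (assuming the stock template layout)
# WARN hides the per-task progress and "added to cache" messages;
# INFO restores them.
log4j.rootCategory=INFO, console
```

Alternatively, from an already-running shell, `sc.setLogLevel('INFO')` (available since Spark 1.4) changes the level without restarting.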
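As a side note on the session itself, the `enPages` filter keeps records whose second whitespace-separated field is the language code. A minimal plain-Python sketch of that predicate, using sample lines from the `take(10)` output above plus one hypothetical `en` record for illustration:

```python
# Sample pagecount records; the first two are from the take(10)
# output above, the third is a hypothetical 'en' record added
# purely so the filter has something to match.
lines = [
    '20090507-040000 aa Main_Page 7 51309',
    '20090507-040000 af Afrikaans 3 80838',
    '20090507-040000 en Main_Page 242332 4737756101',  # hypothetical
]

def is_english(record):
    """True when the project/language field (column 2) is exactly 'en'."""
    return record.split(' ')[1] == 'en'

# Same logic as pagecounts.filter(lambda x: x.split(' ')[1] == 'en'),
# minus the RDD machinery and caching.
en_lines = [r for r in lines if is_english(r)]
print(en_lines)
```

The repeated `enPages.count()` at the end of the session illustrates why the `.cache()` matters: the second count reads the filtered partitions from memory instead of rescanning the input files.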