Running through the steps in GEQE/geqe-ml/README, everything seems to work fine until I get to findSimilarPlaces.py, which fails with this error:
Traceback (most recent call last):
File "/home/haley/dev/projects/GEQE/geqe-ml/findSimilarPlaces.py", line 219, in
strStop=strStop)
File "/home/haley/dev/projects/GEQE/geqe-ml/findSimilarPlaces.py", line 137, in run
model_Tree = RandomForest.trainRegressor(mlTrain.map(lambda x: x[1][0]), categoricalFeaturesInfo={}, numTrees=100, featureSubsetStrategy="auto", impurity="variance", maxDepth=4, maxBins=32)
File "/home/haley/dev/apps/spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 412, in trainRegressor
File "/home/haley/dev/apps/spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 262, in _train
File "/home/haley/dev/apps/spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py", line 1320, in first
ValueError: RDD is empty
I noticed in the log that no tweets are being found in the ROI, and that is what's causing the error.
From the log:
Time to find in and out of ROI 3.46856999397
N in: 0 , N out: 13
I have 70k tweets in the Cleveland area. Is my dataset too small? I'm streaming tweets with locations=[-81.7108,41.4458,-81.4328,41.5312]... Is that too small an area?
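One thing I checked while debugging: whether my coordinates actually land inside that bounding box. The Twitter streaming locations filter expects [lon_min, lat_min, lon_max, lat_max], and a swapped lat/lon ordering somewhere in the pipeline would explain "N in: 0" even with plenty of local tweets. Here's a minimal, self-contained sanity check (in_bbox is my own hypothetical helper, not part of GEQE):

```python
def in_bbox(lon, lat, bbox=(-81.7108, 41.4458, -81.4328, 41.5312)):
    """Return True if (lon, lat) falls inside bbox = (lon_min, lat_min, lon_max, lat_max)."""
    lon_min, lat_min, lon_max, lat_max = bbox
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max

# A point near downtown Cleveland (~-81.69 E, 41.50 N) should be inside:
print(in_bbox(-81.69, 41.50))   # True

# The same point with lon/lat swapped (a common bug) falls outside:
print(in_bbox(41.50, -81.69))   # False
```

Running a few sample tweet coordinates from the dataset through a check like this would confirm whether the ROI polygon and the tweet coordinates agree on axis order.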