Sotera / GEQE

Geo Event Query by Example - Leverage geo-located temporal text data in order to identify similar locations or events.
http://sotera.github.io/GEQE
The Unlicense

findSimilarPlaces.py fails: no tweets in ROI #54

Closed haleystorm closed 8 years ago

haleystorm commented 8 years ago

Running through the steps in GEQE/geqe-ml/README, everything seems to work fine until I get to findSimilarPlaces.py, which fails with this error:

    Traceback (most recent call last):
      File "/home/haley/dev/projects/GEQE/geqe-ml/findSimilarPlaces.py", line 219, in <module>
        strStop=strStop)
      File "/home/haley/dev/projects/GEQE/geqe-ml/findSimilarPlaces.py", line 137, in run
        model_Tree = RandomForest.trainRegressor(mlTrain.map(lambda x: x[1][0]), categoricalFeaturesInfo={}, numTrees=100, featureSubsetStrategy="auto", impurity="variance", maxDepth=4, maxBins=32)
      File "/home/haley/dev/apps/spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 412, in trainRegressor
      File "/home/haley/dev/apps/spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 262, in _train
      File "/home/haley/dev/apps/spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py", line 1320, in first
    ValueError: RDD is empty

I noticed in the log that no tweets are being found in the ROI, and that is what causes the error.

From the log:

    ----------
    Time to find in and out of ROI 3.46856999397
    N in: 0 , N out: 13
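
The traceback shows `RandomForest.trainRegressor` calling `first()` on its input, so an empty training set (consistent with `N in: 0`) surfaces as `ValueError: RDD is empty`. As a minimal sketch of failing earlier with a clearer message, the plain-Python check below assumes points are `(lon, lat)` tuples and the ROI is a simple bounding box. The names `in_bbox`, `split_by_roi`, and `require_training_points` are hypothetical and not part of GEQE, whose ROI handling may differ.

```python
def in_bbox(point, bbox):
    """Return True if a (lon, lat) point lies inside bbox = (west, south, east, north)."""
    lon, lat = point
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


def split_by_roi(points, bbox):
    """Partition points into those inside and those outside the bounding box."""
    inside = [p for p in points if in_bbox(p, bbox)]
    outside = [p for p in points if not in_bbox(p, bbox)]
    return inside, outside


def require_training_points(inside, outside):
    """Raise a readable error before training if nothing fell inside the ROI."""
    if not inside:
        raise ValueError(
            "No points fall inside the ROI (N in: 0, N out: %d); "
            "check the ROI coordinates and their lon/lat order." % len(outside)
        )
    return inside
```

In a Spark job the same idea would be to check that the in-ROI RDD is non-empty before it is passed to the trainer.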

I have 70k tweets in the Cleveland area. Is my dataset too small? I'm streaming tweets with locations=[-81.7108,41.4458,-81.4328,41.5312]... Is that too small an area?
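
For a sense of scale, the extent of that bounding box can be estimated. This is a rough sketch, assuming the `locations` values are ordered west, south, east, north (longitude/latitude pairs, southwest corner first, as in the Twitter streaming API) and using an equirectangular approximation with a mean Earth radius of 6371 km. `bbox_size_km` is a hypothetical helper, not a GEQE function.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def bbox_size_km(west, south, east, north):
    """Approximate (width, height) of a lon/lat bounding box in kilometres."""
    mid_lat = math.radians((south + north) / 2.0)
    height = math.radians(north - south) * EARTH_RADIUS_KM
    # Lines of longitude converge toward the poles, so scale by cos(latitude).
    width = math.radians(east - west) * EARTH_RADIUS_KM * math.cos(mid_lat)
    return width, height


width, height = bbox_size_km(-81.7108, 41.4458, -81.4328, 41.5312)
print("%.1f km wide, %.1f km tall" % (width, height))
```

Under these assumptions the box comes out at roughly 23 km east to west and 9.5 km north to south. That describes only the streaming area; it does not by itself say whether the ROI overlaps the collected tweets.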

haleystorm commented 8 years ago

Closing this issue. It is a duplicate of issue #11.