Closed fgreg closed 7 years ago
In order to avoid the problem from issue https://github.com/mudrod/mudrod/issues/167 I just moved the ssl_* files out of the data directory before running full ingest.
I get a similar error when trying to process 02/2017 logs
2017-05-09 22:56:30,414 INFO discoveryengine.WeblogDiscoveryEngine (WeblogDiscoveryEngine.java:<init>(51)) - Started Mudrod Weblog Discovery Engine.
2017-05-09 22:56:30,414 INFO discoveryengine.WeblogDiscoveryEngine (WeblogDiscoveryEngine.java:preprocess(98)) - Starting Web log preprocessing.
2017-05-09 22:56:30,415 INFO discoveryengine.WeblogDiscoveryEngine (WeblogDiscoveryEngine.java:preprocess(106)) - Processing logs dated 201702.gz
2017-05-09 22:56:30,563 INFO pre.ImportLogFile (ImportLogFile.java:execute(74)) - Starting Log Import 201702.gz
2017-05-09 22:56:31,556 INFO util.Version (Version.java:logVersion(108)) - Elasticsearch Hadoop v5.2.0 [d85a257f9f]
2017-05-09 22:56:31,890 INFO rdd.EsRDDWriter (RestService.java:createWriter(562)) - Writing to [podaaclog201702.gz/rawhttp]
2017-05-09 23:00:07,458 INFO rdd.EsRDDWriter (RestService.java:createWriter(562)) - Writing to [podaaclog201702.gz/rawftp]
2017-05-09 23:20:56,001 INFO pre.ImportLogFile (ImportLogFile.java:execute(78)) - Log Import complete. Time elapsed 1465 seconds
2017-05-09 23:20:56,007 INFO pre.CrawlerDetection (CrawlerDetection.java:execute(93)) - Starting Crawler detection rawhttp.
2017-05-09 23:20:58,531 INFO pre.CrawlerDetection (CrawlerDetection.java:checkByRateInParallel(159)) - Original User count: 6658
2017-05-09 23:22:14,913 INFO pre.CrawlerDetection (CrawlerDetection.java:checkByRateInParallel(177)) - User count: 6610
2017-05-09 23:22:14,914 INFO pre.CrawlerDetection (CrawlerDetection.java:execute(102)) - Crawler detection complete. Time elapsed 78 seconds
2017-05-09 23:22:14,918 INFO pre.SessionGenerator (SessionGenerator.java:execute(66)) - Starting Session Generation.
2017-05-09 23:24:17,775 INFO pre.SessionGenerator (SessionGenerator.java:genSessionByRefererInParallel(224)) - Initial Session count: 1082274
2017-05-09 23:38:25,991 INFO pre.SessionGenerator (SessionGenerator.java:execute(71)) - Session generating complete. Time elapsed 971 seconds.
2017-05-09 23:38:25,995 INFO pre.SessionStatistic (SessionStatistic.java:execute(66)) - Starting Session Summarization.
2017-05-09 23:39:50,255 INFO pre.SessionStatistic (SessionStatistic.java:processSessionInParallel(172)) - Final Session count: 583
2017-05-09 23:39:50,257 INFO pre.SessionStatistic (SessionStatistic.java:execute(79)) - Session Summarization complete. Time elapsed 84 seconds.
2017-05-09 23:39:50,259 INFO pre.RemoveRawLog (RemoveRawLog.java:execute(41)) - Starting raw log removal.
2017-05-09 23:53:10,067 INFO pre.RemoveRawLog (RemoveRawLog.java:execute(47)) - Raw log removal complete. Time elapsed 799 seconds.
2017-05-09 23:53:10,067 INFO discoveryengine.WeblogDiscoveryEngine (WeblogDiscoveryEngine.java:preprocess(125)) - Web log preprocessing for logs dated 201702.gz complete. Time elapsed 3399 seconds.
2017-05-09 23:53:10,075 INFO pre.HistoryGenerator (HistoryGenerator.java:execute(47)) - Starting HistoryGenerator...
/usr/local/sdeploy/mudrod-ingest/ftp-mirror/2017/02/UserHistoryMatrix.csv
podaaclog
2017-05-09 23:53:10,228 INFO pre.HistoryGenerator (HistoryGenerator.java:execute(53)) - HistoryGenerator complete. Time elapsed 0 seconds
2017-05-09 23:53:10,229 INFO pre.ClickStreamGenerator (ClickStreamGenerator.java:execute(48)) - Starting ClickStreamGenerator...
podaaclog
click stream number:861
2017-05-09 23:53:12,269 INFO pre.ClickStreamGenerator (ClickStreamGenerator.java:execute(65)) - ClickStreamGenerator complete. Time elapsed 2 seconds.
2017-05-09 23:53:12,270 INFO discoveryengine.WeblogDiscoveryEngine (WeblogDiscoveryEngine.java:preprocess(134)) - Web log preprocessing (user history and clickstream) complete.
2017-05-09 23:53:12,270 INFO discoveryengine.WeblogDiscoveryEngine (WeblogDiscoveryEngine.java:process(182)) - Starting Web log processing.
2017-05-09 23:53:12,271 INFO process.ClickStreamAnalyzer (ClickStreamAnalyzer.java:execute(49)) - Starting ClickStreamAnalyzer...
2017-05-09 23:53:13,113 WARN distributed.RowMatrix (Logging.scala:logWarning(66)) - The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.
2017-05-09 23:53:13,199 WARN netlib.ARPACK (ARPACK.java:<clinit>(61)) - Failed to load implementation from: com.github.fommil.netlib.NativeSystemARPACK
2017-05-09 23:53:13,285 INFO jni.JniLoader (JniLoader.java:liberalLoad(90)) - successfully loaded /tmp/jniloader2693392545934265752netlib-native_ref-linux-x86_64.so
2017-05-09 23:53:15,222 WARN distributed.RowMatrix (Logging.scala:logWarning(66)) - The input data was not directly cached, which may hurt performance if its parent RDDs are also uncached.
2017-05-09 23:53:17,278 INFO process.ClickStreamAnalyzer (ClickStreamAnalyzer.java:execute(66)) - ClickStreamAnalyzer complete. Time elapsed: 5s
2017-05-09 23:53:17,280 INFO process.UserHistoryAnalyzer (UserHistoryAnalyzer.java:execute(47)) - Starting UserHistoryAnalyzer...
usage: MudrodEngine: 'dataDir' argument is mandatory. User must also
provide an ingest method. -dataDir </path/to/data/directory>
[-esHost <host_name>] [-esPort <port_num>] [-esTCPPort <port_num>]
[-f] [-h] [-l] [-m] [-p]
-dataDir,--dataDirectory </path/to/data/directory> the data
directory to be
processed by
Mudrod
-esHost,--elasticSearchHost <host_name> elasticsearch
cluster unicast
host
-esPort,--elasticSearchHTTPPort <port_num> elasticsearch
HTTP/REST port
-esTCPPort,--elasticSearchTransportTCPPort <port_num> elasticsearch
transport TCP
port
-f,--fullIngest begin full ingest
Mudrod workflow
-h,--help show this help
message
-l,--logIngest begin log ingest
-m,--metaIngest begin metadata
ingest
-p,--processingWithPreResults begin processing
with
preprocessing
results
2017-05-09 23:53:17,344 ERROR main.MudrodEngine (MudrodEngine.java:main(398)) - Error whilst parsing command line.
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1027)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1027)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1027)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
at org.apache.spark.mllib.linalg.distributed.CoordinateMatrix.computeSize(CoordinateMatrix.scala:151)
at org.apache.spark.mllib.linalg.distributed.CoordinateMatrix.numCols(CoordinateMatrix.scala:58)
at org.apache.spark.mllib.linalg.distributed.CoordinateMatrix.transpose(CoordinateMatrix.scala:75)
at gov.nasa.jpl.mudrod.utils.MatrixUtil.transposeMatrix(MatrixUtil.java:411)
at gov.nasa.jpl.mudrod.utils.SimilarityUtil.calculateSimilarityFromVector(SimilarityUtil.java:63)
at gov.nasa.jpl.mudrod.semantics.SemanticAnalyzer.calTermSimfromMatrix(SemanticAnalyzer.java:66)
at gov.nasa.jpl.mudrod.semantics.SemanticAnalyzer.calTermSimfromMatrix(SemanticAnalyzer.java:60)
at gov.nasa.jpl.mudrod.weblog.process.UserHistoryAnalyzer.execute(UserHistoryAnalyzer.java:51)
at gov.nasa.jpl.mudrod.discoveryengine.WeblogDiscoveryEngine.process(WeblogDiscoveryEngine.java:189)
at gov.nasa.jpl.mudrod.main.MudrodEngine.startFullIngest(MudrodEngine.java:242)
at gov.nasa.jpl.mudrod.main.MudrodEngine.main(MudrodEngine.java:388)
@fgreg Could you please send the HTTP and FTP logs of 201703 to me or attached them to the comment? I can only download ssl_WWW.201703.gz from the link ftp://podaac.jpl.nasa.gov/misc/outgoing/cjf/mudrod/2017/03
@quintinali you can use the 02/2017 logs to reproduce this problem as well. If we get the 'ssl_' prefix sorted I'm sure the ssl logs will exhibit the same behavior.
The ssl logs are the ones we should be ingesting. If you need to test without fixing the prefix problem https://github.com/mudrod/mudrod/issues/167 , just change the name and remove 'ssl'
Fixed by PR https://github.com/mudrod/mudrod/pull/171
Ingesting 03/2017 logs does not work. Produces an empty
UserHistoryMatrix.csv
and does not createClickstreamMatrix.csv
: