aist-oceanworks / mudrod

Mining and Utilizing Dataset Relevancy from Oceanographic Datasets to Improve Data Discovery and Access, online demo: https://mudrod.jpl.nasa.gov/#/
https://mudrod.github.io/
15 stars 15 forks source link

Full Ingest 03/2017 logs fails #168

Closed fgreg closed 7 years ago

fgreg commented 7 years ago

Ingesting 03/2017 logs does not work. Produces an empty UserHistoryMatrix.csv and does not create ClickstreamMatrix.csv:

2017-05-09 21:43:38,665 INFO  discoveryengine.WeblogDiscoveryEngine (WeblogDiscoveryEngine.java:<init>(51)) - Started Mudrod Weblog Discovery Engine.
2017-05-09 21:43:38,665 INFO  discoveryengine.WeblogDiscoveryEngine (WeblogDiscoveryEngine.java:preprocess(98)) - Starting Web log preprocessing.
2017-05-09 21:43:38,666 INFO  discoveryengine.WeblogDiscoveryEngine (WeblogDiscoveryEngine.java:preprocess(106)) - Processing logs dated 201703.gz
2017-05-09 21:43:38,831 INFO  pre.ImportLogFile (ImportLogFile.java:execute(74)) - Starting Log Import 201703.gz
2017-05-09 21:43:39,821 INFO  util.Version (Version.java:logVersion(108)) - Elasticsearch Hadoop v5.2.0 [d85a257f9f]
2017-05-09 21:43:40,187 INFO  rdd.EsRDDWriter (RestService.java:createWriter(562)) - Writing to [podaaclog201703.gz/rawhttp]
2017-05-09 21:50:36,781 INFO  rdd.EsRDDWriter (RestService.java:createWriter(562)) - Writing to [podaaclog201703.gz/rawftp]
2017-05-09 22:01:56,765 INFO  pre.ImportLogFile (ImportLogFile.java:execute(78)) - Log Import complete. Time elapsed 1097 seconds
2017-05-09 22:01:56,771 INFO  pre.CrawlerDetection (CrawlerDetection.java:execute(93)) - Starting Crawler detection rawhttp.
2017-05-09 22:01:58,565 INFO  pre.CrawlerDetection (CrawlerDetection.java:checkByRateInParallel(159)) - Original User count: 5968
2017-05-09 22:04:14,407 INFO  pre.CrawlerDetection (CrawlerDetection.java:checkByRateInParallel(177)) - User count: 5922
2017-05-09 22:04:14,408 INFO  pre.CrawlerDetection (CrawlerDetection.java:execute(102)) - Crawler detection complete. Time elapsed 137 seconds
2017-05-09 22:04:14,413 INFO  pre.SessionGenerator (SessionGenerator.java:execute(66)) - Starting Session Generation.
2017-05-09 22:07:45,517 INFO  pre.SessionGenerator (SessionGenerator.java:genSessionByRefererInParallel(224)) - Initial Session count: 2859690
2017-05-09 22:20:00,967 INFO  pre.SessionGenerator (SessionGenerator.java:execute(71)) - Session generating complete. Time elapsed 946 seconds.
2017-05-09 22:20:00,975 INFO  pre.SessionStatistic (SessionStatistic.java:execute(66)) - Starting Session Summarization.
2017-05-09 22:20:06,105 INFO  pre.SessionStatistic (SessionStatistic.java:processSessionInParallel(172)) - Final Session count: 7
2017-05-09 22:20:06,107 INFO  pre.SessionStatistic (SessionStatistic.java:execute(79)) - Session Summarization complete. Time elapsed 5 seconds.
2017-05-09 22:20:06,109 INFO  pre.RemoveRawLog (RemoveRawLog.java:execute(41)) - Starting raw log removal.
2017-05-09 22:29:23,095 INFO  pre.RemoveRawLog (RemoveRawLog.java:execute(47)) - Raw log removal complete. Time elapsed 556 seconds.
2017-05-09 22:29:23,095 INFO  discoveryengine.WeblogDiscoveryEngine (WeblogDiscoveryEngine.java:preprocess(125)) - Web log preprocessing for logs dated 201703.gz complete. Time elapsed 2744 seconds.
2017-05-09 22:29:23,107 INFO  pre.HistoryGenerator (HistoryGenerator.java:execute(47)) - Starting HistoryGenerator...
/usr/local/sdeploy/mudrod-ingest/ftp-mirror/2017/03/UserHistoryMatrix.csv
podaaclog
2017-05-09 22:29:23,235 INFO  pre.HistoryGenerator (HistoryGenerator.java:execute(53)) - HistoryGenerator complete. Time elapsed 0 seconds
2017-05-09 22:29:23,242 INFO  pre.ClickStreamGenerator (ClickStreamGenerator.java:execute(48)) - Starting ClickStreamGenerator...
podaaclog
[podaaclog201703.gz]click stream number:0
2017-05-09 22:29:23,962 INFO  pre.ClickStreamGenerator (ClickStreamGenerator.java:execute(65)) - ClickStreamGenerator complete. Time elapsed 0 seconds.
2017-05-09 22:29:23,963 INFO  discoveryengine.WeblogDiscoveryEngine (WeblogDiscoveryEngine.java:preprocess(134)) - Web log preprocessing (user history and clickstream) complete.
2017-05-09 22:29:23,963 INFO  discoveryengine.WeblogDiscoveryEngine (WeblogDiscoveryEngine.java:process(182)) - Starting Web log processing.
2017-05-09 22:29:23,965 INFO  process.ClickStreamAnalyzer (ClickStreamAnalyzer.java:execute(49)) - Starting ClickStreamAnalyzer...
2017-05-09 22:29:23,992 ERROR process.ClickStreamAnalyzer (ClickStreamAnalyzer.java:execute(57)) - Encountered an error during execution of ClickStreamAnalyzer.
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/usr/local/sdeploy/mudrod-ingest/ftp-mirror/2017/03/ClickstreamMatrix.csv
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.ZippedWithIndexRDD.<init>(ZippedWithIndexRDD.scala:44)
    at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293)
    at org.apache.spark.rdd.RDD$$anonfun$zipWithIndex$1.apply(RDD.scala:1293)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.zipWithIndex(RDD.scala:1292)
    at org.apache.spark.api.java.JavaRDDLike$class.zipWithIndex(JavaRDDLike.scala:342)
    at org.apache.spark.api.java.AbstractJavaRDDLike.zipWithIndex(JavaRDDLike.scala:45)
    at gov.nasa.jpl.mudrod.utils.MatrixUtil.loadVectorFromCSV(MatrixUtil.java:352)
    at gov.nasa.jpl.mudrod.semantics.SVDAnalyzer.getSVDMatrix(SVDAnalyzer.java:54)
    at gov.nasa.jpl.mudrod.weblog.process.ClickStreamAnalyzer.execute(ClickStreamAnalyzer.java:53)
    at gov.nasa.jpl.mudrod.discoveryengine.WeblogDiscoveryEngine.process(WeblogDiscoveryEngine.java:186)
    at gov.nasa.jpl.mudrod.main.MudrodEngine.startFullIngest(MudrodEngine.java:242)
    at gov.nasa.jpl.mudrod.main.MudrodEngine.main(MudrodEngine.java:388)
java.io.FileNotFoundException: /usr/local/sdeploy/mudrod-ingest/ftp-mirror/2017/03/ClickstreamMatrix.csv (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at java.io.FileInputStream.<init>(FileInputStream.java:93)
    at java.io.FileReader.<init>(FileReader.java:58)
    at gov.nasa.jpl.mudrod.ssearch.ClickstreamImporter.importfromCSVtoES(ClickstreamImporter.java:73)
    at gov.nasa.jpl.mudrod.weblog.process.ClickStreamAnalyzer.execute(ClickStreamAnalyzer.java:62)
    at gov.nasa.jpl.mudrod.discoveryengine.WeblogDiscoveryEngine.process(WeblogDiscoveryEngine.java:186)
    at gov.nasa.jpl.mudrod.main.MudrodEngine.startFullIngest(MudrodEngine.java:242)
    at gov.nasa.jpl.mudrod.main.MudrodEngine.main(MudrodEngine.java:388)
2017-05-09 22:29:24,017 INFO  process.ClickStreamAnalyzer (ClickStreamAnalyzer.java:execute(66)) - ClickStreamAnalyzer complete. Time elapsed: 0s
2017-05-09 22:29:24,019 INFO  process.UserHistoryAnalyzer (UserHistoryAnalyzer.java:execute(47)) - Starting UserHistoryAnalyzer...
usage: MudrodEngine: 'dataDir' argument is mandatory. User must also
       provide an ingest method. -dataDir </path/to/data/directory>
       [-esHost <host_name>] [-esPort <port_num>] [-esTCPPort <port_num>]
       [-f] [-h] [-l] [-m] [-p]
 -dataDir,--dataDirectory </path/to/data/directory>      the data
                                                         directory to be
                                                         processed by
                                                         Mudrod
 -esHost,--elasticSearchHost <host_name>                 elasticsearch
                                                         cluster unicast
                                                         host
 -esPort,--elasticSearchHTTPPort <port_num>              elasticsearch
                                                         HTTP/REST port
 -esTCPPort,--elasticSearchTransportTCPPort <port_num>   elasticsearch
                                                         transport TCP
                                                         port
 -f,--fullIngest                                         begin full ingest
                                                         Mudrod workflow
 -h,--help                                               show this help
                                                         message
 -l,--logIngest                                          begin log ingest
 -m,--metaIngest                                         begin metadata
                                                         ingest
 -p,--processingWithPreResults                           begin processing
                                                         with
                                                         preprocessing
                                                         results
2017-05-09 22:29:24,109 ERROR main.MudrodEngine (MudrodEngine.java:main(398)) - Error whilst parsing command line.
java.lang.UnsupportedOperationException: empty collection
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1027)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1027)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1027)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
    at org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix.numRows(IndexedRowMatrix.scala:66)
    at org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix.toCoordinateMatrix(IndexedRowMatrix.scala:130)
    at gov.nasa.jpl.mudrod.utils.MatrixUtil.transposeMatrix(MatrixUtil.java:411)
    at gov.nasa.jpl.mudrod.utils.SimilarityUtil.calculateSimilarityFromVector(SimilarityUtil.java:63)
    at gov.nasa.jpl.mudrod.semantics.SemanticAnalyzer.calTermSimfromMatrix(SemanticAnalyzer.java:66)
    at gov.nasa.jpl.mudrod.semantics.SemanticAnalyzer.calTermSimfromMatrix(SemanticAnalyzer.java:60)
    at gov.nasa.jpl.mudrod.weblog.process.UserHistoryAnalyzer.execute(UserHistoryAnalyzer.java:51)
    at gov.nasa.jpl.mudrod.discoveryengine.WeblogDiscoveryEngine.process(WeblogDiscoveryEngine.java:189)
    at gov.nasa.jpl.mudrod.main.MudrodEngine.startFullIngest(MudrodEngine.java:242)
    at gov.nasa.jpl.mudrod.main.MudrodEngine.main(MudrodEngine.java:388)
2017-05-09 22:29:24,128 INFO  server.ServerConnector (AbstractConnector.java:doStop(306)) - Stopped ServerConnector@72a2312e{HTTP/1.1}{127.0.0.1:4041}
fgreg commented 7 years ago

In order to avoid the problem from issue https://github.com/mudrod/mudrod/issues/167 I just moved the ssl_* files out of the data directory before running full ingest.

fgreg commented 7 years ago

I get a similar error when trying to process 02/2017 logs

2017-05-09 22:56:30,414 INFO  discoveryengine.WeblogDiscoveryEngine (WeblogDiscoveryEngine.java:<init>(51)) - Started Mudrod Weblog Discovery Engine.
2017-05-09 22:56:30,414 INFO  discoveryengine.WeblogDiscoveryEngine (WeblogDiscoveryEngine.java:preprocess(98)) - Starting Web log preprocessing.
2017-05-09 22:56:30,415 INFO  discoveryengine.WeblogDiscoveryEngine (WeblogDiscoveryEngine.java:preprocess(106)) - Processing logs dated 201702.gz
2017-05-09 22:56:30,563 INFO  pre.ImportLogFile (ImportLogFile.java:execute(74)) - Starting Log Import 201702.gz
2017-05-09 22:56:31,556 INFO  util.Version (Version.java:logVersion(108)) - Elasticsearch Hadoop v5.2.0 [d85a257f9f]
2017-05-09 22:56:31,890 INFO  rdd.EsRDDWriter (RestService.java:createWriter(562)) - Writing to [podaaclog201702.gz/rawhttp]
2017-05-09 23:00:07,458 INFO  rdd.EsRDDWriter (RestService.java:createWriter(562)) - Writing to [podaaclog201702.gz/rawftp]
2017-05-09 23:20:56,001 INFO  pre.ImportLogFile (ImportLogFile.java:execute(78)) - Log Import complete. Time elapsed 1465 seconds
2017-05-09 23:20:56,007 INFO  pre.CrawlerDetection (CrawlerDetection.java:execute(93)) - Starting Crawler detection rawhttp.
2017-05-09 23:20:58,531 INFO  pre.CrawlerDetection (CrawlerDetection.java:checkByRateInParallel(159)) - Original User count: 6658
2017-05-09 23:22:14,913 INFO  pre.CrawlerDetection (CrawlerDetection.java:checkByRateInParallel(177)) - User count: 6610
2017-05-09 23:22:14,914 INFO  pre.CrawlerDetection (CrawlerDetection.java:execute(102)) - Crawler detection complete. Time elapsed 78 seconds
2017-05-09 23:22:14,918 INFO  pre.SessionGenerator (SessionGenerator.java:execute(66)) - Starting Session Generation.
2017-05-09 23:24:17,775 INFO  pre.SessionGenerator (SessionGenerator.java:genSessionByRefererInParallel(224)) - Initial Session count: 1082274
2017-05-09 23:38:25,991 INFO  pre.SessionGenerator (SessionGenerator.java:execute(71)) - Session generating complete. Time elapsed 971 seconds.
2017-05-09 23:38:25,995 INFO  pre.SessionStatistic (SessionStatistic.java:execute(66)) - Starting Session Summarization.
2017-05-09 23:39:50,255 INFO  pre.SessionStatistic (SessionStatistic.java:processSessionInParallel(172)) - Final Session count: 583
2017-05-09 23:39:50,257 INFO  pre.SessionStatistic (SessionStatistic.java:execute(79)) - Session Summarization complete. Time elapsed 84 seconds.
2017-05-09 23:39:50,259 INFO  pre.RemoveRawLog (RemoveRawLog.java:execute(41)) - Starting raw log removal.
2017-05-09 23:53:10,067 INFO  pre.RemoveRawLog (RemoveRawLog.java:execute(47)) - Raw log removal complete. Time elapsed 799 seconds.
2017-05-09 23:53:10,067 INFO  discoveryengine.WeblogDiscoveryEngine (WeblogDiscoveryEngine.java:preprocess(125)) - Web log preprocessing for logs dated 201702.gz complete. Time elapsed 3399 seconds.
2017-05-09 23:53:10,075 INFO  pre.HistoryGenerator (HistoryGenerator.java:execute(47)) - Starting HistoryGenerator...
/usr/local/sdeploy/mudrod-ingest/ftp-mirror/2017/02/UserHistoryMatrix.csv
podaaclog
2017-05-09 23:53:10,228 INFO  pre.HistoryGenerator (HistoryGenerator.java:execute(53)) - HistoryGenerator complete. Time elapsed 0 seconds
2017-05-09 23:53:10,229 INFO  pre.ClickStreamGenerator (ClickStreamGenerator.java:execute(48)) - Starting ClickStreamGenerator...
podaaclog
click stream number:861                                                         
2017-05-09 23:53:12,269 INFO  pre.ClickStreamGenerator (ClickStreamGenerator.java:execute(65)) - ClickStreamGenerator complete. Time elapsed 2 seconds.
2017-05-09 23:53:12,270 INFO  discoveryengine.WeblogDiscoveryEngine (WeblogDiscoveryEngine.java:preprocess(134)) - Web log preprocessing (user history and clickstream) complete.
2017-05-09 23:53:12,270 INFO  discoveryengine.WeblogDiscoveryEngine (WeblogDiscoveryEngine.java:process(182)) - Starting Web log processing.
2017-05-09 23:53:12,271 INFO  process.ClickStreamAnalyzer (ClickStreamAnalyzer.java:execute(49)) - Starting ClickStreamAnalyzer...
2017-05-09 23:53:13,113 WARN  distributed.RowMatrix (Logging.scala:logWarning(66)) - The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.
2017-05-09 23:53:13,199 WARN  netlib.ARPACK (ARPACK.java:<clinit>(61)) - Failed to load implementation from: com.github.fommil.netlib.NativeSystemARPACK
2017-05-09 23:53:13,285 INFO  jni.JniLoader (JniLoader.java:liberalLoad(90)) - successfully loaded /tmp/jniloader2693392545934265752netlib-native_ref-linux-x86_64.so
2017-05-09 23:53:15,222 WARN  distributed.RowMatrix (Logging.scala:logWarning(66)) - The input data was not directly cached, which may hurt performance if its parent RDDs are also uncached.
2017-05-09 23:53:17,278 INFO  process.ClickStreamAnalyzer (ClickStreamAnalyzer.java:execute(66)) - ClickStreamAnalyzer complete. Time elapsed: 5s
2017-05-09 23:53:17,280 INFO  process.UserHistoryAnalyzer (UserHistoryAnalyzer.java:execute(47)) - Starting UserHistoryAnalyzer...
usage: MudrodEngine: 'dataDir' argument is mandatory. User must also
       provide an ingest method. -dataDir </path/to/data/directory>
       [-esHost <host_name>] [-esPort <port_num>] [-esTCPPort <port_num>]
       [-f] [-h] [-l] [-m] [-p]
 -dataDir,--dataDirectory </path/to/data/directory>      the data
                                                         directory to be
                                                         processed by
                                                         Mudrod
 -esHost,--elasticSearchHost <host_name>                 elasticsearch
                                                         cluster unicast
                                                         host
 -esPort,--elasticSearchHTTPPort <port_num>              elasticsearch
                                                         HTTP/REST port
 -esTCPPort,--elasticSearchTransportTCPPort <port_num>   elasticsearch
                                                         transport TCP
                                                         port
 -f,--fullIngest                                         begin full ingest
                                                         Mudrod workflow
 -h,--help                                               show this help
                                                         message
 -l,--logIngest                                          begin log ingest
 -m,--metaIngest                                         begin metadata
                                                         ingest
 -p,--processingWithPreResults                           begin processing
                                                         with
                                                         preprocessing
                                                         results
2017-05-09 23:53:17,344 ERROR main.MudrodEngine (MudrodEngine.java:main(398)) - Error whilst parsing command line.
java.lang.UnsupportedOperationException: empty collection
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1027)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1027)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1027)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
    at org.apache.spark.mllib.linalg.distributed.CoordinateMatrix.computeSize(CoordinateMatrix.scala:151)
    at org.apache.spark.mllib.linalg.distributed.CoordinateMatrix.numCols(CoordinateMatrix.scala:58)
    at org.apache.spark.mllib.linalg.distributed.CoordinateMatrix.transpose(CoordinateMatrix.scala:75)
    at gov.nasa.jpl.mudrod.utils.MatrixUtil.transposeMatrix(MatrixUtil.java:411)
    at gov.nasa.jpl.mudrod.utils.SimilarityUtil.calculateSimilarityFromVector(SimilarityUtil.java:63)
    at gov.nasa.jpl.mudrod.semantics.SemanticAnalyzer.calTermSimfromMatrix(SemanticAnalyzer.java:66)
    at gov.nasa.jpl.mudrod.semantics.SemanticAnalyzer.calTermSimfromMatrix(SemanticAnalyzer.java:60)
    at gov.nasa.jpl.mudrod.weblog.process.UserHistoryAnalyzer.execute(UserHistoryAnalyzer.java:51)
    at gov.nasa.jpl.mudrod.discoveryengine.WeblogDiscoveryEngine.process(WeblogDiscoveryEngine.java:189)
    at gov.nasa.jpl.mudrod.main.MudrodEngine.startFullIngest(MudrodEngine.java:242)
    at gov.nasa.jpl.mudrod.main.MudrodEngine.main(MudrodEngine.java:388)
quintinali commented 7 years ago

@fgreg Could you please send the HTTP and FTP logs of 201703 to me or attached them to the comment? I can only download ssl_WWW.201703.gz from the link ftp://podaac.jpl.nasa.gov/misc/outgoing/cjf/mudrod/2017/03

fgreg commented 7 years ago

@quintinali you can use the 02/2017 logs to reproduce this problem as well. If we get the 'ssl_' prefix sorted I'm sure the ssl logs will exhibit the same behavior.

The ssl logs are the ones we should be ingesting. If you need to test without fixing the prefix problem https://github.com/mudrod/mudrod/issues/167 , just change the name and remove 'ssl'

fgreg commented 7 years ago

Fixed by PR https://github.com/mudrod/mudrod/pull/171