foursquare / fsqio

A monorepo that holds all of Foursquare's opensource projects
Apache License 2.0

Twofishes indexer does not handle MongoDB query failures properly. #53

Open piranha32 opened 6 years ago

piranha32 commented 6 years ago

While importing the data into the DB with "./src/jvm/io/fsq/twofishes/scripts/parse.py -w `pwd`/data/", the indexer crashed with the error message below. After the crash the indexer was stuck doing nothing, and no more records were processed. The processed data was downloaded with "./src/jvm/io/fsq/twofishes/scripts/download-world.sh".

8246718 INFO  i.f.t.indexer.output.PrefixIndexer - done with 114000 of 2994876 prefixes
8247980 INFO  i.f.t.indexer.output.PrefixIndexer - done with 115000 of 2994876 prefixes
Exception in thread "main" io.fsq.rogue.RogueException: Mongo query on geocoder [db.name_index.find({ "name" : { "$regex" : "^?" , "$options" : ""} , "excludeFromPrefixIndex" : { "$ne" : true}}).sort({ "pop" : -1}).limit(1000)] failed after 2 ms
        at io.fsq.rogue.adapter.BlockingMongoClientAdapter.runCommand(BlockingMongoClientAdapter.scala:118)
        at io.fsq.rogue.adapter.BlockingMongoClientAdapter.runCommand(BlockingMongoClientAdapter.scala:52)
        at io.fsq.rogue.adapter.MongoClientAdapter.queryRunner(MongoClientAdapter.scala:471)
        at io.fsq.rogue.adapter.MongoClientAdapter.query(MongoClientAdapter.scala:496)
        at io.fsq.rogue.query.QueryExecutor.fetch(QueryExecutor.scala:144)
        at io.fsq.twofishes.indexer.output.PrefixIndexer.getRecordsByPrefix(PrefixIndexer.scala:76)
        at io.fsq.twofishes.indexer.output.PrefixIndexer$$anonfun$writeIndexImpl$2.apply(PrefixIndexer.scala:115)
        at io.fsq.twofishes.indexer.output.PrefixIndexer$$anonfun$writeIndexImpl$2.apply(PrefixIndexer.scala:110)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
        at io.fsq.twofishes.indexer.output.PrefixIndexer.writeIndexImpl(PrefixIndexer.scala:110)
        at io.fsq.twofishes.indexer.output.Indexer$$anonfun$writeIndex$1.apply$mcV$sp(Indexer.scala:54)
        at io.fsq.twofishes.indexer.output.Indexer$$anonfun$writeIndex$1.apply(Indexer.scala:54)
        at io.fsq.twofishes.indexer.output.Indexer$$anonfun$writeIndex$1.apply(Indexer.scala:54)
        at io.fsq.twofishes.util.DurationUtils$.inNanoseconds(DurationUtils.scala:16)
        at io.fsq.twofishes.util.DurationUtils$class.logDuration(DurationUtils.scala:23)
        at io.fsq.twofishes.indexer.output.Indexer.logDuration(Indexer.scala:37)
        at io.fsq.twofishes.indexer.output.Indexer.writeIndex(Indexer.scala:54)
        at io.fsq.twofishes.indexer.output.NameIndexer.writeIndexImpl(NameIndexer.scala:84)
        at io.fsq.twofishes.indexer.output.Indexer$$anonfun$writeIndex$1.apply$mcV$sp(Indexer.scala:54)
        at io.fsq.twofishes.indexer.output.Indexer$$anonfun$writeIndex$1.apply(Indexer.scala:54)
        at io.fsq.twofishes.indexer.output.Indexer$$anonfun$writeIndex$1.apply(Indexer.scala:54)
        at io.fsq.twofishes.util.DurationUtils$.inNanoseconds(DurationUtils.scala:16)
        at io.fsq.twofishes.util.DurationUtils$class.logDuration(DurationUtils.scala:23)
        at io.fsq.twofishes.indexer.output.Indexer.logDuration(Indexer.scala:37)
        at io.fsq.twofishes.indexer.output.Indexer.writeIndex(Indexer.scala:54)
        at io.fsq.twofishes.indexer.output.OutputIndexes.buildIndexes(OutputHFile.scala:28)
        at io.fsq.twofishes.indexer.importers.geonames.GeonamesParser$.writeIndexes(GeonamesParser.scala:144)
        at io.fsq.twofishes.indexer.importers.geonames.GeonamesParser$.main(GeonamesParser.scala:106)
        at io.fsq.twofishes.indexer.importers.geonames.GeonamesParser.main(GeonamesParser.scala)
Caused by: com.mongodb.MongoQueryException: Query failed with error code 2 and error message 'Regular expression is invalid: invalid UTF-8 string' on server 127.0.0.1:27017
        at com.mongodb.operation.FindOperation$1.call(FindOperation.java:521)
        at com.mongodb.operation.FindOperation$1.call(FindOperation.java:510)
        at com.mongodb.operation.OperationHelper.withConnectionSource(OperationHelper.java:431)
        at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:404)
        at com.mongodb.operation.FindOperation.execute(FindOperation.java:510)
        at com.mongodb.operation.FindOperation.execute(FindOperation.java:81)
        at com.mongodb.Mongo.execute(Mongo.java:836)
        at com.mongodb.Mongo$2.execute(Mongo.java:823)
        at com.mongodb.OperationIterable.iterator(OperationIterable.java:47)
        at com.mongodb.OperationIterable.forEach(OperationIterable.java:70)
        at com.mongodb.FindIterableImpl.forEach(FindIterableImpl.java:166)
        at io.fsq.rogue.adapter.BlockingMongoClientAdapter.forEachProcessor(BlockingMongoClientAdapter.scala:162)
        at io.fsq.rogue.adapter.BlockingMongoClientAdapter.forEachProcessor(BlockingMongoClientAdapter.scala:52)
        at io.fsq.rogue.adapter.MongoClientAdapter$$anonfun$query$1.apply(MongoClientAdapter.scala:496)
        at io.fsq.rogue.adapter.BlockingMongoClientAdapter.findImpl(BlockingMongoClientAdapter.scala:272)
        at io.fsq.rogue.adapter.BlockingMongoClientAdapter.findImpl(BlockingMongoClientAdapter.scala:52)
        at io.fsq.rogue.adapter.MongoClientAdapter$$anonfun$queryRunner$1.apply(MongoClientAdapter.scala:472)
        at io.fsq.rogue.util.DefaultQueryLogger.onExecuteQuery(QueryLogger.scala:52)
        at io.fsq.rogue.adapter.BlockingMongoClientAdapter.runCommand(BlockingMongoClientAdapter.scala:113)
        ... 30 more
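
The root cause is in the "Caused by" section: the server rejects the query with "Regular expression is invalid: invalid UTF-8 string", because the indexer interpolates each prefix into a `$regex` without checking that it encodes as valid UTF-8. Below is a minimal sketch of a guard, assuming the offending prefixes come from strings (e.g. containing unpaired surrogates) that cannot be encoded as valid UTF-8; `encodesAsValidUtf8` is a hypothetical helper, not part of the Twofishes codebase:

    import java.nio.CharBuffer
    import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

    // Hypothetical guard, not in Twofishes: returns false for strings that
    // cannot be encoded as valid UTF-8 (e.g. ones containing unpaired
    // surrogates), which the server would reject when embedded in a $regex.
    def encodesAsValidUtf8(s: String): Boolean = {
      val encoder = StandardCharsets.UTF_8.newEncoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT)
      try {
        encoder.encode(CharBuffer.wrap(s))
        true
      } catch {
        case _: CharacterCodingException => false
      }
    }

Prefixes failing such a check could be logged and skipped before PrefixIndexer issues the name_index query, instead of crashing the whole import.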
piranha32 commented 6 years ago

A subsequent execution of parse.py with the same input data, without deleting the output directory, completed successfully, but the index seems to be corrupted. Twofishes starts and responds to queries without crashing, but throws exceptions on each query:

WARNING: Unexpected EOF reading file:data/prefix_index/index at entry #0.  Ignoring.
Mar 21, 2018 1:27:08 AM io.fsq.twofishes.server.HandleExceptions$$anonfun$apply$6 applyOrElse
SEVERE: got error: java.io.EOFException: reached EOF while trying to read 13 bytes
java.io.EOFException: reached EOF while trying to read 13 bytes
        at io.fsq.twofishes.core.MMapInputStreamImpl.read(MMapInputStream.scala:91)
        at io.fsq.twofishes.core.MMapInputStream.read(MMapInputStream.scala:160)
        at io.fsq.twofishes.core.MMapInputStream.read(MMapInputStream.scala:139)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:70)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:120)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2359)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2258)
        at io.fsq.twofishes.core.MapFileConcurrentReader.findPosition(MapFileConcurrentReader.java:245)
        at io.fsq.twofishes.core.MapFileConcurrentReader.get(MapFileConcurrentReader.java:279)
        at io.fsq.twofishes.server.MapFileInput$$anonfun$6.apply(HFileStorageService.scala:205)
        at io.fsq.twofishes.server.MapFileInput$$anonfun$6.apply(HFileStorageService.scala:203)
        at io.fsq.twofishes.util.DurationUtils$.inMilliseconds(DurationUtils.scala:10)
        at io.fsq.twofishes.server.MapFileInput.lookup(HFileStorageService.scala:203)
        at io.fsq.twofishes.server.PrefixIndexMapFileInput.get(HFileStorageService.scala:310)
        at io.fsq.twofishes.server.NameIndexHFileInput.getPrefix(HFileStorageService.scala:244)
        at io.fsq.twofishes.server.HFileStorageService.getIdsByNamePrefix(HFileStorageService.scala:54)
        at io.fsq.twofishes.server.HotfixableGeocodeStorageService.getIdsByNamePrefix(HotfixableGeocodeStorageService.scala:22)
        at io.fsq.twofishes.server.AutocompleteGeocoderImpl$$anonfun$9.apply(AutocompleteGeocoderImpl.scala:220)
        at io.fsq.twofishes.server.AutocompleteGeocoderImpl$$anonfun$9.apply(AutocompleteGeocoderImpl.scala:198)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.Range.foreach(Range.scala:160)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at io.fsq.twofishes.server.AutocompleteGeocoderImpl.generateAutoParsesHelper(AutocompleteGeocoderImpl.scala:198)
        at io.fsq.twofishes.server.AutocompleteGeocoderImpl.generateAutoParses(AutocompleteGeocoderImpl.scala:275)
        at io.fsq.twofishes.server.AutocompleteGeocoderImpl.doGeocodeImpl(AutocompleteGeocoderImpl.scala:294)
        at io.fsq.twofishes.server.AutocompleteGeocoderImpl.doGeocodeImpl(AutocompleteGeocoderImpl.scala:31)
        at io.fsq.twofishes.server.AbstractGeocoderImpl$$anonfun$doGeocode$1.apply(AbstractGeocoderImpl.scala:33)
        at com.twitter.util.Duration$.inMilliseconds(Duration.scala:183)
        at com.twitter.ostrich.stats.StatsProvider$class.time(StatsProvider.scala:196)
        at com.twitter.ostrich.stats.StatsCollection.time(StatsCollection.scala:31)
        at io.fsq.twofishes.server.AbstractGeocoderImpl$Stats$.time(AbstractGeocoderImpl.scala:43)
        at io.fsq.twofishes.server.AbstractGeocoderImpl.doGeocode(AbstractGeocoderImpl.scala:32)
        at io.fsq.twofishes.server.GeocodeRequestDispatcher.geocode(GeocodeRequestDispatcher.scala:28)
        at io.fsq.twofishes.server.GeocodeServerImpl$$anonfun$geocode$2.apply(GeocodeServer.scala:218)
        at io.fsq.twofishes.server.GeocodeServerImpl$$anonfun$geocode$2.apply(GeocodeServer.scala:218)
        at com.twitter.util.Try$.apply(Try.scala:13)
        at com.twitter.util.ExecutorServiceFuturePool$$anon$2.run(FuturePool.scala:115)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
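
The "Unexpected EOF ... at entry #0" warning suggests the prefix index file was truncated when the first run crashed mid-write. As a diagnostic, one could check how far the file can be read before truncation; the sketch below is hypothetical, not part of Twofishes, and assumes the index is a standard Hadoop SequenceFile with Writable keys and values, which may not hold for every Twofishes index:

    import java.io.EOFException
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{SequenceFile, Writable}
    import org.apache.hadoop.util.ReflectionUtils

    // Hypothetical diagnostic: count how many complete entries can be read
    // from a SequenceFile before hitting a truncated record.
    def countReadableEntries(path: String): Long = {
      val conf = new Configuration()
      val reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(new Path(path)))
      val key = ReflectionUtils.newInstance(reader.getKeyClass, conf).asInstanceOf[Writable]
      val value = ReflectionUtils.newInstance(reader.getValueClass, conf).asInstanceOf[Writable]
      var count = 0L
      try {
        while (reader.next(key, value)) { count += 1 }
      } catch {
        case _: EOFException => println(s"hit EOF mid-entry after $count complete entries")
      } finally {
        reader.close()
      }
      count
    }

    // e.g. countReadableEntries("data/prefix_index/index")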
jglesner commented 6 years ago

I'll preface this 'solution' by saying that I'm not a Scala programmer, so there may be a better way to handle this. However, I was able to temporarily get past this error, and avoid the corrupted index, by editing fsqio/src/jvm/io/fsq/twofishes/indexer/output/PrefixIndexer.scala to add a try/catch block around the section of code where the error occurs (example below). In the current version of this file, that means enclosing lines 115 through 140.

    for {
      (prefix, index) <- sortedPrefixes.zipWithIndex
    } {
      if (index % 1000 == 0) {
        log.info("done with %d of %d prefixes".format(index, numPrefixes))
      }
      try {
        // The Mongo query that can fail on a bad prefix (the RogueException
        // above); everything that depends on its result stays inside the try.
        val records = getRecordsByPrefix(prefix, PrefixIndexer.MaxNamesToConsider)

        val (woeMatches, woeMismatches) = records.partition(r =>
          bestWoeTypes.contains(r.woeTypeOrThrow))

        val (prefSortedRecords, unprefSortedRecords) =
          sortRecordsByNames(woeMatches.toList)

        // Collect up to MaxFidsToStorePerPrefix feature ids, preferring
        // records with preferred names.
        val fids = new HashSet[StoredFeatureId]
        //roundRobinByCountryCode(prefSortedRecords).foreach(f => {
        prefSortedRecords.foreach(f => {
          if (fids.size < PrefixIndexer.MaxFidsToStorePerPrefix) {
            fids.add(f.fidAsFeatureId)
          }
        })

        if (fids.size < PrefixIndexer.MaxFidsWithPreferredNamesBeforeConsideringNonPreferred) {
          //roundRobinByCountryCode(unprefSortedRecords).foreach(f => {
          unprefSortedRecords.foreach(f => {
            if (fids.size < PrefixIndexer.MaxFidsToStorePerPrefix) {
              fids.add(f.fidAsFeatureId)
            }
          })
        }

        prefixWriter.append(prefix, fidsToCanonicalFids(fids.toList))
      } catch {
        // Skip the offending prefix instead of letting the exception abort
        // the indexer and leave a truncated index behind.
        case e: Exception => println("Skipping prefix due to error: %s".format(e.getMessage))
      }
    }
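
One refinement worth noting: `case e: Exception` swallows every failure, including I/O errors from the writer itself. Narrowing the catch to `io.fsq.rogue.RogueException` (the type thrown in the trace above), and logging the offending prefix along with the exception, would skip only the failed Mongo queries and leave a record of which prefixes were dropped.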

If you already built the Mongo DB, you'll obviously want to throw that out and start over from the mongod --dbpath /local/directory/for/output/ step in the import instructions.