piranha32 opened this issue 6 years ago
Subsequent execution of parse.py with the same input data, and without deleting the output directory, completed successfully, but the index seems to be corrupted. Twofishes starts and responds to queries without crashing, but throws an exception on each query:
WARNING: Unexpected EOF reading file:data/prefix_index/index at entry #0. Ignoring.
Mar 21, 2018 1:27:08 AM io.fsq.twofishes.server.HandleExceptions$$anonfun$apply$6 applyOrElse
SEVERE: got error: java.io.EOFException: reached EOF while trying to read 13 bytes
java.io.EOFException: reached EOF while trying to read 13 bytes
at io.fsq.twofishes.core.MMapInputStreamImpl.read(MMapInputStream.scala:91)
at io.fsq.twofishes.core.MMapInputStream.read(MMapInputStream.scala:160)
at io.fsq.twofishes.core.MMapInputStream.read(MMapInputStream.scala:139)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:70)
at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:120)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2359)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2258)
at io.fsq.twofishes.core.MapFileConcurrentReader.findPosition(MapFileConcurrentReader.java:245)
at io.fsq.twofishes.core.MapFileConcurrentReader.get(MapFileConcurrentReader.java:279)
at io.fsq.twofishes.server.MapFileInput$$anonfun$6.apply(HFileStorageService.scala:205)
at io.fsq.twofishes.server.MapFileInput$$anonfun$6.apply(HFileStorageService.scala:203)
at io.fsq.twofishes.util.DurationUtils$.inMilliseconds(DurationUtils.scala:10)
at io.fsq.twofishes.server.MapFileInput.lookup(HFileStorageService.scala:203)
at io.fsq.twofishes.server.PrefixIndexMapFileInput.get(HFileStorageService.scala:310)
at io.fsq.twofishes.server.NameIndexHFileInput.getPrefix(HFileStorageService.scala:244)
at io.fsq.twofishes.server.HFileStorageService.getIdsByNamePrefix(HFileStorageService.scala:54)
at io.fsq.twofishes.server.HotfixableGeocodeStorageService.getIdsByNamePrefix(HotfixableGeocodeStorageService.scala:22)
at io.fsq.twofishes.server.AutocompleteGeocoderImpl$$anonfun$9.apply(AutocompleteGeocoderImpl.scala:220)
at io.fsq.twofishes.server.AutocompleteGeocoderImpl$$anonfun$9.apply(AutocompleteGeocoderImpl.scala:198)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at io.fsq.twofishes.server.AutocompleteGeocoderImpl.generateAutoParsesHelper(AutocompleteGeocoderImpl.scala:198)
at io.fsq.twofishes.server.AutocompleteGeocoderImpl.generateAutoParses(AutocompleteGeocoderImpl.scala:275)
at io.fsq.twofishes.server.AutocompleteGeocoderImpl.doGeocodeImpl(AutocompleteGeocoderImpl.scala:294)
at io.fsq.twofishes.server.AutocompleteGeocoderImpl.doGeocodeImpl(AutocompleteGeocoderImpl.scala:31)
at io.fsq.twofishes.server.AbstractGeocoderImpl$$anonfun$doGeocode$1.apply(AbstractGeocoderImpl.scala:33)
at com.twitter.util.Duration$.inMilliseconds(Duration.scala:183)
at com.twitter.ostrich.stats.StatsProvider$class.time(StatsProvider.scala:196)
at com.twitter.ostrich.stats.StatsCollection.time(StatsCollection.scala:31)
at io.fsq.twofishes.server.AbstractGeocoderImpl$Stats$.time(AbstractGeocoderImpl.scala:43)
at io.fsq.twofishes.server.AbstractGeocoderImpl.doGeocode(AbstractGeocoderImpl.scala:32)
at io.fsq.twofishes.server.GeocodeRequestDispatcher.geocode(GeocodeRequestDispatcher.scala:28)
at io.fsq.twofishes.server.GeocodeServerImpl$$anonfun$geocode$2.apply(GeocodeServer.scala:218)
at io.fsq.twofishes.server.GeocodeServerImpl$$anonfun$geocode$2.apply(GeocodeServer.scala:218)
at com.twitter.util.Try$.apply(Try.scala:13)
at com.twitter.util.ExecutorServiceFuturePool$$anon$2.run(FuturePool.scala:115)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I'll preface this 'solution' by saying that I'm not a Scala programmer, so there may be a better way to handle this. However, I was able to temporarily get past this build error and avoid the corrupted index by editing fsqio/src/jvm/io/fsq/twofishes/indexer/output/PrefixIndexer.scala to add a try/catch block around the section of code where the error occurs (example below). In the current version of that file, that means enclosing lines 115 through 140.
for {
  (prefix, index) <- sortedPrefixes.zipWithIndex
} {
  if (index % 1000 == 0) {
    log.info("done with %d of %d prefixes".format(index, numPrefixes))
  }
  try {
    val records = getRecordsByPrefix(prefix, PrefixIndexer.MaxNamesToConsider)

    val (woeMatches, woeMismatches) = records.partition(r =>
      bestWoeTypes.contains(r.woeTypeOrThrow))

    val (prefSortedRecords, unprefSortedRecords) =
      sortRecordsByNames(woeMatches.toList)

    val fids = new HashSet[StoredFeatureId]
    //roundRobinByCountryCode(prefSortedRecords).foreach(f => {
    prefSortedRecords.foreach(f => {
      if (fids.size < PrefixIndexer.MaxFidsToStorePerPrefix) {
        fids.add(f.fidAsFeatureId)
      }
    })

    if (fids.size < PrefixIndexer.MaxFidsWithPreferredNamesBeforeConsideringNonPreferred) {
      //roundRobinByCountryCode(unprefSortedRecords).foreach(f => {
      unprefSortedRecords.foreach(f => {
        if (fids.size < PrefixIndexer.MaxFidsToStorePerPrefix) {
          fids.add(f.fidAsFeatureId)
        }
      })
    }

    prefixWriter.append(prefix, fidsToCanonicalFids(fids.toList))
  } catch {
    case e: Exception => println("Skipping due to error processing prefixes")
  }
}
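For what it's worth, a slightly more informative variant of the catch clause could log which prefix failed and why, so the entries dropped from the prefix index can be tracked down afterwards. This is just a sketch, assuming the log instance used earlier in the loop and the prefix binding from the for comprehension are in scope at that point:

  } catch {
    case e: Exception =>
      // Skip this prefix, but record which one failed and why, so the
      // entries missing from the prefix index can be identified later.
      log.info("skipping prefix %s due to error: %s".format(prefix, e))
  }

Either way the effect is the same: the offending prefix is simply left out of the prefix index instead of aborting the whole build.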
If you already built the Mongo DB, you'll obviously want to throw that out and start again from the "mongod --dbpath /local/directory/for/output/" step in the import instructions.
While importing the data into the DB with "./src/jvm/io/fsq/twofishes/scripts/parse.py -w `pwd`/data/" the indexer crashed with the following error message. After the crash the indexer was stuck doing nothing, and no more records were processed. The processed data was downloaded with "./src/jvm/io/fsq/twofishes/scripts/download-world.sh".