IKANOW / Aleph2

The IKANOW v2 meta-database and analytics platform
Apache License 2.0

Deduplication Service processing too few inputs. #103

Open cburch opened 8 years ago

cburch commented 8 years ago

Need to investigate why, when running a dedupe job, the input set is cut off early and only the first x files are processed.

Cluster: CE cluster. We did some minor testing on nst and it was not occurring there.

Source: XXTestStaging, just an analytics job that has a dedupe step -> JS that just spits out the first entry every time; no special processing, nothing time consuming.

Input: I cp a ~500k record JSON file into the source's ready folder (via /temp), nothing special, about 6 fields/values. (There is a script on host 21 at /tmp/test1.sh that will do this for you, and a cron job at /etc/cron.d/test that will run it every 3 min for extended testing.)

Observations:

  1. It always breaks on an interval of 100 (i.e. in: 1700, out: 1700).
  2. Map Reduce throws an exception:

         Error: java.lang.NullPointerException
             at com.ikanow.aleph2.shared.crud.elasticsearch.services.ElasticsearchCrudService$ElasticsearchBatchSubsystem.getPossibleDeletionRequest(ElasticsearchCrudService.java:1272)
             at com.ikanow.aleph2.shared.crud.elasticsearch.services.ElasticsearchCrudService$ElasticsearchBatchSubsystem.storeObject(ElasticsearchCrudService.java:1313)
             at com.ikanow.aleph2.core.shared.services.MultiDataService.batchWrite(MultiDataService.java:268)
             at com.ikanow.aleph2.analytics.services.AnalyticsContext.emitObject(AnalyticsContext.java:1224)
             at com.ikanow.aleph2.analytics.hadoop.assets.BatchEnrichmentJob$BatchEnrichmentBaseMapper.lambda$null$8(BatchEnrichmentJob.java:530)
             at java.util.ArrayList.forEach(ArrayList.java:1249)
             at com.ikanow.aleph2.analytics.hadoop.assets.BatchEnrichmentJob$BatchEnrichmentBaseMapper.lambda$completeBatchFinalStage$9(BatchEnrichmentJob.java:529)
             at java.util.Optional.orElseGet(Optional.java:267)
             at com.ikanow.aleph2.analytics.hadoop.assets.BatchEnrichmentJob$BatchEnrichmentBaseMapper.completeBatchFinalStage(BatchEnrichmentJob.java:527)
             at com.ikanow.aleph2.analytics.hadoop.assets.BatchEnrichmentJob$BatchEnrichmentBase.checkBatch(BatchEnrichmentJob.java:282)
             at com.ikanow.aleph2.analytics.hadoop.assets.BatchEnrichmentJob$BatchEnrichmentBase.cleanup(BatchEnrichmentJob.java:297)
             at com.ikanow.aleph2.analytics.hadoop.assets.BatchEnrichmentJob$BatchEnrichmentMapper.cleanup(BatchEnrichmentJob.java:581)
             at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:149)
             at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
             at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
             at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
             at java.security.AccessController.doPrivileged(Native Method)
             at javax.security.auth.Subject.doAs(Subject.java:422)
             at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
             at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
  3. It does not appear to be correlated with host (it will eventually work on all hosts, and eventually fail on all hosts).
  4. If you just use a passthrough instead of the dedupe service, it succeeds 100% of the time, so it's not a problem with reading from the ready folder, since both do that. It does fail with a Java processing block as well as with JS, so it doesn't seem correlated with just JS. Actually, this bit is kind of crazy, because the passthrough service is just a Java implementation of IBatchEnrichmentModule, so it should act the same as my Java version; retesting to make sure.
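Observation 1 (output always cut off at a multiple of 100) combined with the NPE being thrown from the mapper's `cleanup()` path is consistent with a buffered-batch write where the final, partial flush fails. The sketch below is a hypothetical illustration of that failure mode, not Aleph2 code: `BATCH_SIZE`, `emit`, and `flush` are invented names, and the only details taken from the issue are the interval of 100 and the exception during the cleanup-time flush.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: records are buffered and written in fixed-size
// batches. If the cleanup-time flush of the final, partial batch throws
// (e.g. an NPE), the tail of the input is silently dropped, leaving the
// output count on a multiple of the batch size.
public class BatchFlushSketch {
    static final int BATCH_SIZE = 100;

    static int processedCount = 0;
    static List<String> buffer = new ArrayList<>();

    // Buffer a record; flush whenever a full batch accumulates.
    static void emit(String record) {
        buffer.add(record);
        if (buffer.size() >= BATCH_SIZE) {
            flush(false);
        }
    }

    // flush(true) models the cleanup()-time flush that fails.
    static void flush(boolean failLikeCleanup) {
        if (failLikeCleanup && !buffer.isEmpty()) {
            throw new NullPointerException("simulated NPE during final flush");
        }
        processedCount += buffer.size();
        buffer.clear();
    }

    public static void main(String[] args) {
        int input = 1742; // any count that is not a multiple of 100
        for (int i = 0; i < input; i++) {
            emit("record-" + i);
        }
        try {
            flush(true); // final partial-batch flush from cleanup() fails
        } catch (NullPointerException e) {
            // The task surfaces the exception; the buffered tail is lost.
        }
        System.out.println("in:" + input + " out:" + processedCount);
    }
}
```

Running this prints `in:1742 out:1700`: the 42 records buffered after the last full batch are dropped, mirroring the "breaks on an interval of 100" symptom.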
cburch commented 8 years ago

The passthrough module does eventually fail too, but I'm not sure whether that is just a consequence of the job falling behind (I have a cron job dropping new data in every 3 minutes, and since the passthrough just emits everything, the index grows by 500k records every 3 minutes until the job can no longer keep up).

It took approximately 10 runs before I saw it output fewer records than it should; for any other type of job (specifically one that replaces previous data) it takes about 3 runs on average.