USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Error from server at http://localhost:8983/solr/crawldb: ERROR: [doc=<>] unknown field 'contenthash' #247

Open ravindrabajpai opened 2 years ago

ravindrabajpai commented 2 years ago

Issue Description

I am trying to build and run Sparkler from source, following the example given in the README. I have injected a URL and it is visible in Solr. The problem occurs while crawling, where I see the error below:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3) (ip-172-31-39-218.ap-southeast-1.compute.internal executor driver): org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'

How to reproduce it

  1. Clone the main branch with git.
  2. Build sparkler-core.
  3. Modify /home/ubuntu/sparkler/sparkler-core/build/conf/sparkler-default.yaml:
    crawldb.backend: solr  # "solr" is default until "elasticsearch" becomes usable.
    solr.uri: http://localhost:8983/solr/crawldb
  4. Run the following command to inject:
    java -Xms1g -cp /home/ubuntu/sparkler/sparkler-core/build/conf:$(echo /home/ubuntu/sparkler/sparkler-core/build/sparkler-app-0.5.24-SNAPSHOT/lib/*.jar | tr ' ' ':') -Dpf4j.pluginsDir=/home/ubuntu/sparkler/sparkler-core/build/plugins edu.usc.irds.sparkler.Main inject -id sjob-1 -su https://news.bbc.co.uk
  5. Run the following command to crawl:
    java -Xms1g -cp /home/ubuntu/sparkler/sparkler-core/build/conf:$(echo /home/ubuntu/sparkler/sparkler-core/build/sparkler-app-0.5.24-SNAPSHOT/lib/*.jar | tr ' ' ':') -Dpf4j.pluginsDir=/home/ubuntu/sparkler/sparkler-core/build/plugins edu.usc.irds.sparkler.Main crawl -id sjob-1 -tn 10 -i 1

Additional changes: I have modified Crawler.scala and added the following code at line 171: conf.set("spark.io.compression.codec", "snappy"). Please let me know how to pass Spark configuration through the runtime configuration so that I can avoid this code change.
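One possible way to avoid patching Crawler.scala, sketched here rather than confirmed for Sparkler: a Spark `SparkConf` created with its default constructor also loads any `spark.*` Java system properties, so the codec could be passed as a `-D` flag on the `java` command line. This only helps if Sparkler builds its `SparkConf` that way and does not overwrite the key afterwards, which is an assumption. The helper name `build_sparkler_command` is hypothetical; the paths and arguments are the ones from the steps above. The classpath uses the JVM's own `lib/*` wildcard in place of the `echo ... | tr` expansion.

```python
import shlex
from pathlib import PurePosixPath


def build_sparkler_command(build_dir, app_version, sparkler_args, spark_props=None):
    """Return the java argv for a Sparkler run, with optional spark.* overrides."""
    build = PurePosixPath(build_dir)
    classpath = ":".join([
        str(build / "conf"),
        # A trailing /* is expanded by the JVM itself to every jar in the directory.
        str(build / f"sparkler-app-{app_version}" / "lib" / "*"),
    ])
    cmd = ["java", "-Xms1g", "-cp", classpath]
    # Each spark.* key becomes a -D system property.
    # Assumption: Sparkler's SparkConf loads system properties and keeps them.
    for key, value in sorted((spark_props or {}).items()):
        cmd.append(f"-D{key}={value}")
    cmd.append(f"-Dpf4j.pluginsDir={build / 'plugins'}")
    cmd.append("edu.usc.irds.sparkler.Main")
    cmd.extend(sparkler_args)
    return cmd


crawl_cmd = build_sparkler_command(
    "/home/ubuntu/sparkler/sparkler-core/build",
    "0.5.24-SNAPSHOT",
    ["crawl", "-id", "sjob-1", "-tn", "10", "-i", "1"],
    spark_props={"spark.io.compression.codec": "snappy"},
)
# shlex.join quotes the wildcard so a shell does not expand it before the JVM sees it.
print(shlex.join(crawl_cmd))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run` directly. If the property has no effect, Sparkler is probably setting the value itself, and the code change remains necessary.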

Environment and Version Information

Please indicate relevant versions, including, if relevant:

I see the Content Hash object in the sparkler-core code, but I do not see it being injected into Solr, so why is it expected while fetching? I see the same error in solr.log:

2022-01-31 04:44:54.871 ERROR (qtp1984990929-17) [   x:crawldb] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:226)
        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:109)
        at org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:977)

Stack trace from the Sparkler crawl:

04:44:54.877 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] DEBUG org.apache.spark.storage.BlockManagerMaster - Updated info of block rdd_7_0
04:44:54.877 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] DEBUG org.apache.spark.storage.BlockManager - Told master about block rdd_7_0
04:44:54.880 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 3.0 (TID 3)
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:665)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:177)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:138)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:156)
    at edu.usc.irds.sparkler.storage.solr.SolrProxy.addResource(SolrProxy.scala:121)
    at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:158)
    at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:37)
    at scala.collection.Iterator.toStream(Iterator.scala:1417)
    at scala.collection.Iterator.toStream$(Iterator.scala:1416)
    at edu.usc.irds.sparkler.pipeline.FairFetcher.toStream(FairFetcher.scala:37)
    at scala.collection.TraversableOnce.toSeq(TraversableOnce.scala:336)
    at scala.collection.TraversableOnce.toSeq$(TraversableOnce.scala:336)
    at edu.usc.irds.sparkler.pipeline.FairFetcher.toSeq(FairFetcher.scala:37)
    at edu.usc.irds.sparkler.pipeline.Crawler.$anonfun$run$3(Crawler.scala:258)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
    at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1418)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1345)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1409)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1230)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
04:44:54.908 [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] DEBUG org.apache.spark.executor.ExecutorMetricsPoller - removing (3, 0) from stageTCMP

ravindrabajpai commented 2 years ago

I tried a workaround: commenting out this line in StatusUpdateSolrTransformer - //Constants.storage.CONTENTHASH -> ContentHash.fetchHash(data.fetchedData.getContent)

And it works for me for now.

But my hunch is that there is a better solution and maybe I am missing something in the configurations.
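A configuration-side alternative to editing the transformer, offered as a sketch and not as the project's confirmed fix: declare the missing field in the crawldb core's schema through Solr's Schema API. This assumes the core uses a managed (mutable) schema and that a `string` field type exists in it and suits the hash; neither is confirmed in this thread. The helper name `add_field_request` is hypothetical. The snippet only builds and prints the request; sending it is left commented out.

```python
import json
import urllib.request

SOLR_CORE_URL = "http://localhost:8983/solr/crawldb"  # solr.uri from the report


def add_field_request(core_url, name, field_type="string"):
    """Build (but do not send) a Schema API request that adds one field."""
    payload = {
        "add-field": {
            "name": name,
            "type": field_type,  # assumption: this type is defined in the core's schema
            "indexed": True,
            "stored": True,
        }
    }
    return urllib.request.Request(
        url=f"{core_url}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = add_field_request(SOLR_CORE_URL, "contenthash")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To apply it against a running Solr, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

If the core was created from an older copy of the Sparkler Solr configuration, recreating it from the configuration shipped with the checked-out source may be the cleaner route, since that keeps the schema in step with the code.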

lewismc commented 2 years ago

Hi @ravindrabajpai thanks for reporting the bug!

> I see the Content Hash object in the sparkler-core code, but do not see it getting injected in the solr,

The content signature cannot be calculated in the inject phase because it is based on the webpage content rather than the URL.

> then why it is expected while fetching.

I suspect it is expected 'after' fetching but before indexing.

> But my hunch is that there is a better solution and maybe I am missing something in the configurations.

Can you check that the webpage content was actually fetched?
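One way to run that check, as a rough sketch: ask Solr how many documents of the crawl job are marked as fetched. The field names `crawl_id` and `status` and the value `FETCHED` are assumptions about the crawldb schema and should be verified against the core before relying on the result. The helper name `fetched_count_url` is hypothetical.

```python
from urllib.parse import urlencode

SOLR_CORE_URL = "http://localhost:8983/solr/crawldb"  # solr.uri from the report


def fetched_count_url(core_url, crawl_id):
    """Build a /select URL that counts fetched documents for one crawl job."""
    params = {
        "q": f'crawl_id:"{crawl_id}"',  # assumed field holding the job id
        "fq": "status:FETCHED",         # assumed status field and value
        "rows": 0,                      # only the count (numFound) is needed
        "wt": "json",
    }
    return f"{core_url}/select?{urlencode(params)}"


print(fetched_count_url(SOLR_CORE_URL, "sjob-1"))
```

Opening the printed URL in a browser or with curl returns a JSON response whose `response.numFound` gives the count. A value of zero after a crawl would suggest the pages were never fetched or that the assumed field names are wrong.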

ravindrabajpai commented 2 years ago

Hi @lewismc

Thanks for replying. Yes, I could see that the webpage content was fetched correctly. I injected 2 URLs in total (the additional one being edition.cnn.com), and both were fetched and stored correctly in Solr. There were about 300+ docs for both sources (websites).

All the steps I followed are here: https://github.com/ravindrabajpai/ana/blob/main/ground_zero

Thanks.