Open ravindrabajpai opened 2 years ago
I tried a workaround: commenting out this line in StatusUpdateSolrTransformer - //Constants.storage.CONTENTHASH -> ContentHash.fetchHash(data.fetchedData.getContent)
And it works for me for now.
But my hunch is that there is a better solution and maybe I am missing something in the configurations.
Hi @ravindrabajpai thanks for reporting the bug!
I see the Content Hash object in the sparkler-core code, but do not see it getting injected into Solr.
The content signature cannot be calculated at the inject phase, because it is based on the webpage content rather than the URL.
Then why is it expected while fetching?
I suspect it is expected after fetching but before indexing.
Can you check that the webpage content was actually fetched?
Hi @lewismc
Thanks for replying. Yes, I could see that the webpage content was fetched correctly. I injected 2 URLs in total (the additional one being edition.cnn.com), and both were fetched and stored correctly in Solr. There were 300+ docs for both sources (websites).
All the steps I followed are listed here - https://github.com/ravindrabajpai/ana/blob/main/ground_zero
thanks.
Issue Description
I am trying to build and run Sparkler from source, following the example given in the README. I have injected a URL and it is visible in Solr. The problem occurs while crawling, where I see the error below -
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3) (ip-172-31-39-218.ap-southeast-1.compute.internal executor driver): org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: ERROR: [doc=BB1D50CFC203F0FF85208DD1A4D48EB99DA051BCBDF6279E3DC62BDE6FFFA05C] unknown field 'contenthash'
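The "unknown field" message means that the crawldb schema Solr is serving has no field named contenthash. A minimal sketch of how this could be checked, and the field added, through Solr's Schema API follows. The core URL is the one from the error above. The field type (string) and the indexed/stored flags are assumptions, not values confirmed against the Sparkler schema, and the helper function names are hypothetical. Modifying the schema this way only works when the core uses a managed schema.

```shell
#!/bin/sh
# Core URL exactly as it appears in the error message.
SOLR_CORE_URL="http://localhost:8983/solr/crawldb"

# ASSUMPTION: the "string" type and the indexed/stored flags are guesses,
# not taken from Sparkler's own schema definition.
ADD_FIELD_PAYLOAD='{"add-field":{"name":"contenthash","type":"string","indexed":true,"stored":true}}'

# Ask Solr for the field definition; Solr responds with an error
# if the schema has no such field.
check_contenthash_field() {
  curl -s "$SOLR_CORE_URL/schema/fields/contenthash"
}

# Add the field through the Schema API (requires a managed schema).
add_contenthash_field() {
  curl -s -X POST -H 'Content-type:application/json' \
    --data-binary "$ADD_FIELD_PAYLOAD" "$SOLR_CORE_URL/schema"
}
```

Running check_contenthash_field first shows whether the field is really missing; add_contenthash_field is only relevant if it is.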
How to reproduce it
java -Xms1g -cp /home/ubuntu/sparkler/sparkler-core/build/conf:$(echo /home/ubuntu/sparkler/sparkler-core/build/sparkler-app-0.5.24-SNAPSHOT/lib/*.jar | tr ' ' ':') -Dpf4j.pluginsDir=/home/ubuntu/sparkler/sparkler-core/build/plugins edu.usc.irds.sparkler.Main inject -id sjob-1 -su https://news.bbc.co.uk
java -Xms1g -cp /home/ubuntu/sparkler/sparkler-core/build/conf:$(echo /home/ubuntu/sparkler/sparkler-core/build/sparkler-app-0.5.24-SNAPSHOT/lib/*.jar | tr ' ' ':') -Dpf4j.pluginsDir=/home/ubuntu/sparkler/sparkler-core/build/plugins edu.usc.irds.sparkler.Main crawl -id sjob-1 -tn 10 -i 1
Additional changes: I have modified Crawler.scala and added the following code at line 171
conf.set("spark.io.compression.codec", "snappy")
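A setting like this can, in principle, be supplied without patching the source. A SparkConf created with its default constructor picks up any JVM system property whose name starts with "spark.". The sketch below assumes that Sparkler builds its SparkConf that way and does not overwrite the key afterwards, which has not been verified here. The variable and function names are hypothetical; the paths and arguments are the ones from the crawl command in this report.

```shell
#!/bin/sh
# Build directory used in the commands from this report.
SPARKLER_HOME="/home/ubuntu/sparkler/sparkler-core/build"

# Same classpath construction as the original crawl command.
CLASSPATH_VALUE="$SPARKLER_HOME/conf:$(echo "$SPARKLER_HOME"/sparkler-app-0.5.24-SNAPSHOT/lib/*.jar | tr ' ' ':')"

# ASSUMPTION: Sparkler's SparkConf loads "spark.*" system properties,
# so this replaces the conf.set(...) edit in Crawler.scala.
SPARK_OPTS="-Dspark.io.compression.codec=snappy"

# Run the crawl with the extra system property on the JVM command line.
run_crawl() {
  java -Xms1g $SPARK_OPTS -cp "$CLASSPATH_VALUE" \
    -Dpf4j.pluginsDir="$SPARKLER_HOME/plugins" \
    edu.usc.irds.sparkler.Main crawl -id sjob-1 -tn 10 -i 1
}
```

Calling run_crawl starts the same crawl as before; any value set explicitly in code would still take precedence over the system property.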
Please let me know how to pass spark-conf in the runtime configurations so that I can avoid doing this.
Environment and Version Information
Please indicate relevant versions, including, if relevant:
Java Version - openjdk version "1.8.0_312", OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~20.04-b07), OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)
Spark Version - 3.0.3, Scala version 2.12.10
Operating System name and version - AWS Instance based on 20.04.1-Ubuntu
Solr - 8.5.0 (in local mode)
I see the Content Hash object in the sparkler-core code, but do not see it getting injected into Solr, so why is it expected while fetching? I see the same error in solr.log.
StackTrace from sparkler-crawl -