USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

data from JS pages is not returned #174

Closed chaitra-rs closed 4 years ago

chaitra-rs commented 4 years ago

I am trying to get data from a page that relies on JavaScript. The page shows the data, but the output file does not contain it.

Please describe your issue, along with:

How to reproduce it

Try injecting the page https://www.ibm.com/support/pages/node/884036. The data under Document Information is missing from the part-00000 file.

Environment and Version Information

Please indicate relevant versions, including, if relevant:

I followed the steps in the documentation (using Docker). The shell prompt shows sparkler@5c224e262222:/$

thammegowda commented 4 years ago

@chaitra-rs thanks for reporting this. By default, JavaScript execution is not enabled. (Why? JS execution is very, very slow.)

However, if you do need it, you need to enable a plugin.

https://github.com/USCDataScience/sparkler/blob/5c2201310623b70e6bf024e51e521eb4bffc4723/conf/sparkler-default.yaml#L102-L104

How? Locate sparkler-default.yaml inside the Docker container that Sparkler is using, and uncomment one of the fetcher plugins capable of executing JavaScript.

I don't know which one is best, since each has its pluses and minuses (I suggest trial and error for your use case).
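
For anyone who wants to script the uncommenting step described above instead of editing the file by hand, here is a minimal sketch. It assumes the plugin entries are commented out with a `#` at the start of the line, followed by the usual list indentation. The function name `uncomment_plugin` and the plugin name `fetcher-example` are placeholders made up for illustration. Check the linked lines of sparkler-default.yaml for the real plugin names.

```python
import re


def uncomment_plugin(yaml_text: str, plugin: str) -> str:
    """Return yaml_text with the commented-out list entry for `plugin` enabled.

    Assumption: the entry looks like "#  - <plugin>", i.e. a '#' in the first
    column followed by the indentation the list item should end up with.
    Lines that do not match this shape are left untouched.
    """
    pattern = re.compile(
        r"^#(?P<rest>[ \t]*-[ \t]*" + re.escape(plugin) + r"[ \t]*)$",
        re.MULTILINE,
    )
    # Drop only the leading '#', keeping the original indentation so the
    # entry lines up with the other items in the list.
    return pattern.sub(lambda m: m.group("rest"), yaml_text)
```

To use it, read the contents of sparkler-default.yaml into a string, call `uncomment_plugin(text, "<plugin name from the file>")`, write the result back, and re-run the crawl. Enabling one fetcher plugin at a time makes it easier to compare them.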