USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0
410 stars, 143 forks

Investigate pipeline frameworks #203

Open buggtb opened 3 years ago

buggtb commented 3 years ago

Would something like Apache Beam be a more modern way of doing the same Spark work, but in an engine-agnostic fashion?

This would make us less dependent on specific Spark versions, which is a jar/packaging PITA, and give users the option to run on a range of different engines: https://beam.apache.org/documentation/runners/capability-matrix/
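To make the "agnostic" idea concrete, here is a rough stdlib-only sketch of what decoupling the pipeline definition from the execution engine could look like. It is not Beam's API and not Sparkler code: `build_pipeline`, `run_sequential`, and `run_threaded` are hypothetical names invented for illustration, and the URL clean-up steps are just a stand-in for real crawl stages.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical types and names throughout; this is not Beam or Sparkler API.
# A "pipeline" here is just an ordered list of per-element functions.
Step = Callable[[str], str]


def build_pipeline() -> List[Step]:
    """Define the steps once, with no reference to any execution engine."""
    return [str.strip, str.lower, lambda url: url.rstrip("/")]


def apply_steps(steps: List[Step], item: str) -> str:
    """Run every step, in order, on a single element."""
    for step in steps:
        item = step(item)
    return item


def run_sequential(steps: List[Step], items: List[str]) -> List[str]:
    """'Runner' 1: process elements one by one in the calling thread."""
    return [apply_steps(steps, item) for item in items]


def run_threaded(steps: List[Step], items: List[str], workers: int = 4) -> List[str]:
    """'Runner' 2: same pipeline, executed on a thread pool (order preserved)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda item: apply_steps(steps, item), items))


if __name__ == "__main__":
    urls = [" HTTP://Example.com/ ", "http://example.org/a/"]
    pipeline = build_pipeline()
    print(run_sequential(pipeline, urls))
    print(run_threaded(pipeline, urls))
```

The point is only that the pipeline is written once and the runner is swapped at call time. A real framework would additionally handle distribution, serialization, and fault tolerance, which this sketch ignores.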

lewismc commented 3 years ago

I recently saw Liminal, which builds on Airflow. The Airflow programming model is very intuitive. I've never used Beam, so I cannot comment on it.