USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0
412 stars 143 forks

Mvn2sbt #232

Closed buggtb closed 2 years ago

buggtb commented 3 years ago

This works but still needs some finishing off. I'm opening it here for visibility: we have a working SBT build now.

Why move from Maven to SBT, you ask? Because then IDEs can load the code and work properly with the Java and Scala mix, rather than the crazy state we were in with the old build.

lewismc commented 3 years ago

@buggtb nice work. What finishing off does it need? I will test this out when I get some time to focus on reviewing it properly.

lewismc commented 2 years ago

Looks like this could be merged @buggtb and then we could open subsequent issues to address shortcomings. WDYT?

buggtb commented 2 years ago

I have the world's largest PR coming at some point soon, @lewismc. It covers loads of new functionality, features and support: we've been using Sparkler on a fork but have permission to push it all upstream. I'm going to add to this PR over the next week or two and drop in SBT support, some new plugins, better spark-submit support, file writing, Databricks integration hooks and other stuff.

buggtb commented 2 years ago

Okay, I got bored quickly, so the rest of this stuff is landing now. It looks like mush, but we use it in production, so whilst it may not have had widespread testing, it is usable. I just need to document all the new stuff :)

lewismc commented 2 years ago

holy sparkler batman

lewismc commented 2 years ago

Ready for review? @buggtb

buggtb commented 2 years ago

haha plausibly, I fixed the merge issues, should be good.

lewismc commented 2 years ago

@buggtb can you please update the README or point me at some documentation for the Gradle build commands?

% gradle tasks

> Task :tasks

------------------------------------------------------------
Tasks runnable from root project
------------------------------------------------------------

Build Setup tasks
-----------------
init - Initializes a new Gradle build.
wrapper - Generates Gradle wrapper files.

Help tasks
----------
buildEnvironment - Displays all buildscript dependencies declared in root project 'sparkler-core'.
components - Displays the components produced by root project 'sparkler-core'. [incubating]
dependencies - Displays all dependencies declared in root project 'sparkler-core'.
dependencyInsight - Displays the insight into a specific dependency in root project 'sparkler-core'.
dependentComponents - Displays the dependent components of components in root project 'sparkler-core'. [incubating]
help - Displays a help message.
model - Displays the configuration model of root project 'sparkler-core'. [incubating]
outgoingVariants - Displays the outgoing variants of root project 'sparkler-core'.
projects - Displays the sub-projects of root project 'sparkler-core'.
properties - Displays the properties of root project 'sparkler-core'.
tasks - Displays the tasks runnable from root project 'sparkler-core'.

To see all tasks and more detail, run gradle tasks --all

To see more detail about a task, run gradle help --task <task>

BUILD SUCCESSFUL in 469ms
1 actionable task: 1 executed

Out of curiosity, have you ever used the Kotlin syntax instead of the legacy Groovy syntax?

buggtb commented 2 years ago

Just pushed an update with a few fixes, some updated internal libraries, and a revamped default crawler that uses the more flexible Apache HTTP Client. Both the Chrome crawler and the default crawler now support an HTTP proxy (we tested with, and use, ProxyMesh).
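For readers unfamiliar with how a proxy is wired into Apache HTTP Client, here is a minimal sketch. It assumes the HttpClient 4.x API (org.apache.httpcomponents:httpclient) is on the classpath; the object name, proxy host, port, and timeouts are placeholders for illustration, and this is not the fetcher code from this PR. Proxy authentication, which a hosted proxy service will normally require, is left out.

```scala
import org.apache.http.HttpHost
import org.apache.http.client.config.RequestConfig
import org.apache.http.client.methods.HttpGet
import org.apache.http.impl.client.HttpClients
import org.apache.http.util.EntityUtils

// Hypothetical example object; not part of Sparkler.
object ProxyFetchSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder proxy endpoint: substitute your provider's host and port.
    val proxy = new HttpHost("proxy.example.com", 8080, "http")

    // Route every request made with this config through the proxy.
    val requestConfig = RequestConfig.custom()
      .setProxy(proxy)
      .setConnectTimeout(5000)   // milliseconds, arbitrary value
      .setSocketTimeout(10000)   // milliseconds, arbitrary value
      .build()

    val client = HttpClients.custom()
      .setDefaultRequestConfig(requestConfig)
      .build()

    try {
      val response = client.execute(new HttpGet("http://example.com/"))
      try {
        val status = response.getStatusLine.getStatusCode
        val body = EntityUtils.toString(response.getEntity, "UTF-8")
        println(s"status=$status chars=${body.length}")
      } finally response.close()
    } finally client.close()
  }
}
```

Setting the proxy on the default request config keeps it in one place, so individual requests do not need to know about it.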

lewismc commented 2 years ago

@buggtb do you have a README for building now?

buggtb commented 2 years ago

Inside sparkler-core, just run sbt package assembly and it will create the same build folder as the old Maven build did.

Once we get this merged, we'll move our development directly onto this repo with the usual branching strategy, so that we don't have PRs like this again. I've also got rotating proxy pool support, fixes for dodgy SSL sites, and other stuff that all needs to go in once this is done.
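For context, a minimal sketch of an sbt definition for a mixed Java and Scala multi-module build is shown below. It is illustrative only and is not the build file in this PR: the module names, Scala version, plugin version, and jar name are assumptions.

```scala
// build.sbt (illustrative sketch; names and versions are assumptions)
//
// Assumes the sbt-assembly plugin is enabled in project/plugins.sbt, e.g.:
//   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

ThisBuild / scalaVersion := "2.12.15" // assumed Scala version

// Hypothetical API module containing both Java and Scala sources.
lazy val api = (project in file("sparkler-api"))
  .settings(
    name := "sparkler-api",
    // Compile Java and Scala together so each side can reference the other.
    compileOrder := CompileOrder.Mixed
  )

// Hypothetical application module that is packaged as a fat jar.
lazy val app = (project in file("sparkler-app"))
  .dependsOn(api)
  .settings(
    name := "sparkler-app",
    // Output file name used by the `assembly` task.
    assembly / assemblyJarName := "sparkler-app-assembly.jar"
  )

// Root project: tasks such as `package` run across all aggregated modules.
lazy val root = (project in file("."))
  .aggregate(api, app)
```

With a layout like this, sbt package builds the per-module jars and sbt assembly builds the fat jar for any module that has the plugin's settings applied.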

lewismc commented 2 years ago

@buggtb I can't get a clean build because Javadoc generation fails... with quite a few issues. Do you also experience this behaviour?

lewismc commented 2 years ago

I mean, I can jump in and start updating all of the Javadocs if you want. That's not an issue. I just want to know whether you have a clean build or not.