WebCuratorTool / webcurator

The root of the webcurator tool project, containing all modules needed to run a fully functional webcurator tool.
Apache License 2.0

V3.0/fix prod issues #28

Closed leefrank9527 closed 3 years ago

leefrank9527 commented 3 years ago

Fixed issues from production environment of NLNZ.

hannakoppelaar commented 3 years ago

I'm done reviewing the code, but I'm having trouble running the new binaries. Apparently, this webapp now includes servlet-api-2.5-20081211.jar in the lib directory, which contains an old servlet API, and (on my machine at least) it gets loaded before the newer one that we actually rely on. As a consequence, the webapp doesn't run. If I build master, everything's fine: servlet-api-2.5-20081211.jar doesn't get added.

leefrank9527 commented 3 years ago

@hannakoppelaar I double-checked webcurator-webapp-3.0.0-SNAPSHOT.war and the Gradle dependencies, and didn't find servlet-api-2.5-20081211.jar. Would you mind printing the Gradle dependencies of your environment with the command ./gradlew dependencies > webapp.dep ?

hannakoppelaar commented 3 years ago

@leefrank9527 Sure, my output of ./gradlew dependencies shows the 2.5 servlet-api as a dependency of core:

...
+--- org.webcurator:webcurator-core:3.0.0-SNAPSHOT
|    +--- org.archive:heritrix:1.14.2-webcuratortool-2.0.2
|    |    +--- net.htmlparser.jericho:jericho-html:2.6.1
|    |    +--- com.sleepycat:je:3.3.62 -> 4.1.6
|    |    +--- commons-httpclient:commons-httpclient:3.1.1-heritrix-1.14.2-webcuratortool-2.0.1 (*)
|    |    +--- commons-io:commons-io:1.3.1 -> 2.4
|    |    +--- commons-lang:commons-lang:2.3 -> 2.6
|    |    +--- commons-logging:commons-logging:1.0.4 -> 1.2
|    |    +--- commons-net:commons-net:1.4.1 -> 2.0
|    |    +--- commons-codec:commons-codec:1.3 -> 1.11
|    |    +--- dnsjava:dnsjava:2.0.6
|    |    +--- org.mortbay.jetty:jetty:4.2.12 -> 6.1.26
|    |    |    +--- org.mortbay.jetty:jetty-util:6.1.26
|    |    |    \--- org.mortbay.jetty:servlet-api:2.5-20081211
...

Though, strangely, if I run the command in the webcurator-core dir the module does not show up.

hannakoppelaar commented 3 years ago

Interestingly, if I comment out this line in build.gradle (in webapp)

implementation 'org.archive.heritrix:heritrix-engine:3.4.0-SNAPSHOT'

the jetty servlet-api dependency disappears.

leefrank9527 commented 3 years ago

@hannakoppelaar 'org.archive.heritrix:heritrix-engine:3.4.0-SNAPSHOT' is used to validate the profile; I added it to fix an issue. I couldn't find the lib in webcurator-core either. The dependencies in my environment are quite different from yours. My dependencies of webcurator-core:

+--- org.archive:heritrix:1.14.2-webcuratortool-2.0.2
|    +--- net.htmlparser.jericho:jericho-html:2.6.1
|    +--- com.sleepycat:je:3.3.62 -> 3.3.74
|    +--- commons-httpclient:commons-httpclient:3.1.1-heritrix-1.14.2-webcuratortool-2.0.1
|    |    +--- commons-cli:commons-cli:1.1
|    |    +--- commons-logging:commons-logging:1.0.4
|    |    +--- commons-codec:commons-codec:1.2 -> 1.10
|    |    \--- com.sleepycat:je:3.3.62 -> 3.3.74
|    +--- commons-io:commons-io:1.3.1
|    +--- commons-lang:commons-lang:2.3
|    +--- commons-logging:commons-logging:1.0.4
|    +--- commons-net:commons-net:1.4.1
|    |    \--- oro:oro:2.0.8
|    +--- commons-codec:commons-codec:1.3 -> 1.10
|    +--- dnsjava:dnsjava:2.0.6
|    +--- org.mortbay.jetty:jetty:4.2.12
|    +--- com.anotherbigidea:javaswf:CVS-SNAPSHOT-1
|    +--- com.lowagie:itext:1.2.3
|    +--- org.apache.ant:ant:1.7.1
|    |    \--- org.apache.ant:ant-launcher:1.7.1
|    +--- commons-collections:commons-collections:3.1
|    +--- commons-cli:commons-cli:1.0 -> 1.1
|    +--- it.unimi.dsi:mg4j:2.0.1
|    +--- fastutil:fastutil:5.0.9
|    +--- org.gnu.inet:libidn:0.6.5
|    +--- org.apache.mahout.jets3t:jets3t:0.6.1
|    \--- junit:junit:3.8.2

Is it possible to re-build and re-install webcurator-core before building webcurator-webapp?

BTW: I'm building with:

------------------------------------------------------------
Gradle 5.6
------------------------------------------------------------

Build time:   2019-08-14 21:05:25 UTC
Revision:     f0b9d60906c7b8c42cd6c61a39ae7b74767bb012

Kotlin:       1.3.41
Groovy:       2.5.4
Ant:          Apache Ant(TM) version 1.9.14 compiled on March 12 2019
JVM:          1.8.0_275 (Private Build 25.275-b01)
OS:           Linux 5.8.0-41-generic amd64

hannakoppelaar commented 3 years ago

@leefrank9527 Apparently, I had an old heritrix 3.4.0-SNAPSHOT in my m2 repo that triggered the inclusion of the old servlet api. After deleting it, the servlet 2.5 api is gone from the dependency tree.

This does bring us to another issue though: how is the heritrix 3.4.0 artefact supposed to be included by gradle? It's not in any of the configured repos, is it? Should it be added to the install_maven_dependencies script? Or is there a repo somewhere with Heritrix builds? Obviously, I know how to build Heritrix, but our build process should not rely on that knowledge ;)

leefrank9527 commented 3 years ago

@hannakoppelaar You are right, this is a potential issue. Gradle resolves dependencies transitively. Heritrix 3.4.0-SNAPSHOT is depended on by heritrix-engine, openwayback and other artifacts, so we can't simply drop it with an "exclude" declaration in build.gradle. Another problem is that Gradle will try to resolve dependencies from the local repository (.m2), so if there are old jars in .m2, it's hard for us to ignore them automatically.

To get a completely clean build, I think we could remove all artifacts in .m2 and initialise them from webcurator-legacy-lib-dependencies/. But if other projects share the .m2, that would be inconvenient.
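
A less drastic option than wiping .m2 might be to keep Gradle from consulting the local repository for the Heritrix 3 group at all. The fragment below is only a sketch and has not been tried against this build. It assumes the webapp's build.gradle declares both mavenCentral() and mavenLocal(), and that the Heritrix 3 artifacts under the group org.archive.heritrix can be resolved from Maven Central. Repository content filtering needs Gradle 5.1 or later.

```groovy
// Sketch for the webapp's build.gradle (assumed layout, not the project's actual file).
repositories {
    // Remote repository: Heritrix 3 artifacts would be resolved from here.
    mavenCentral()

    // Local ~/.m2 repository, still needed for the legacy libraries that are
    // installed from webcurator-legacy-lib-dependencies/.
    mavenLocal {
        content {
            // Never look in ~/.m2 for this group, so a stale
            // heritrix 3.4.0-SNAPSHOT left there is ignored.
            // The match is exact: the legacy 'org.archive' group is unaffected.
            excludeGroup 'org.archive.heritrix'
        }
    }
}
```

With a filter like this the rest of .m2 stays untouched, which avoids the inconvenience for other projects that share it.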

obrienben commented 3 years ago

@hannakoppelaar H3 now looks to be in Maven Central - https://mvnrepository.com/artifact/org.archive.heritrix/heritrix-engine. So it should be OK to pull it in via Gradle instead of webcurator-legacy-lib-dependencies, right?

hannakoppelaar commented 3 years ago

@obrienben yeah, I've now built it from scratch (starting with an empty m2 repo) using heritrix-engine:3.4.0-20190205 from maven central. I.e. I've changed the heritrix dependency in webapp's build.gradle to

implementation 'org.archive.heritrix:heritrix-engine:3.4.0-20190205'

And it builds okay. The sad news is: I'm still getting the old jetty servlet api in my build. I'm using the same gradle version as @leefrank9527: 5.6 with JDK 1.8.0_275.
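
One possible workaround is to exclude the old Jetty servlet API from every configuration of the webapp. This is a sketch only, not something tested in this thread. The coordinates org.mortbay.jetty:servlet-api come from the dependency tree posted above. It assumes that nothing at runtime needs that particular jar, because the newer servlet API is already provided another way.

```groovy
// Sketch for the webapp's build.gradle (untested assumption, see lead-in).
// To see which dependency path pulls the jar in, Gradle's built-in
// dependencyInsight task can be run from the webapp module directory:
//   ./gradlew dependencyInsight --dependency servlet-api --configuration runtimeClasspath

configurations.all {
    // Drop the Servlet 2.5 API that arrives transitively via org.mortbay.jetty:jetty:6.1.26,
    // whichever dependency path introduces it.
    exclude group: 'org.mortbay.jetty', module: 'servlet-api'
}
```

A configuration-wide exclude is used here rather than one on a single dependency, because Gradle only drops a transitive module when every path that leads to it excludes it.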