ruebot closed this issue 2 years ago
Looks like this can affect older versions of aut as well, so it's not necessarily a Java 11 and Spark 3+ issue with the RecordLoader.
Tested on the issue-494 branch with Java 11 and Spark 3.0.3 as part of #533 testing (with only 8G of RAM!!):
$ export JAVA_HOME=/usr/lib/jvm/java-1.11.0-openjdk-amd64
$ ./spark-shell --master local[8] --driver-memory 8g --jars /home/ruestn/aut/target/aut-0.91.1-SNAPSHOT-fatjar.jar
import io.archivesunleashed._
import io.archivesunleashed.udfs._
import io.archivesunleashed.app._
sc.setLogLevel("INFO")
// Web archive collection; web pages.
val webpages = RecordLoader.loadArchives("/tuna1/scratch/nruest/auk_collection_testing/11989/warcs/*", sc)
.webpages()
// Web archive collection; web graph.
val webgraph = RecordLoader.loadArchives("/tuna1/scratch/nruest/auk_collection_testing/11989/warcs/*", sc)
.webgraph()
// Domains file.
webpages.groupBy(removePrefixWWW(extractDomain($"Url")).alias("url"))
.count()
.sort($"count".desc)
.write.csv("/tuna1/scratch/nruest/auk_collection_testing/11989/all-domains")
// Full-text.
webpages.select($"crawl_date", removePrefixWWW(extractDomain($"url")).alias("domain"), $"url", $"content")
.write.csv("/tuna1/scratch/nruest/auk_collection_testing/11989k/full-text")
// GraphML
val graph = webgraph.groupBy(
$"crawl_date",
removePrefixWWW(extractDomain($"src")).as("src_domain"),
removePrefixWWW(extractDomain($"dest")).as("dest_domain"))
.count()
.filter(!($"dest_domain"===""))
.filter(!($"src_domain"===""))
.filter($"count" > 5)
.orderBy(desc("count"))
WriteGraphML(graph.collect(), "/tuna1/scratch/nruest/auk_collection_testing/11989/example.graphml")
Everything completed successfully.
Describe the bug
Job crashes on
java.lang.NumberFormatException
To Reproduce
Any of the three auk derivatives commands on these three ARCs:
Expected behavior
We should handle this exception better: catch it and move on, or something smarter. The entire process shouldn't fail because of a single malformed record.
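A minimal sketch of the kind of defensive parsing suggested above, in plain Scala. The `safeParseYear` helper and its bounds are hypothetical illustrations, not part of aut's actual API; the point is that a malformed numeric field yields `None` instead of throwing `java.lang.NumberFormatException`, so the surrounding job can skip the record rather than crash.

```scala
import scala.util.Try

// Hypothetical helper: parse a numeric crawl-date field defensively.
// Try catches the NumberFormatException thrown by toInt on bad input,
// and the filter rejects implausible years (bounds chosen for illustration).
def safeParseYear(raw: String): Option[Int] =
  Try(raw.trim.toInt).toOption.filter(y => y >= 1990 && y <= 2100)

// Well-formed input parses; malformed input is dropped instead of crashing.
val good = safeParseYear("2021") // Some(2021)
val bad  = safeParseYear("20xx") // None
```

A record loader built this way can `flatMap` over `safeParseYear`, silently dropping (or logging) records with unparseable fields while the rest of the job proceeds.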
Environment information