archivesunleashed / aut-docs

AUT documentation
https://aut.docs.archivesunleashed.org/

Broken Scripts on Filter-DF #77

Closed ianmilligan1 closed 4 years ago

ianmilligan1 commented 4 years ago

Two broken scripts.

This one:

import io.archivesunleashed._
import io.archivesunleashed.udfs._

val languages = Array("th","de","ht")

RecordLoader.loadArchives("/path/to/warcs",sc)
  .webpages()
  .select($"language", $"url", $"content")
  .filter($"language".isin(languages))

Leads to

org.apache.spark.sql.AnalysisException: cannot resolve '(`language` IN ([th,de]))' due to data type mismatch: Arguments must be same type but were: string != array<string>;;
'Filter language#59 IN ([th,de])
+- Project [language#59, url#56, content#60]
   +- LogicalRDD [crawl_date#55, url#56, mime_type_web_server#57, mime_type_tika#58, language#59, content#60], false

And on the same page, this Python script

from aut import *
from pyspark.sql.functions import col

urls = ["www.archive.org"]

WebArchive(sc, sqlContext, "/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz") \
  .all() \
  .select("url", "content") \
  .filter(~col("url").isin(urls)

leads to

  File "<ipython-input-4-e1e43f4bf7e2>", line 5
    WebArchive(sc, sqlContext, "/Users/ianmilligan1/dropbox/git/aut-resources/Sample-Data/*.gz")   .all()   .select("url", "content")   .filter(~col("url").isin(urls)
                                                                                                                                                                      ^
SyntaxError: unexpected EOF while parsing

ruebot commented 4 years ago

The first script should be using hasLanguages instead of isin. .filter(hasLanguages($"language", lit(languages))) should do it.
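
For context on the error itself: Spark's Scala Column.isin takes a variable-length argument list, so an Array passed directly is treated as one value of type array<string>, which is the mismatch the message reports. The sketch below is only a toy Python model of that behaviour. The isin function in it is hypothetical and is not part of Spark or aut. In Scala, the counterpart of the unpacking shown here would be isin(languages: _*).

```python
def isin(column_value, *candidates):
    """Toy stand-in for a varargs membership test; not Spark's implementation."""
    return any(column_value == candidate for candidate in candidates)


languages = ["th", "de", "ht"]

# The list arrives as ONE candidate, so a string is compared with a whole list.
print(isin("de", languages))    # False

# Unpacking passes each language as its own candidate.
print(isin("de", *languages))   # True
```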

The second script is missing a closing parenthesis at the end of the filter line.

.filter(~col("url").isin(urls))
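
The missing parenthesis can be confirmed without a Spark session. This is a minimal sketch using Python's built-in compile(), which only parses the source and never runs it, so neither pyspark nor aut has to be installed. The /path/to/warcs path and the parses helper are placeholders made up for this check.

```python
# Source of the reported script, with a placeholder path.
# "\\" inside the string literal is a single backslash (line continuation).
broken = '''WebArchive(sc, sqlContext, "/path/to/warcs") \\
  .all() \\
  .select("url", "content") \\
  .filter(~col("url").isin(urls)
'''

# Same script with the closing parenthesis appended to the filter line.
fixed = broken.rstrip("\n") + ")\n"


def parses(source):
    """Return True if the source is syntactically valid Python."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


print(parses(broken))  # False
print(parses(fixed))   # True
```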

ruebot commented 4 years ago

Resolved with https://github.com/archivesunleashed/aut-docs/commit/9c144978139ce38a8b847134b5f57341f9651e04