archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0

Implement Scala Matchbox UDFs in Python. #463

Closed by ruebot 4 years ago

ruebot commented 4 years ago

GitHub issue(s):

What does this Pull Request do?

Implement Scala Matchbox UDFs in Python.

How should this be tested?

Additional Notes:

  1. I made a number of structural changes to the Scala side. @lintool, please let me know if you take strong issue with anything.

  2. I'm going to punt on the hasX filters for now and loop back around to them later. I hit a wall trying to get them to run in PySpark, and part of me is tempted to say that we should just go with the natural PySpark (Python) implementation here. Basically:

from aut import *
from pyspark.sql.functions import col

# Keep only pages whose language is in this list.
languages = ["es", "fr"]

WebArchive(sc, sqlContext, "/data")\
  .webpages()\
  .filter(col("language").isin(languages))\
  .select("crawl_date")\
  .show(10, True)

or

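from aut import *
from pyspark.sql.functions import col

# Keep only pages whose language is NOT in this list (note the ~).
languages = ["es", "fr"]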

WebArchive(sc, sqlContext, "/data")\
  .webpages()\
  .filter(~col("language").isin(languages))\
  .select("crawl_date")\
  .show(10, True)

Instead of

from aut import *

languages = ["es", "fr"]

WebArchive(sc, sqlContext, "/data")\
  .webpages()\
  .filter(Udf.has_language("language", languages))\
  .select("crawl_date")\
  .show(10, True)

or

from aut import *

languages = ["es", "fr"]

WebArchive(sc, sqlContext, "/data")\
  .webpages()\
  .filter(~Udf.has_language("language", languages))\
  .select("crawl_date")\
  .show(10, True)

This is basically the argument I made in #425.

codecov[bot] commented 4 years ago

Codecov Report

Merging #463 into master will decrease coverage by 0.05%. The diff coverage is 94.54%.

@@            Coverage Diff             @@
##           master     #463      +/-   ##
==========================================
- Coverage   76.49%   76.43%   -0.06%     
==========================================
  Files          49       50       +1     
  Lines        1459     1460       +1     
  Branches      279      279              
==========================================
  Hits         1116     1116              
- Misses        213      214       +1     
  Partials      130      130              
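The summary line reports a 0.05% decrease while the table shows -0.06%, and both figures can be reproduced from the Hits and Lines counts above. The following is a minimal sketch, assuming coverage is hits divided by lines and that the table truncates each percentage to two decimals. That truncation rule is inferred from the numbers in this report, not taken from Codecov's documentation, and the helper names are illustrative only.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

TWO_PLACES = Decimal("0.01")


def coverage_percent(hits: int, lines: int) -> Decimal:
    """Coverage as a percentage, kept at full Decimal precision."""
    return Decimal(hits) * 100 / Decimal(lines)


def truncate(value: Decimal) -> Decimal:
    """Drop everything past two decimals (assumed table behaviour)."""
    return value.quantize(TWO_PLACES, rounding=ROUND_DOWN)


# Counts copied from the coverage table: Hits / Lines.
master = coverage_percent(1116, 1459)
pr = coverage_percent(1116, 1460)

# Per-column percentages as they appear in the table.
print(truncate(master))  # 76.49
print(truncate(pr))      # 76.43

# Delta of the truncated percentages, matching the table.
print(truncate(pr) - truncate(master))  # -0.06

# Delta of the exact percentages, rounded, matching the summary line.
print((pr - master).quantize(TWO_PLACES, rounding=ROUND_HALF_UP))  # -0.05
```

Under these assumptions the two deltas differ only because one is computed before truncation and the other after it.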

ruebot commented 4 years ago

Yeah, I could clean that notebook up, and toss it in https://github.com/archivesunleashed/notebooks when we're done.