archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0

Add ExtractPopularImages, WriteGEXF, and WriteGraphML to Python. #466

Closed by ruebot 4 years ago

ruebot commented 4 years ago

GitHub issue(s): #409

What does this Pull Request do?

Add ExtractPopularImages, WriteGEXF, and WriteGraphML to Python.

How should this be tested?

[nruest@bomba:/tmp]$ wc -l scala-example.gexf ~/Projects/au/sample-data/notebook-testing/test.gexf
  29186 scala-example.gexf
  29186 /home/nruest/Projects/au/sample-data/notebook-testing/test.gexf
  58372 total
[nruest@bomba:/tmp]$ wc -l scala-example.graphml ~/Projects/au/sample-data/notebook-testing/test.graphml
  33911 scala-example.graphml
  33911 /home/nruest/Projects/au/sample-data/notebook-testing/test.graphml
  67822 total
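
The `wc -l` comparison above checks that the Scala and Python outputs have the same number of lines. A minimal sketch of the same check in plain Python follows; `count_lines` and `same_line_count` are hypothetical helper names for illustration and are not part of the toolkit. Matching line counts are a quick sanity check, not proof that the two files are identical.

```python
def count_lines(path):
    """Count newline characters in a file, which is what `wc -l` reports."""
    total = 0
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large graph files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            total += chunk.count(b"\n")
    return total


def same_line_count(scala_output, python_output):
    """Return True when both derivative files have the same number of lines."""
    return count_lines(scala_output) == count_lines(python_output)
```

For example, `same_line_count("scala-example.gexf", "test.gexf")` would be expected to return `True` for the two GEXF files listed above, assuming both are in the working directory.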

codecov[bot] commented 4 years ago

Codecov Report

Merging #466 into master will not change coverage. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #466   +/-   ##
=======================================
  Coverage   75.86%   75.86%           
=======================================
  Files          49       49           
  Lines        1442     1442           
  Branches      279      279           
=======================================
  Hits         1094     1094           
  Misses        218      218           
  Partials      130      130           

ruebot commented 4 years ago

Here's the documentation update PR for when the time comes.

ruebot commented 4 years ago

Last item here is Extract Entities. We only have an RDD implementation of that. I'd argue that NER is something folks down the chain can do with the derivatives, and we don't need to implement it on the Scala DataFrame or Python side.

...which means we could go as far as completely removing all the NER functionality. Though I'm not 100% wedded to that idea.

ruebot commented 4 years ago

Once we merge, I'll move those two notebooks over to the notebooks repo.