archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0

update for 'src' column #424

Closed SinghGursimran closed 4 years ago

SinghGursimran commented 4 years ago

update for 'src' column

#418

For Testing:


```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

// Keep only rows whose src matches one of the URL patterns.
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webgraph()
  .select($"src")
  .keepUrlPatternsDF(Set(".*index.*".r))
  .show(10, false)

// Discard rows whose src matches one of the URL patterns.
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webgraph()
  .select($"src")
  .discardUrlPatternsDF(Set(".*images.*".r))
  .show(10, false)

// Keep only rows whose src is one of the listed URLs.
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webgraph()
  .select($"src")
  .keepUrlsDF(Set("http://www.archive.org/", "http://www.archive.org/index.php"))
  .show(10, false)

// Discard rows whose src is one of the listed URLs.
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webgraph()
  .select($"src")
  .discardUrlsDF(Set("http://www.archive.org/", "http://www.archive.org/index.php"))
  .show(10, false)

// Keep only rows whose src belongs to one of the listed domains (image graph).
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .imagegraph()
  .select($"src")
  .keepDomainsDF(Set("www.archive.org"))
  .show(10, false)

// Discard rows whose src belongs to one of the listed domains.
RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webgraph()
  .select($"src")
  .discardDomainsDF(Set("www.archive.org"))
  .show(10, false)
```

codecov[bot] commented 4 years ago

Codecov Report

Merging #424 into master will not change coverage. The diff coverage is 100%.

```diff
@@           Coverage Diff           @@
##           master     #424   +/-   ##
=======================================
  Coverage   78.15%   78.15%
=======================================
  Files          41       41
  Lines        1584     1584
  Branches      299      299
=======================================
  Hits         1238     1238
  Misses        218      218
  Partials      128      128
```

ruebot commented 4 years ago

@SinghGursimran nice! I did a hasColumn function for a similar solution in twut. Can we get a test update too?

@lintool @ianmilligan1 do either of you see a use case for filtering on dest or image_url? Or are src and url good enough here? If we add dest or image_url, we'd probably need to change the implementation to pass the column name as well.
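
To make the "pass the column name" idea concrete, here is a rough sketch in plain Scala. It is not the aut or twut implementation: rows are modelled as a hypothetical `Map[String, String]` instead of a Spark DataFrame, the object and method signatures are made up for illustration, and it assumes a pattern has to match the whole value.

```scala
import scala.util.matching.Regex

object ColumnFilterSketch {
  // Hypothetical stand-in for a DataFrame row: column name -> value.
  type Row = Map[String, String]

  // True when every row carries the named column (vacuously true for no rows).
  def hasColumn(rows: Seq[Row], column: String): Boolean =
    rows.forall(_.contains(column))

  // Assumption: a value matches when any pattern matches the whole string.
  private def matchesAny(value: String, patterns: Set[Regex]): Boolean =
    patterns.exists(_.pattern.matcher(value).matches())

  // Keep rows whose value in `column` matches at least one pattern.
  def keepUrlPatterns(rows: Seq[Row], column: String, patterns: Set[Regex]): Seq[Row] = {
    require(hasColumn(rows, column), s"no such column: $column")
    rows.filter(row => matchesAny(row(column), patterns))
  }

  // Drop rows whose value in `column` matches at least one pattern.
  def discardUrlPatterns(rows: Seq[Row], column: String, patterns: Set[Regex]): Seq[Row] = {
    require(hasColumn(rows, column), s"no such column: $column")
    rows.filterNot(row => matchesAny(row(column), patterns))
  }

  // Made-up sample rows with src and dest columns.
  val sample: Seq[Row] = Seq(
    Map("src" -> "http://www.archive.org/index.php",
        "dest" -> "http://www.archive.org/images/logo.jpg"),
    Map("src" -> "http://www.archive.org/about/",
        "dest" -> "http://www.archive.org/index.php")
  )
}
```

With something along these lines, `keepUrlPatterns(sample, "dest", Set(".*index.*".r))` filters on dest instead of src, and asking for a column that is not present (for example image_url on this sample) fails the `require` check up front.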

ruebot commented 4 years ago

@SinghGursimran based on the new related issue (#425), let's just worry about getting the tests updated here and not worry about dest and image_url for now, since the implementation of #425 would resolve that.

I'm running this right now on the entire GeoCities dataset for another project, and everything appears to be running smoothly :raised_hands: