archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0

CommandLineApp DomainGraphExtractor Uses Different Node IDs than WriteGraph #439

Closed ianmilligan1 closed 4 years ago

ianmilligan1 commented 4 years ago

Describe the bug

The output of the CommandLineApp DomainGraphExtractor creates different node ID types than running WriteGraph directly through the Spark shell. They should be the same.

To Reproduce

The following spark-submit command (both DF and RDD):

bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /Users/ianmilligan1/dropbox/git/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /users/ianmilligan1/dropbox/git/aut-resources/sample-data/*.gz --output /users/ianmilligan1/desktop/domaingraph-gexf --output-format GEXF --partition 1

creates an output file that looks like:

<node id="2343ec78a04c6ea9d80806345d31fd78" label="facebook.com" />
<node id="9cce24c55aee4eb39845fde935cca3da" label="web.net" />
<node id="5399465c5b23df17b16c2377e865a0b2" label="PetitionOnline.com" />
<node id="1fbfb6126d36fd25c16de2b0142700d8" label="traduku.net" />
<node id="d1063af181fe606e55ed93dd5b867169" label="en.wikipedia.org" />
<node id="0412791bbc450bbeb5b7d35eaed7e4f2" label="calendarix.com" />
<node id="fb1c73ca981330da55c56e07be521842" label="goodsforgreens.myshopify.com" />
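
The IDs above are MD5 hex digests of the node labels (via ComputeMD5). A minimal sketch of that scheme, in Python for illustration — whether aut applies any normalization to the label before hashing is an assumption I haven't checked:

```python
import hashlib

def node_id(label: str) -> str:
    # Hash-based node ID: the MD5 hex digest of the node label.
    # Deterministic, so the same label gets the same ID across
    # runs and partitionings -- unlike sequential IDs.
    return hashlib.md5(label.encode("utf-8")).hexdigest()

print(node_id("facebook.com"))  # 32 hex characters
```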

Conversely, if we run this script as per aut-docs:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

// Build (crawl date, source domain, destination domain) triples,
// keeping only links between non-empty domains seen more than five times.
val links = RecordLoader.loadArchives("/users/ianmilligan1/dropbox/git/aut-resources/sample-data/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
                               ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
                               ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraph(links, "/users/ianmilligan1/desktop/script-gexf.gexf")

We get an output that looks like:

<node id="76" label="liberalpartyofcanada-mb.ca" />
<node id="80" label="lpco.ca" />
<node id="84" label="snapdesign.ca" />
<node id="88" label="PetitionOnline.com" />
<node id="92" label="egale.ca" />
<node id="96" label="liberal.nf.net" />
<node id="100" label="policyalternatives.ca" />
<node id="1" label="collectionscanada.ca" />
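
The small integers above come from Spark's `zipWithUniqueId()`, which numbers elements by partition position: items in partition p get IDs p, p+n, p+2n, … where n is the number of partitions. IDs are unique but depend on how the data happens to be partitioned, so they aren't stable across runs. A rough Python sketch of the numbering (my own illustration, not aut code — and the step of 4 in the IDs above would be consistent with four partitions, which is an assumption on my part):

```python
def zip_with_unique_id(partitions):
    # Mimic RDD.zipWithUniqueId(): the k-th item of partition p
    # gets ID p + k * n, where n is the number of partitions.
    # IDs are unique and non-contiguous, but partition-dependent.
    n = len(partitions)
    return [(item, p + k * n)
            for p, part in enumerate(partitions)
            for k, item in enumerate(part)]

# With two partitions, one partition's items get 0, 2, 4, ... and
# the other's get 1, 3, 5, ...
print(zip_with_unique_id([["a", "b"], ["c", "d"]]))
```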

Expected behavior

The output of DomainGraphExtractor is preferable to the WriteGraph output. In other words, nodes identified by hashes are superior to nodes identified by sequential ID #s.

Environment information

ruebot commented 4 years ago

Do we have a documented rationale for why we have so many write options for graphs? Currently, we have:

- WriteGraph
- WriteGEXF
- WriteGraphml
- WriteGraphXML

Do we really need all of these? I'd argue, at the very least, we can just remove WriteGraph since it is redundant.

WriteGraph

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
                               ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
                               ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraph(links, "/home/nruest/Projects/au/sample-data/issue-439/writegraph.gexf")

WriteGEXF

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
                               ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
                               ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGEXF(links, "/home/nruest/Projects/au/sample-data/issue-439/writegexf.gexf")

These two scripts produce the same thing, other than the issue raised here. So, I'm going to open up a PR that rips WriteGraph out. AUK will need to be updated for the next release, as will all the documentation.

$ wc -l *              
  29186 writegexf.gexf
  29186 writegraph.gexf
  58372 total
ianmilligan1 commented 4 years ago

For context, issue #289 - way back in November 2018 (!) - discusses the rationale behind having both. Basically, I think the only difference is that WriteGraph uses zipWithUniqueId while WriteGEXF & WriteGraphml use ComputeMD5. There are pros and cons: WriteGraph is slower (@greebie thought 10-15% slower), while the ComputeMD5-based writers carry the chance of an MD5 hash collision.

Apologies, I should have looked this up before; I didn't think we had both of these functions in parallel, but they're both there. We should certainly kill one.

I have no strong feelings on what we keep. I guess part of me thinks that MD5 collisions are like, very rare (i.e. this random StackOverflow answer), but I'm also a historian so I'd defer to other thoughts.
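
To put "very rare" in numbers: by the standard birthday-bound approximation, the chance of any collision among k effectively-random 128-bit digests is about k(k-1)/2^129 (treating MD5 outputs as uniform, which is a simplification). A quick back-of-the-envelope check in Python:

```python
from math import expm1

def collision_prob(k: int, bits: int = 128) -> float:
    # Birthday-bound approximation of the probability that any two
    # of k uniformly random `bits`-bit digests collide:
    # P ~= 1 - exp(-k*(k-1) / 2^(bits+1)).
    # expm1 keeps precision when the probability is tiny.
    return -expm1(-k * (k - 1) / 2.0 ** (bits + 1))

# Even ten million distinct domain labels give astronomically small
# odds of a single MD5 collision (on the order of 1e-25).
print(collision_prob(10_000_000))
```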

FWIW I think we could also delete WriteGraphXML - it looks to be a product of some of the GraphX experiments we were doing 2-3 years ago? reference