Closed ianmilligan1 closed 4 years ago
Do we have a documented rationale for why we have so many write options for graphs? Currently, we have:
Do we really need all of these? I'd argue, at the very least, we can just remove WriteGraph since it is redundant.
WriteGraph
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
val links = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
    ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
    ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
WriteGraph(links, "/home/nruest/Projects/au/sample-data/issue-439/writegraph.gexf")
WriteGEXF
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
val links = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
    ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
    ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
WriteGEXF(links, "/home/nruest/Projects/au/sample-data/issue-439/writegexf.gexf")
These two scripts produce the same thing, other than the issue raised here. So, I'm going to open up a PR where we just rip it all out. AUK will need to be updated for the next release, as will all the documentation.
$ wc -l *
29186 writegexf.gexf
29186 writegraph.gexf
58372 total
For context, issue #289 - way back in November 2018 (!) - discusses the rationale behind having this. Basically, I think the only difference is that WriteGraph uses zipWithUniqueId and WriteGEXF & WriteGraphML use ComputeMD5. There are pros and cons. WriteGraph is slower (@greebie thought 10-15% slower) but WriteGEXF and WriteGraphML have the chance of an MD5 hash collision.
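To make the trade-off between the two ID schemes concrete, here is a minimal plain-Scala sketch. It is not the aut implementation: `md5Id` and `counterIds` are hypothetical helpers written for illustration, and a local `zipWithIndex` stands in for counter-style numbering so the example runs without Spark.

```scala
import java.security.MessageDigest

object NodeIdSketch {
  // Hypothetical stand-in for a hash-based node ID:
  // the hex MD5 digest of the domain string.
  def md5Id(domain: String): String =
    MessageDigest.getInstance("MD5")
      .digest(domain.getBytes("UTF-8"))
      .map(b => f"${b & 0xff}%02x")
      .mkString

  // Hypothetical stand-in for counter-based node IDs:
  // number the distinct domains in first-seen order.
  def counterIds(domains: Seq[String]): Map[String, Long] =
    domains.distinct.zipWithIndex.map { case (d, i) => d -> i.toLong }.toMap

  def main(args: Array[String]): Unit = {
    val domains = Seq("geocities.com", "yahoo.com", "geocities.com")
    // Counter IDs depend on the order in which domains are encountered.
    counterIds(domains).foreach { case (d, id) => println(s"$id\t$d") }
    // Hash IDs depend only on the domain string itself.
    domains.distinct.foreach(d => println(s"${md5Id(d)}\t$d"))
  }
}
```

In this sketch the hash-based ID for a given domain is the same on every run and in every file, while the counter-based ID changes if the input order changes. The price of the hash approach is the (small) possibility that two different domains share a digest.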
Apologies, I should have looked this up before; I didn't think we had these functions running in parallel, but they're both there. We should certainly kill one.
I have no strong feelings on what we keep. I guess part of me thinks that MD5 collisions are, like, very rare (e.g. this random StackOverflow answer), but I'm also a historian so I'd defer to other thoughts.
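As a rough sanity check on how rare accidental MD5 collisions are, the standard birthday approximation can be evaluated directly. This is a back-of-the-envelope sketch, assuming node labels behave like uniformly random 128-bit values. The helper name `collisionProbability` is made up for illustration, and the estimate says nothing about deliberately crafted collisions, which MD5 does not resist.

```scala
object Md5BirthdayBound {
  // Birthday approximation for n items hashed into 2^bits values:
  // p ~ n(n-1)/2 / 2^bits. Only meaningful while p is small.
  def collisionProbability(n: Double, bits: Int = 128): Double =
    n * (n - 1) / 2.0 / math.pow(2.0, bits)

  def main(args: Array[String]): Unit = {
    // Print the estimate for a million, a billion, and a trillion distinct nodes.
    Seq(1e6, 1e9, 1e12).foreach { n =>
      println(f"n = $n%.0e  p ~ ${collisionProbability(n)}%.3e")
    }
  }
}
```

Under these assumptions, even a billion distinct domains give a collision probability well below 1e-20, which supports the intuition that accidental collisions are not a practical concern for domain graphs.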
FWIW I think we could also delete WriteGraphXML - it looks to be a product of some of the GraphX experiments we were doing 2-3 years ago? (reference)
Describe the bug: The output of the CommandLineApp DomainGraphExtractor creates different node ID types than running WriteGraph directly through spark shell. They should be the same.
To Reproduce: The following command line command (both DF and RDD):
creates an output file that looks like:
Conversely, if we run this script as per aut-docs:
We get an output that looks like:
Expected behavior: The output of DomainGraphExtractor is preferable to the WriteGraph output. In other words, having nodes as hashes is superior to having nodes as ID #s.
Environment information:
--jars