archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0

Benchmarking Scala vs Python #121

Closed ianmilligan1 closed 6 years ago

ianmilligan1 commented 6 years ago

We are comparing the timings of Scala vs Python to help us make informed decisions on the migration.

greebie commented 6 years ago

import io.archivesunleashed.spark.matchbox._
import io.archivesunleashed.spark.rdd.RecordRDD._
import StringUtils._
import java.time.Instant

// Runs the block once and prints the elapsed wall-clock time in milliseconds.
def timed(f: => Unit) = {
  val start = System.currentTimeMillis()
  f
  val end = System.currentTimeMillis()
  println("Elapsed Time: " + (end - start))
}

// Test 1: count records per domain.
timed {
  println("Get urls and count, taking 3.")
  val r = RecordLoader.loadArchives("/shared/au/example.arc.gz", sc)
    .map(r => ExtractDomain(r.getUrl))
    .countItems()
  println(r.take(3).deep.mkString("\n"))
}

// Test 2: count (source domain, target domain) link pairs.
timed {
  println("Get Hyperlinks from text and site and count, filtering out counts < 5,  taking 3.")
  val links = RecordLoader.loadArchives("/shared/au/example.arc.gz", sc)
    .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
    .map(r => (ExtractDomain(r._1).removePrefixWWW(), ExtractDomain(r._2).removePrefixWWW()))
    .filter(r => r._1 != "" && r._2 != "")
    .countItems()
    .filter(r => r._2 > 5)
  println(links.take(3).deep.mkString("\n"))
}

// Test 3: same as test 2, but grouped by crawl date.
timed {
  println("Get links from text and site, group by date and count, filtering out counts < 5,  taking 3.")
  val crawlDateGroup = RecordLoader.loadArchives("/shared/au/example.arc.gz", sc)
    .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
    .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
    .filter(r => r._2 != "" && r._3 != "")
    .countItems()
    .filter(r => r._2 > 5)
  println(crawlDateGroup.take(3).deep.mkString("\n"))
}

// Test 4: extract plain text from each record.
timed {
  println("Extract text, taking 3 examples.")
  val text = RecordLoader.loadArchives("/shared/au/example.arc.gz", sc)
    .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  println(text.take(3).deep.mkString("\n"))
}

// Test 5: count image URLs.
timed {
  println("Extract image urls, taking 3.")
  val images = RecordLoader.loadArchives("/shared/au/example.arc.gz", sc)
    .flatMap(r => ExtractImageLinks(r.getUrl, r.getContentString))
    .countItems()
  println(images.take(3).deep.mkString("\n"))
}
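
For anyone following along in Python, here is a rough sketch of what the countItems step in the tests above appears to do. It assumes countItems tallies each distinct value and returns the tallies sorted by descending count, which is inferred from the sample output in this thread rather than from the toolkit's source. The name `count_items` and the sample list are hypothetical.

```python
from collections import Counter

def count_items(items):
    """Count occurrences of each distinct item, most frequent first."""
    # Counter.most_common() returns (item, count) pairs sorted by descending count.
    return Counter(items).most_common()

# Hypothetical stand-in for the domains extracted from an archive.
domains = ["www.archive.org", "deadlists.com", "www.archive.org", "www.archive.org"]

# Analogous to .countItems() followed by .take(3) in the script above.
top = count_items(domains)[:3]
print(top)
```

On a real collection the counting happens in a distributed shuffle, so this only illustrates the result shape, not the cost being benchmarked.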
ruebot commented 6 years ago

@ianmilligan1 @greebie can you walk me through your plan for this testing? A bulleted or numbered list of instructions would be good. I see 6 things here in the script above. These are the Scala scripts, I presume. Then we're doing the same again with the Python equivalents?

ianmilligan1 commented 6 years ago

My understanding from my conversations with Ryan is the following:

Then

greebie commented 6 years ago

Ian has it right. The only thing I would add is that we want to do it for small, medium and perhaps a large script, to see how they scale. (The large script may take too long and use up too many resources for what we have now.)

ruebot commented 6 years ago

These three are probably the best to test for small, medium, and large.

2G /shared/au/university-of-alberta-libraries-401/ottawa-shooting-october-2014-4957

90G /shared/au/university-of-alberta-libraries-401/fort-mcmurray-wildfire-2016-7368

828G /shared/au/university-of-alberta-libraries-401/prairie-provinces-2402

If you can provide me some sample output of the Scala script above, and the Python script when you're ready, that'd be super helpful.

ianmilligan1 commented 6 years ago

Excellent, thanks @ruebot.

@greebie – can you run the Scala scripts on Ottawa and share the output here (via Gist if it's overly long).

I'm working on the Python scripts right now.

greebie commented 6 years ago

Results for Example, Ottawa Shooting, and YMM Fires (elapsed times in milliseconds, as printed by the scripts). All other aspects of the script check out (same output in each case).

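EXAMPLE

| Trial | Url count (Scala) | Url count (Python) | Hyperlinks Structure (Scala) | Hyperlinks Structure (Python) | Dated Hyperlinks Structure (Scala) | Dated Hyperlinks Structure (Python) |
|---|---|---|---|---|---|---|
| 1 | 4214 | 8367 | 4313 | 15341 | 4579 | 12760 |
| 2 | 657 | 5006 | 1279 | 12594 | 1353 | 12487 |
| 3 | 618 | 3722 | 1164 | 12979 | 1128 | 12838 |
| 4 | 509 | 1145 | 1133 | 12415 | 1145 | 14471 |
| 5 | 577 | 2021 | 1190 | 13387 | 1547 | 12466 |

| Trial | Extract Text (Scala) | Extract Text (Python) | Image Urls (Scala) | Image Urls (Python) |
|---|---|---|---|---|
| 1 | 520 | 533 | 846 | 10474 |
| 2 | 238 | 3721 | 1288 | 12829 |
| 3 | 129 | 336 | 688 | 9620 |
| 4 | 127 | 314 | 666 | 9806 |
| 5 | 232 | 287 | 870 | 9639 |

OTTAWA SHOOTING

| Trial | Url count (Scala) | Url count (Python) | Hyperlinks Structure (Scala) | Hyperlinks Structure (Python) | Dated Hyperlinks Structure (Scala) | Dated Hyperlinks Structure (Python) |
|---|---|---|---|---|---|---|
| 1 | 99439 | 116523 | 184099 | 2223305 | 4579 | 2237966 |
| 2 | 96016 | 116078 | 184658 | 2152270 | 187146 | 2196058 |
| 3 | 96875 | 113088 | 185923 | 2181362 | 188128 | 2221024 |
| 4 | 95334 | 116071 | 183133 | 2184187 | 181261 | 2251721 |
| 5 | 97436 | 109904 | 180737 | 2142623 | 178208 | 2223439 |

| Trial | Extract Text (Scala) | Extract Text (Python) | Image Urls (Scala) | Image Urls (Python) |
|---|---|---|---|---|
| 1 | 722 | 715 | 152187 | 1716.842 |
| 2 | 522 | 704 | 148542 | 1752192 |
| 3 | 315 | 576 | 148809 | 1745742 |
| 4 | 532 | 643 | 152651 | 1800866 |
| 5 | 302 | 530 | 143359 | 1745503 |

YMM FIRES

| Trial | Url count (Scala) | Url count (Python) | Hyperlinks Structure (Scala) | Hyperlinks Structure (Python) | Dated Hyperlinks Structure (Scala) | Dated Hyperlinks Structure (Python) |
|---|---|---|---|---|---|---|
| 1 | 268930 | 289017 | 276206 | 1114322 | 281291 | 1091865 |
| 2 | 222069 | 268609 | 283345 | 1115210 | 306255 | 1099671 |
| 3 | 214435 | 227713 | 293025 | 1107478 | 287533 | 1107078 |
| 4 | 205357 | 246269 | 279142 | 1106769 | 279851 | 1088578 |
| 5 | 233336 | 210771 | 266573 | 1109853 | 286223 | 1115048 |

| Trial | Extract Text (Scala) | Extract Text (Python) | Image Urls (Scala) | Image Urls (Python) |
|---|---|---|---|---|
| 1 | 262 | 570 | 254171 | 948225 |
| 2 | 240 | 884 | 247527 | 939791 |
| 3 | 197 | 431 | 250581 | 958162 |
| 4 | 238 | 394 | 251905 | 968368 |
| 5 | 285 | 452 | 303458 | 952961 |
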
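
Since the first trial in several columns is much slower than the rest, a median may summarize each column better than a mean. The sketch below does this for one column pair only, the "Url count" test on EXAMPLE, using the five values copied from the results above. It assumes the figures are milliseconds; the function name `summarize` is hypothetical.

```python
from statistics import median

# "Url count" timings on the EXAMPLE archive, trials 1-5, copied from the results above.
# Assumed to be milliseconds, as printed by the timing helpers in the scripts.
scala_ms = [4214, 657, 618, 509, 577]
python_ms = [8367, 5006, 3722, 1145, 2021]

def summarize(times):
    """Return (first_trial, median_of_all_trials) for a list of timings."""
    return times[0], median(times)

scala_first, scala_median = summarize(scala_ms)
python_first, python_median = summarize(python_ms)
ratio = python_median / scala_median

print(f"Scala:  first={scala_first} median={scala_median}")
print(f"Python: first={python_first} median={python_median}")
print(f"Python/Scala median ratio: {ratio:.1f}x")
```

The same helper can be applied to any other column by swapping in its five values.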
ruebot commented 6 years ago

@greebie I was looking for the raw output of the script.

greebie commented 6 years ago

Output here: https://gist.github.com/greebie/e93dae5ba0869de43ef1d635c5bad0ce

If you are looking for a --verbose output let me know. I am running a big job right now unfortunately, but will get one out ASAP.

ianmilligan1 commented 6 years ago

Python instructions are at https://gist.github.com/ianmilligan1/1436be06a5d2293bf3b6447493c962c3.

greebie commented 6 years ago

Encountered this lovely problem: https://www.thoughtvector.io/blog/python-3-on-spark-return-of-the-pythonhashseed/

I am going to try and avoid using dictionary-like processes for now. I think we should remove dictionary-like processes like .countByValue() from the docs as well.

Solution is to export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0. This is a problem with Spark 2.1.1 and will be fixed in Spark 2.2.
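
To illustrate why the seed matters, here is a small standalone sketch with no Spark involved. Python 3 randomizes string hashes per process unless PYTHONHASHSEED is fixed, so separate worker processes can disagree on the hash of the same key. The helper name `string_hash` and the sample key are hypothetical.

```python
import os
import subprocess
import sys

def string_hash(seed):
    """Return hash('archive.org') as computed by a fresh interpreter
    started with PYTHONHASHSEED set to `seed`."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('archive.org'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)

# With the same fixed seed, two separate processes agree on the hash,
# which is what hash-based grouping of keys across workers relies on.
print("seed 0 twice, equal:", string_hash(0) == string_hash(0))

# With different seeds the hashes will almost certainly differ.
print("seed 1 vs seed 2, equal:", string_hash(1) == string_hash(2))
```

Each call starts a new interpreter, which roughly mimics independent Python workers on different executors.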

greebie commented 6 years ago

Would appreciate a review of the following code to test against the above code before I start running the scripts at scale. Thanks!

import re
import time

import RecordLoader
from DFTransformations import *
from ExtractDomain import ExtractDomain
from ExtractLinks import ExtractLinks
from ExtractDate import DateComponent
from ExtractImageLinks import ExtractImageLinks
from RemoveHTML import RemoveHTML
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

# replace with your own path to archive file
path = "/shared/au/example.arc.gz"
spark = SparkSession.builder.appName("filterByDate").getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("WARN")

def timed(func):
    # Runs func once and prints the elapsed wall-clock time in milliseconds.
    start = time.time()
    func()
    end = time.time()
    print("Elapsed Time: " + str((end - start)*1000))

def test1():
    print("Get urls and count, taking 3.")
    df = RecordLoader.loadArchivesAsDF(path, sc, spark)
    df.groupBy("domain").count().sort(desc("count")).show(n=3)

def test2():
    print("Get Hyperlinks from text and site and count, filtering out counts < 5,  taking 3.")
    rdd = RecordLoader.loadArchivesAsRDD(path, sc, spark)\
        .flatMap(lambda r: ExtractLinks(r.url, r.contentString))\
        .map(lambda r: (ExtractDomain(r[0]), ExtractDomain(r[1])))\
        .filter(lambda r: r[0] is not None and r[0] != "" and r[1] is not None and r[1] != "")
    print(countItems(rdd).filter(lambda r: r[1] > 5).take(3))

def test3():
    print("Get links from text and site, group by date and count, filtering out counts < 5,  taking 3.")
    df = RecordLoader.loadArchivesAsDF(path, sc, spark)
    fdf = df.select(df['crawlDate'], df['url'], df['contentString'])
    rdd = fdf.rdd
    # Strip a leading "www." with the same pattern as the Scala script.
    rddx = rdd.map(lambda r: (r.crawlDate, ExtractLinks(r.url, r.contentString)))\
        .flatMap(lambda r: map(lambda f: (r[0], ExtractDomain(f[0]), ExtractDomain(f[1])), r[1]))\
        .filter(lambda r: r[1] is not None and r[2] is not None)\
        .map(lambda r: (r[0], re.sub(r'^\s*www\.', '', r[1]), re.sub(r'^\s*www\.', '', r[2])))
    # Keep only link pairs seen more than 5 times, as the Scala script does.
    print(countItems(rddx).filter(lambda r: r[1] > 5).take(3))

def test4():
    print("Extract text, taking 3 examples.")
    text = RecordLoader.loadArchivesAsRDD(path, sc, spark)\
        .map(lambda r: (r.crawlDate, r.domain, r.url, RemoveHTML(r.contentString)))
    print(text.take(3))

def test5():
    print("Extract image urls, taking 3.")
    images = RecordLoader.loadArchivesAsRDD(path, sc, spark)\
        .flatMap(lambda r: ExtractImageLinks(r.url, r.contentString))
    print(countItems(images).take(3))

if __name__ == '__main__':
    # Five trials of each test, matching the trial counts in the results.
    for _ in range(5):
        timed(test1)
        timed(test2)
        timed(test3)
        timed(test4)
        timed(test5)
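
One possible variation on the timing helper, offered as a sketch rather than a replacement: it returns the elapsed time instead of printing it, so per-trial timings can be collected and summarized. It uses `time.perf_counter()`, a monotonic clock, where the script above uses `time.time()`. The names `timed_ms`, `benchmark`, and `sample_job` are hypothetical, and `sample_job` is only a stand-in for test1 to test5.

```python
import time
from statistics import median

def timed_ms(func):
    """Run func once and return the elapsed wall-clock time in milliseconds."""
    start = time.perf_counter()  # monotonic, unaffected by system clock changes
    func()
    return (time.perf_counter() - start) * 1000.0

def benchmark(func, trials=5):
    """Run func `trials` times; return the per-trial timings and their median."""
    timings = [timed_ms(func) for _ in range(trials)]
    return timings, median(timings)

def sample_job():
    # Hypothetical stand-in for a Spark test; replace with test1, test2, etc.
    sum(i * i for i in range(100_000))

timings, mid = benchmark(sample_job, trials=5)
print("Per-trial ms:", [round(t, 2) for t in timings])
print("Median ms:", round(mid, 2))
```

Returning the numbers also makes it easier to write them to a file per dataset instead of copying them from the console.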
greebie commented 6 years ago

The script produces:

Get urls and count, taking 3.
+------------------+-----+
|            domain|count|
+------------------+-----+
|   www.archive.org|  132|
|     deadlists.com|    2|
|www.hideout.com.br|    1|
+------------------+-----+

Elapsed Time: 8.73497486114502
Get Hyperlinks from text and site and count, filtering out counts < 5,  taking 3.
[(('www.archive.org', 'www.archive.org'), 304), (('www.archive.org', 'wiki.etree.org'), 21), (('www.archive.org', 'creativecommons.org'), 12)]
Elapsed Time: 15.355172872543335
Get links from text and site, group by date and count, filtering out counts < 5,  taking 3.
[(('20080430', 'archive.org', 'archive.org'), 316), (('20080430', 'archive.org', 'wiki.etree.org'), 21), (('20080430', 'archive.org', 'creativecommons.org'), 12)]
Elapsed Time: 12.715660572052002
Extract text, taking 3 examples.
[('20080430', 'www.archive.org', 'http://www.archive.org/', ' document.location="http://www.archive.org/index.php"; Please visit our website at: http://www.archive.org '), ('20080430', 'www.archive.org', 'http://www.archive.org/index.php', '  Internet Archive Web |  Moving Images |  Texts |  Audio |  Software |  Education |  Patron Info |  About IA  Forums | FAQs | Contributions | Jobs | Donate Search: All Media Types \xa0\xa0Wayback Machine \xa0\xa0Moving Images \xa0\xa0\xa0\xa0Animation & Cartoons \xa0\xa0\xa0\xa0Arts & Music \xa0\xa0\xa0\xa0Computers & Technology \xa0\xa0\xa0\xa0Cultural & Academic Films \xa0\xa0\xa0\xa0Ephemeral Films \xa0\xa0\xa0\xa0Movies \xa0\xa0\xa0\xa0News & Public Affairs \xa0\xa0\xa0\xa0Non-English Videos \xa0\xa0\xa0\xa0Open Source Movies \xa0\xa0\xa0\xa0Prelinger Archives \xa0\xa0\xa0\xa0Sports Videos \xa0\xa0\xa0\xa0Video Games \xa0\xa0\xa0\xa0Vlogs \xa0\xa0\xa0\xa0Youth Media \xa0\xa0Texts \xa0\xa0\xa0\xa0American Libraries \xa0\xa0\xa0\xa0Canadian Libraries \xa0\xa0\xa0\xa0Open Source Books \xa0\xa0\xa0\xa0Project Gutenberg \xa0\xa0\xa0\xa0Biodiversity Heritage Library \xa0\xa0\xa0\xa0Children\'s Library \xa0\xa0\xa0\xa0Additional Collections \xa0\xa0Audio \xa0\xa0\xa0\xa0Audio Books & Poetry \xa0\xa0\xa0\xa0Computers & Technology \xa0\xa0\xa0\xa0Grateful Dead \xa0\xa0\xa0\xa0Live Music Archive \xa0\xa0\xa0\xa0Music & Arts \xa0\xa0\xa0\xa0Netlabels \xa0\xa0\xa0\xa0News & Public Affairs \xa0\xa0\xa0\xa0Non-English Audio \xa0\xa0\xa0\xa0Open Source Audio \xa0\xa0\xa0\xa0Podcasts \xa0\xa0\xa0\xa0Radio Programs \xa0\xa0\xa0\xa0Spirituality & Religion \xa0\xa0Software \xa0\xa0\xa0\xa0CLASP \xa0\xa0\xa0\xa0Tucows Software Library \xa0\xa0Education Forums FAQs   UploadAnonymous User (login or join us)\xa0 \xa0\xa0 Announcements (more)   Free Ultra High-Speed Internet to Public Housing    Rise of the HighTech Non-Profits    Zotero and Internet Archive join forces  \xa0\xa0 Web85 billion pages Advanced Search \xa0\xa0 Welcome to the Archive 
The Internet Archive is building a digital library of Internet    sites and other cultural artifacts in digital form. Like a paper    library, we provide free access to researchers, historians,    scholars, and the general public. \xa0\xa0 \xa0\xa0 Moving Images\xa0115,646 movies Browse\xa0\xa0                      (by keyword) \xa0\xa0 Live Music Archive\xa048,893 concerts Browse\xa0\xa0                      (by band) \xa0\xa0 Audio\xa0250,854 recordings Browse\xa0\xa0                      (by keyword) \xa0\xa0 Texts\xa0395,004 texts Browse\xa0\xa0                      (by keyword) \xa0\xa0 \xa0\xa0 Curator\'s Choice (more)  A Few Good G-MenRandall Glass, the maker of "Warthog Jump," re-creates in "A Few Good G-Men" an entire scene from... \xa0\xa0 Curator\'s Choice (more)  Grateful Dead Live at Nashville Municipal...Set 1 Sugaree Beat It On Down The Line Candyman Me And My Uncle -> Big River Stagger Lee Looks Like... \xa0\xa0 Curator\'s Choice (more)  Zanstones - Slaakhuis: Live in Rotterdam, HollandZanstones confuses the dutch masses with this live display of wacked rhythms, whacked vocals, and... \xa0\xa0 Curator\'s Choice (more)  Secret armies; the new technique of Nazi warfare \xa0\xa0 \xa0\xa0 Recent Reviews Code4Lib 2008: Can Resource Description become Rigorous Data?Average rating:  Madonna adopts African baby.Average rating:  \xa0\xa0 Recent Reviews Carolina Chocolate Drops Live at MerleFest on 2007-04-27Average rating:  Grateful Dead Live at Oakland-Alameda County Coliseum on 1988-12-28Average rating:  \xa0\xa0 Recent Reviews No ThoroughfareAverage rating:  JAHTARI RIDDIM FORCE - Farmer In The Sky / Depth ChargeAverage rating:  \xa0\xa0 Recent Reviews A manual of chemical analysis, qualitative and quantitativeAverage rating:  Chemical lecture experiments; non-metallic elementsAverage rating:  \xa0\xa0 \xa0\xa0 Most recent posts (write a post by going to a forum) more...  Subject Poster Forum RepliesViewsDate   Re: Making a mix for a chick I know...  
William Tell  GratefulDead  0  6  20 minutes ago    Re: Bob\'s shorts not going into archives  BobsShortShorts  GratefulDead  0  9  26 minutes ago    Re: Thanks to All  airgarcia416  GratefulDead  0  5  26 minutes ago    Re: Bob\'s shorts not going into archives  sydthecat2  GratefulDead  0  8  36 minutes ago    Re: What is the worst-reviewed feature film on IA?  RipJarvis  feature_films  0  9  50 minutes ago    Re: Playin\' In The Band...all day and all night  sydthecat2  GratefulDead  0  11  58 minutes ago    Re: Playin\' In The Band...all day and all night  rastamon  GratefulDead  0  16  1 hour ago    Re: Making a mix for a chick I know...  caspersvapors  GratefulDead  1  11  1 hour ago    Re: Bob\'s shorts not going into archives  rastamon  GratefulDead  0  11  1 hour ago    Re: Bob\'s shorts not going into archives  bluedevil  GratefulDead  1  13  1 hour ago  \xa0\xa0 \xa0Institutional Support Alexa Internet HP Computer The Kahle/Austin Foundation Prelinger Archives National Science Foundation Library of Congress LizardTech Sloan Foundation Individual      contributors \xa0 Skin: classic | columns | custom! Terms of Use (10 Mar 2001) '), ('20080430', 'www.archive.org', 'http://www.archive.org/details/DrinkingWithBob-MadonnaAdoptsAfricanBaby887', " Internet Archive: Details: Madonna adopts African baby. 
Web |  Moving Images |  Texts |  Audio |  Software |  Education |  Patron Info |  About IA    Home Animation & Cartoons |  Arts & Music |  Computers & Technology |  Cultural & Academic Films |  Ephemeral Films |  Movies |  News & Public Affairs |  Non-English Videos |  Open Source Movies |  Prelinger Archives |  Sports Videos |  Video Games |  Vlogs |  Youth Media Search: All Media Types \xa0\xa0Wayback Machine \xa0\xa0Moving Images \xa0\xa0\xa0\xa0Animation & Cartoons \xa0\xa0\xa0\xa0Arts & Music \xa0\xa0\xa0\xa0Computers & Technology \xa0\xa0\xa0\xa0Cultural & Academic Films \xa0\xa0\xa0\xa0Ephemeral Films \xa0\xa0\xa0\xa0Movies \xa0\xa0\xa0\xa0News & Public Affairs \xa0\xa0\xa0\xa0Non-English Videos \xa0\xa0\xa0\xa0Open Source Movies \xa0\xa0\xa0\xa0Prelinger Archives \xa0\xa0\xa0\xa0Sports Videos \xa0\xa0\xa0\xa0Video Games \xa0\xa0\xa0\xa0Vlogs \xa0\xa0\xa0\xa0Youth Media \xa0\xa0Texts \xa0\xa0\xa0\xa0American Libraries \xa0\xa0\xa0\xa0Canadian Libraries \xa0\xa0\xa0\xa0Open Source Books \xa0\xa0\xa0\xa0Project Gutenberg \xa0\xa0\xa0\xa0Biodiversity Heritage Library \xa0\xa0\xa0\xa0Children's Library \xa0\xa0\xa0\xa0Additional Collections \xa0\xa0Audio \xa0\xa0\xa0\xa0Audio Books & Poetry \xa0\xa0\xa0\xa0Computers & Technology \xa0\xa0\xa0\xa0Grateful Dead \xa0\xa0\xa0\xa0Live Music Archive \xa0\xa0\xa0\xa0Music & Arts \xa0\xa0\xa0\xa0Netlabels \xa0\xa0\xa0\xa0News & Public Affairs \xa0\xa0\xa0\xa0Non-English Audio \xa0\xa0\xa0\xa0Open Source Audio \xa0\xa0\xa0\xa0Podcasts \xa0\xa0\xa0\xa0Radio Programs \xa0\xa0\xa0\xa0Spirituality & Religion \xa0\xa0Software \xa0\xa0\xa0\xa0CLASP \xa0\xa0\xa0\xa0Tucows Software Library \xa0\xa0Education Forums FAQs Advanced Search UploadAnonymous User (login or join us)\xa0    .grayBack { background-color: #D8DEDE; }   .lightBack { background-color: #339933; }   .lightBorder { border: 2px solid #339933; }   .darkBack { background-color: #115500; }   .darkBorder { border: 2px solid #115500; }   .darkFore { color: #115500; }   
h1 { background-color: #115500; }   h3 { background-color: #339933; }   h2 { background-color: #D8DEDE;        color: #115500; } View movie View thumbnails       Run time: 00:01:37Play / Download          (help)     Quicktime     (1.3 MB)   All files: FTP HTTPResources       Bookmark       Report errorsMadonna adopts African baby.             Internet Archive's in-browser video player requires              JavaScript to be enabled.  It appears your browser does not have it             turned on.  Please see your browser settings for this feature.            embedding and helpMadonna is an arrogant, publicity hungry, piece of trash!!!This item is part of the collection: blip.tv             Write a review                Reviews          Downloaded       61        times          Average Rating:     Reviewer: _sprout -            -            April 27, 2008Subject: Madonna is a washed up hag trying to keep her name in the papers+5 stars because I agree with your general statement that these 'exotic' kids are like exotic pets for rich people and celebs to show off.-2 stars because this sort of thing is better suited to Youtube.Reviewer: XXXmoan -            -            April 25, 2008Subject: are You freakin SeriousWhat the fuck! who cares if she goes to adopt an african, thats none of your business. you need to chill like that other bitch who claims that she doesn't care that madonna fell off a horse. so my question is.............................................................................what the fuck your problem Terms of Use (10 Mar 2001) ")]
Elapsed Time: 0.37613391876220703
Extract image urls, taking 3.
[('http://www.archive.org/images/star.png', 408), ('http://www.archive.org/images/no_star.png', 122), ('http://www.archive.org/images/logo.jpg', 118)]
Elapsed Time: 9.878446817398071
greebie commented 6 years ago

First test samples updated above. It seems tests 2 and 3 are taking longer than expected. I am not sure whether this is a design issue or a scale issue. I will need to look at these at scale and will upload results soon.

ianmilligan1 commented 6 years ago

This all looks good to me, Ryan, seems consistent with the Scala ones. Looking forward to your thoughts!

ianmilligan1 commented 6 years ago

Benchmarking revealed #130 – @greebie and I have been looking into fixing these serious performance issues.

greebie commented 6 years ago

Closing for now as dependent on a resolution to #130. Once #130's done, we can redo the numbers and see what changes.

eromoe commented 4 years ago

Did the Python version use pyarrow and pandas_udf? I have seen many articles saying that PySpark gets a performance boost with that setup.