Benchmarking Scala vs Python

ianmilligan1 commented 6 years ago

We are comparing the timings of Scala vs Python to help us make informed decisions on the migration.

greebie commented 6 years ago

import io.archivesunleashed.spark.matchbox._
import io.archivesunleashed.spark.rdd.RecordRDD._
import StringUtils._
import java.time.Instant

def timed(f: => Unit) = {
  val start = System.currentTimeMillis()
  f
  val end = System.currentTimeMillis()
  println("Elapsed Time: " + (end - start))
}

timed {
println("Get urls and count, taking 3.")
val r = RecordLoader.loadArchives("/shared/au/example.arc.gz", sc)
.map (r => ExtractDomain(r.getUrl))
.countItems()
println(r.take(3).deep.mkString("\n"))
}

timed {
println("Get Hyperlinks from text and site and count, filtering out counts < 5,  taking 3.")
val links = RecordLoader.loadArchives("/shared/au/example.arc.gz", sc)
.flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
.map(r => (ExtractDomain(r._1).removePrefixWWW(), ExtractDomain(r._2).removePrefixWWW()))
.filter(r => r._1 != "" && r._2 != "")
.countItems()
.filter(r => r._2 > 5)
println(links.take(3).deep.mkString("\n"))
}

timed {
println("Get links from text and site, group by date and count, filtering out counts < 5,  taking 3.")
val crawlDateGroup = RecordLoader.loadArchives("/shared/au/example.arc.gz", sc)
.map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
.flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
println(crawlDateGroup.take(3).deep.mkString("\n"))
}

timed {
println ("Extract text, taking 3 examples.")
val text = RecordLoader.loadArchives("/shared/au/example.arc.gz", sc)
.map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
println(text.take(3).deep.mkString("\n"))
}

timed {
println ("Extract image urls, taking 3.")
val images = RecordLoader.loadArchives("/shared/au/example.arc.gz", sc)
.flatMap(r => ExtractImageLinks(r.getUrl, r.getContentString))
.countItems()
println (images.take(3).deep.mkString("\n"))
}

ruebot commented 6 years ago

@ianmilligan1 @greebie can you walk me through your plan for this testing? Bulleted or numbered instructions are good. I see 6 things here in the script above. These are the Scala scripts, I presume. Then we're doing the same again with the Python equivalents?

ianmilligan1 commented 6 years ago

My understanding from my conversations with Ryan is the following:

1. Load Spark-Shell for Scala testing with x executors, y cores, z memory (we'll keep these consistent between Scala and Python)
2. Run the following basic set of scripts on the collection (YMM-Fire to start, 90GB):
   - URL Counts (extracting domains in collections)
   - Hyperlink Extraction
   - Hyperlink Extraction by Date
   - Plain Text Extraction
   - Image URL Extraction
3. Run the script 5 times and record the results of each run

Then

1. Load PySpark for testing with x executors, y cores, z memory (consistent with above)
2. Run the same set of scripts on the collection
3. Run the script 5 times and record the results of each run
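
The "run 5 times and record results" step could be sketched as a small harness like the one below. This is only an illustration, not the script used in this thread: `run_trials`, `summarize`, and the stand-in workloads are hypothetical names, and the real tests would be the Spark jobs listed above.

```python
import statistics
import time


def run_trials(tests, trials=5):
    """Run every test `trials` times; return elapsed times in ms per test name."""
    results = {name: [] for name in tests}
    for _ in range(trials):
        for name, func in tests.items():
            start = time.perf_counter()  # monotonic clock, suited to interval timing
            func()
            results[name].append((time.perf_counter() - start) * 1000.0)
    return results


def summarize(results):
    """Return (min, median, max) in ms for each test."""
    return {
        name: (min(times), statistics.median(times), max(times))
        for name, times in results.items()
    }


if __name__ == "__main__":
    # Stand-in workloads (assumptions); replace with the real benchmark functions.
    demo_tests = {
        "url_counts": lambda: sum(range(100000)),
        "text_extraction": lambda: "x".join(str(i) for i in range(10000)),
    }
    for name, (low, mid, high) in summarize(run_trials(demo_tests)).items():
        print("{}: min={:.1f} ms, median={:.1f} ms, max={:.1f} ms".format(name, low, mid, high))
```

Keeping every trial rather than a single average makes warm-up effects visible, since the first run of a job is often slower than the rest.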

greebie commented 6 years ago

Ian has it right. The only thing I would add is that we want to do it for a small, a medium and perhaps a large collection, to see how the scripts scale. (The large collection may take too long and use up too many resources for what we have now.)

ruebot commented 6 years ago

These three are probably the best to test for small, medium, and large.

2G /shared/au/university-of-alberta-libraries-401/ottawa-shooting-october-2014-4957

90G /shared/au/university-of-alberta-libraries-401/fort-mcmurray-wildfire-2016-7368

828G /shared/au/university-of-alberta-libraries-401/prairie-provinces-2402

If you can provide me with some sample output of the Scala script above, and the Python script when you're ready, that'd be super helpful.

ianmilligan1 commented 6 years ago

Excellent, thanks @ruebot.

@greebie – can you run the Scala scripts on Ottawa and share the output here (via Gist if it's overly long)?

I'm working on the Python scripts right now.

greebie commented 6 years ago

Results for Example, Ottawa Shooting and YMM Fires, as elapsed time in milliseconds printed by the timed helper. All other aspects of the script check out (same output in each case).

EXAMPLE
Trial	Url count (Scala)	Url count (Python)	Hyperlinks Structure (Scala)	Hyperlinks Structure (Python)	Dated Hyperlinks Structure (Scala)	Dated Hyperlinks Structure (Python)
trial 1	4214	8367	4313	15341	4579	12760
2	657	5006	1279	12594	1353	12487
3	618	3722	1164	12979	1128	12838
4	509	1145	1133	12415	1145	14471
5	577	2021	1190	13387	1547	12466

EXAMPLE
Trial	Extract Text (Scala)	Extract Text (Python)	Image Urls (Scala)	Image Urls (Python)
trial 1	520	533	846	10474
2	238	3721	1288	12829
3	129	336	688	9620
4	127	314	666	9806
5	232	287	870	9639

OTTAWA SHOOTING
Trial	Url count (Scala)	Url count (Python)	Hyperlinks Structure (Scala)	Hyperlinks Structure (Python)	Dated Hyperlinks Structure (Scala)	Dated Hyperlinks Structure (Python)
trial 1	99439	116523	184099	2223305	4579	2237966
2	96016	116078	184658	2152270	187146	2196058
3	96875	113088	185923	2181362	188128	2221024
4	95334	116071	183133	2184187	181261	2251721
5	97436	109904	180737	2142623	178208	2223439

OTTAWA SHOOTING
Trial	Extract Text (Scala)	Extract Text (Python)	Image Urls (Scala)	Image Urls (Python)
trial 1	722	715	152187	1716.842
2	522	704	148542	1752192
3	315	576	148809	1745742
4	532	643	152651	1800866
5	302	530	143359	1745503

YMM FIRES
Trial	Url count (Scala)	Url count (Python)	Hyperlinks Structure (Scala)	Hyperlinks Structure (Python)	Dated Hyperlinks Structure (Scala)	Dated Hyperlinks Structure (Python)
trial 1	268930	289017	276206	1114322	281291	1091865
2	222069	268609	283345	1115210	306255	1099671
3	214435	227713	293025	1107478	287533	1107078
4	205357	246269	279142	1106769	279851	1088578
5	233336	210771	266573	1109853	286223	1115048

YMM FIRES
Trial	Extract Text (Scala)	Extract Text (Python)	Image Urls (Scala)	Image Urls (Python)
trial 1	262	570	254171	948225
2	240	884	247527	939791
3	197	431	250581	958162
4	238	394	251905	968368
5	285	452	303458	952961
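
To put a single number on the gap, one option is to compare medians across the five trials. The sketch below does this for the YMM Fires "Hyperlinks Structure" columns, copying the values from the table above and assuming they are milliseconds; `slowdown` is a hypothetical helper, not part of the benchmark scripts.

```python
import statistics

# Elapsed times copied from the YMM Fires "Hyperlinks Structure" columns (trials 1-5).
scala_ms = [276206, 283345, 293025, 279142, 266573]
python_ms = [1114322, 1115210, 1107478, 1106769, 1109853]


def slowdown(baseline, candidate):
    """Ratio of the median candidate time to the median baseline time."""
    return statistics.median(candidate) / statistics.median(baseline)


ratio = slowdown(scala_ms, python_ms)
print("Python/Scala median ratio: {:.2f}".format(ratio))
```

For this column the Python run comes out at roughly four times the Scala run. The median is used rather than the mean so that one slow first trial does not dominate the comparison.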

ruebot commented 6 years ago

@greebie I was looking for the raw output of the script.

greebie commented 6 years ago

Output here: https://gist.github.com/greebie/e93dae5ba0869de43ef1d635c5bad0ce

If you are looking for a --verbose output let me know. I am running a big job right now unfortunately, but will get one out ASAP.

ianmilligan1 commented 6 years ago

Python instructions are at https://gist.github.com/ianmilligan1/1436be06a5d2293bf3b6447493c962c3.

greebie commented 6 years ago

Encountered this lovely problem: https://www.thoughtvector.io/blog/python-3-on-spark-return-of-the-pythonhashseed/

I am going to try and avoid using dictionary-like processes for now. I think we should remove dictionary-like processes like .countByValue() from the docs as well.

Solution is to export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0. This is a problem with Spark 2.1.1 and will be fixed in Spark 2.2.
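
For context, Python 3 randomizes string hashes per interpreter process unless PYTHONHASHSEED is set, so separate worker processes can disagree on the hash of the same key. The sketch below is a plain-Python illustration without Spark; the helper name `string_hash_in_fresh_interpreter` is made up for this example. It shows that fixing the seed makes the hash reproducible across processes.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(seed=None):
    """Return hash('archive.org') as computed by a newly started Python process.

    seed=None leaves PYTHONHASHSEED unset, so hashing is randomized per process.
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "print(hash('archive.org'))"], env=env
    )
    return int(out.decode().strip())


# With the seed pinned to 0, two independent processes agree on the hash.
fixed_a = string_hash_in_fresh_interpreter(0)
fixed_b = string_hash_in_fresh_interpreter(0)
print("same hash with PYTHONHASHSEED=0:", fixed_a == fixed_b)

# Without a pinned seed, the two values will almost always differ.
random_a = string_hash_in_fresh_interpreter()
random_b = string_hash_in_fresh_interpreter()
print("same hash with randomized seed:", random_a == random_b)
```

The same reasoning applies to executors: operations that partition or count by key rely on every Python worker hashing a given string to the same value.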

greebie commented 6 years ago

Would appreciate a review of the following code to test against the above code before I start running the scripts at scale. Thanks!

import RecordLoader
from DFTransformations import *
from ExtractDomain import ExtractDomain
from ExtractLinks import ExtractLinks
from ExtractDate import DateComponent
from ExtractImageLinks import ExtractImageLinks
from RemoveHTML import RemoveHTML
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# replace with your own path to archive file
path = "/shared/au/example.arc.gz"
spark = SparkSession.builder.appName("filterByDate").getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("WARN")

# re is used in test3; import it explicitly rather than relying on a wildcard import
import re
import time

def timed(func):
    start = time.time()
    func()
    end = time.time()
    print("Elapsed Time: " + str((end - start)*1000))

def test1():
    print("Get urls and count, taking 3.")
    df = RecordLoader.loadArchivesAsDF(path, sc, spark)
    df.groupBy("domain").count().sort(desc("count")).show(n=3)

def test2():
    print("Get Hyperlinks from text and site and count, filtering out counts < 5,  taking 3.")
    rdd = RecordLoader.loadArchivesAsRDD(path, sc, spark)\
          .flatMap(lambda r: ExtractLinks(r.url, r.contentString))\
          .map(lambda r: (ExtractDomain(r[0]), ExtractDomain(r[1])))\
          .filter(lambda r: r[0] is not None and r[0]!= "" and r[1] is not None and r[1] != "")
    print(countItems(rdd).filter(lambda r: r[1] > 5).take(3))

def test3():
    print ("Get links from text and site, group by date and count, filtering out counts < 5,  taking 3.")
    df = RecordLoader.loadArchivesAsDF(path, sc, spark)
    fdf = df.select(df['crawlDate'], df['url'], df['contentString'])
    rdd = fdf.rdd
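    # Mirror the Scala test: drop missing or empty domains, strip only a leading
    # "www." (same pattern as the Scala replaceAll), and keep counts above 5.
    rddx = rdd.map(lambda r: (r.crawlDate, ExtractLinks(r.url, r.contentString)))\
     .flatMap(lambda r: map(lambda f: (r[0], ExtractDomain(f[0]), ExtractDomain(f[1])), r[1]))\
     .filter(lambda r: r[1] is not None and r[2] is not None)\
     .map(lambda r: (r[0], re.sub(r'^\s*www\.', '', r[1]), re.sub(r'^\s*www\.', '', r[2])))\
     .filter(lambda r: r[1] != "" and r[2] != "")
    print(countItems(rddx).filter(lambda r: r[1] > 5).take(3))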

def test4():
    print("Extract text, taking 3 examples.")
    text = RecordLoader.loadArchivesAsRDD(path, sc, spark)\
      .map(lambda r: (r.crawlDate, r.domain, r.url, RemoveHTML(r.contentString)))
    print(text.take(3))

def test5():
    print ("Extract image urls, taking 3.")
    images = RecordLoader.loadArchivesAsRDD(path, sc, spark)\
      .flatMap(lambda r: ExtractImageLinks(r.url, r.contentString))
    print(countItems(images).take(3))

if __name__ == '__main__':
    for _ in range(5):
        timed(test1)
        timed(test2)
        timed(test3)
        timed(test4)
        timed(test5)
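
One point worth checking when reviewing the www-prefix stripping in test3: a greedy pattern with an unescaped dot can remove more than a leading "www.", whereas an anchored pattern with an escaped dot matches the Scala `^\\s*www\\.`. The sketch below compares the two on sample domains; the helper names and the second domain are made up for illustration.

```python
import re


def strip_greedy(domain):
    """Greedy, unescaped pattern: removes everything up to the last 'www' plus one character."""
    return re.sub(r'^.*www.', '', domain)


def strip_anchored(domain):
    """Anchored pattern: removes only a leading 'www.' (optionally preceded by whitespace)."""
    return re.sub(r'^\s*www\.', '', domain)


for sample in ["www.archive.org", "static.awwwards.com"]:
    print(sample, "->", strip_greedy(sample), "|", strip_anchored(sample))
```

Both helpers agree on `www.archive.org`, which is why the sample output looks the same either way, but they diverge on any domain that merely contains `www` somewhere in the middle.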

greebie commented 6 years ago

The script produces:

Get urls and count, taking 3.
+------------------+-----+
|            domain|count|
+------------------+-----+
|   www.archive.org|  132|
|     deadlists.com|    2|
|www.hideout.com.br|    1|
+------------------+-----+

Elapsed Time: 8.73497486114502
Get Hyperlinks from text and site and count, filtering out counts < 5,  taking 3.
[(('www.archive.org', 'www.archive.org'), 304), (('www.archive.org', 'wiki.etree.org'), 21), (('www.archive.org', 'creativecommons.org'), 12)]
Elapsed Time: 15.355172872543335
Get links from text and site, group by date and count, filtering out counts < 5,  taking 3.
[(('20080430', 'archive.org', 'archive.org'), 316), (('20080430', 'archive.org', 'wiki.etree.org'), 21), (('20080430', 'archive.org', 'creativecommons.org'), 12)]
Elapsed Time: 12.715660572052002
Extract text, taking 3 examples.
[('20080430', 'www.archive.org', 'http://www.archive.org/', ' document.location="http://www.archive.org/index.php"; Please visit our website at: http://www.archive.org '), ('20080430', 'www.archive.org', 'http://www.archive.org/index.php', '  Internet Archive Web |  Moving Images |  Texts |  Audio |  Software |  Education |  Patron Info |  About IA  Forums | FAQs | Contributions | Jobs | Donate Search: All Media Types \xa0\xa0Wayback Machine \xa0\xa0Moving Images \xa0\xa0\xa0\xa0Animation & Cartoons \xa0\xa0\xa0\xa0Arts & Music \xa0\xa0\xa0\xa0Computers & Technology \xa0\xa0\xa0\xa0Cultural & Academic Films \xa0\xa0\xa0\xa0Ephemeral Films \xa0\xa0\xa0\xa0Movies \xa0\xa0\xa0\xa0News & Public Affairs \xa0\xa0\xa0\xa0Non-English Videos \xa0\xa0\xa0\xa0Open Source Movies \xa0\xa0\xa0\xa0Prelinger Archives \xa0\xa0\xa0\xa0Sports Videos \xa0\xa0\xa0\xa0Video Games \xa0\xa0\xa0\xa0Vlogs \xa0\xa0\xa0\xa0Youth Media \xa0\xa0Texts \xa0\xa0\xa0\xa0American Libraries \xa0\xa0\xa0\xa0Canadian Libraries \xa0\xa0\xa0\xa0Open Source Books \xa0\xa0\xa0\xa0Project Gutenberg \xa0\xa0\xa0\xa0Biodiversity Heritage Library \xa0\xa0\xa0\xa0Children\'s Library \xa0\xa0\xa0\xa0Additional Collections \xa0\xa0Audio \xa0\xa0\xa0\xa0Audio Books & Poetry \xa0\xa0\xa0\xa0Computers & Technology \xa0\xa0\xa0\xa0Grateful Dead \xa0\xa0\xa0\xa0Live Music Archive \xa0\xa0\xa0\xa0Music & Arts \xa0\xa0\xa0\xa0Netlabels \xa0\xa0\xa0\xa0News & Public Affairs \xa0\xa0\xa0\xa0Non-English Audio \xa0\xa0\xa0\xa0Open Source Audio \xa0\xa0\xa0\xa0Podcasts \xa0\xa0\xa0\xa0Radio Programs \xa0\xa0\xa0\xa0Spirituality & Religion \xa0\xa0Software \xa0\xa0\xa0\xa0CLASP \xa0\xa0\xa0\xa0Tucows Software Library \xa0\xa0Education Forums FAQs   UploadAnonymous User (login or join us)\xa0 \xa0\xa0 Announcements (more)   Free Ultra High-Speed Internet to Public Housing    Rise of the HighTech Non-Profits    Zotero and Internet Archive join forces  \xa0\xa0 Web85 billion pages Advanced Search \xa0\xa0 Welcome to the Archive 
The Internet Archive is building a digital library of Internet    sites and other cultural artifacts in digital form. Like a paper    library, we provide free access to researchers, historians,    scholars, and the general public. \xa0\xa0 \xa0\xa0 Moving Images\xa0115,646 movies Browse\xa0\xa0                      (by keyword) \xa0\xa0 Live Music Archive\xa048,893 concerts Browse\xa0\xa0                      (by band) \xa0\xa0 Audio\xa0250,854 recordings Browse\xa0\xa0                      (by keyword) \xa0\xa0 Texts\xa0395,004 texts Browse\xa0\xa0                      (by keyword) \xa0\xa0 \xa0\xa0 Curator\'s Choice (more)  A Few Good G-MenRandall Glass, the maker of "Warthog Jump," re-creates in "A Few Good G-Men" an entire scene from... \xa0\xa0 Curator\'s Choice (more)  Grateful Dead Live at Nashville Municipal...Set 1 Sugaree Beat It On Down The Line Candyman Me And My Uncle -> Big River Stagger Lee Looks Like... \xa0\xa0 Curator\'s Choice (more)  Zanstones - Slaakhuis: Live in Rotterdam, HollandZanstones confuses the dutch masses with this live display of wacked rhythms, whacked vocals, and... \xa0\xa0 Curator\'s Choice (more)  Secret armies; the new technique of Nazi warfare \xa0\xa0 \xa0\xa0 Recent Reviews Code4Lib 2008: Can Resource Description become Rigorous Data?Average rating:  Madonna adopts African baby.Average rating:  \xa0\xa0 Recent Reviews Carolina Chocolate Drops Live at MerleFest on 2007-04-27Average rating:  Grateful Dead Live at Oakland-Alameda County Coliseum on 1988-12-28Average rating:  \xa0\xa0 Recent Reviews No ThoroughfareAverage rating:  JAHTARI RIDDIM FORCE - Farmer In The Sky / Depth ChargeAverage rating:  \xa0\xa0 Recent Reviews A manual of chemical analysis, qualitative and quantitativeAverage rating:  Chemical lecture experiments; non-metallic elementsAverage rating:  \xa0\xa0 \xa0\xa0 Most recent posts (write a post by going to a forum) more...  Subject Poster Forum RepliesViewsDate   Re: Making a mix for a chick I know...  
William Tell  GratefulDead  0  6  20 minutes ago    Re: Bob\'s shorts not going into archives  BobsShortShorts  GratefulDead  0  9  26 minutes ago    Re: Thanks to All  airgarcia416  GratefulDead  0  5  26 minutes ago    Re: Bob\'s shorts not going into archives  sydthecat2  GratefulDead  0  8  36 minutes ago    Re: What is the worst-reviewed feature film on IA?  RipJarvis  feature_films  0  9  50 minutes ago    Re: Playin\' In The Band...all day and all night  sydthecat2  GratefulDead  0  11  58 minutes ago    Re: Playin\' In The Band...all day and all night  rastamon  GratefulDead  0  16  1 hour ago    Re: Making a mix for a chick I know...  caspersvapors  GratefulDead  1  11  1 hour ago    Re: Bob\'s shorts not going into archives  rastamon  GratefulDead  0  11  1 hour ago    Re: Bob\'s shorts not going into archives  bluedevil  GratefulDead  1  13  1 hour ago  \xa0\xa0 \xa0Institutional Support Alexa Internet HP Computer The Kahle/Austin Foundation Prelinger Archives National Science Foundation Library of Congress LizardTech Sloan Foundation Individual      contributors \xa0 Skin: classic | columns | custom! Terms of Use (10 Mar 2001) '), ('20080430', 'www.archive.org', 'http://www.archive.org/details/DrinkingWithBob-MadonnaAdoptsAfricanBaby887', " Internet Archive: Details: Madonna adopts African baby. 
Web |  Moving Images |  Texts |  Audio |  Software |  Education |  Patron Info |  About IA    Home Animation & Cartoons |  Arts & Music |  Computers & Technology |  Cultural & Academic Films |  Ephemeral Films |  Movies |  News & Public Affairs |  Non-English Videos |  Open Source Movies |  Prelinger Archives |  Sports Videos |  Video Games |  Vlogs |  Youth Media Search: All Media Types \xa0\xa0Wayback Machine \xa0\xa0Moving Images \xa0\xa0\xa0\xa0Animation & Cartoons \xa0\xa0\xa0\xa0Arts & Music \xa0\xa0\xa0\xa0Computers & Technology \xa0\xa0\xa0\xa0Cultural & Academic Films \xa0\xa0\xa0\xa0Ephemeral Films \xa0\xa0\xa0\xa0Movies \xa0\xa0\xa0\xa0News & Public Affairs \xa0\xa0\xa0\xa0Non-English Videos \xa0\xa0\xa0\xa0Open Source Movies \xa0\xa0\xa0\xa0Prelinger Archives \xa0\xa0\xa0\xa0Sports Videos \xa0\xa0\xa0\xa0Video Games \xa0\xa0\xa0\xa0Vlogs \xa0\xa0\xa0\xa0Youth Media \xa0\xa0Texts \xa0\xa0\xa0\xa0American Libraries \xa0\xa0\xa0\xa0Canadian Libraries \xa0\xa0\xa0\xa0Open Source Books \xa0\xa0\xa0\xa0Project Gutenberg \xa0\xa0\xa0\xa0Biodiversity Heritage Library \xa0\xa0\xa0\xa0Children's Library \xa0\xa0\xa0\xa0Additional Collections \xa0\xa0Audio \xa0\xa0\xa0\xa0Audio Books & Poetry \xa0\xa0\xa0\xa0Computers & Technology \xa0\xa0\xa0\xa0Grateful Dead \xa0\xa0\xa0\xa0Live Music Archive \xa0\xa0\xa0\xa0Music & Arts \xa0\xa0\xa0\xa0Netlabels \xa0\xa0\xa0\xa0News & Public Affairs \xa0\xa0\xa0\xa0Non-English Audio \xa0\xa0\xa0\xa0Open Source Audio \xa0\xa0\xa0\xa0Podcasts \xa0\xa0\xa0\xa0Radio Programs \xa0\xa0\xa0\xa0Spirituality & Religion \xa0\xa0Software \xa0\xa0\xa0\xa0CLASP \xa0\xa0\xa0\xa0Tucows Software Library \xa0\xa0Education Forums FAQs Advanced Search UploadAnonymous User (login or join us)\xa0    .grayBack { background-color: #D8DEDE; }   .lightBack { background-color: #339933; }   .lightBorder { border: 2px solid #339933; }   .darkBack { background-color: #115500; }   .darkBorder { border: 2px solid #115500; }   .darkFore { color: #115500; }   
h1 { background-color: #115500; }   h3 { background-color: #339933; }   h2 { background-color: #D8DEDE;        color: #115500; } View movie View thumbnails       Run time: 00:01:37Play / Download          (help)     Quicktime     (1.3 MB)   All files: FTP HTTPResources       Bookmark       Report errorsMadonna adopts African baby.             Internet Archive's in-browser video player requires              JavaScript to be enabled.  It appears your browser does not have it             turned on.  Please see your browser settings for this feature.            embedding and helpMadonna is an arrogant, publicity hungry, piece of trash!!!This item is part of the collection: blip.tv             Write a review                Reviews          Downloaded       61        times          Average Rating:     Reviewer: _sprout -            -            April 27, 2008Subject: Madonna is a washed up hag trying to keep her name in the papers+5 stars because I agree with your general statement that these 'exotic' kids are like exotic pets for rich people and celebs to show off.-2 stars because this sort of thing is better suited to Youtube.Reviewer: XXXmoan -            -            April 25, 2008Subject: are You freakin SeriousWhat the fuck! who cares if she goes to adopt an african, thats none of your business. you need to chill like that other bitch who claims that she doesn't care that madonna fell off a horse. so my question is.............................................................................what the fuck your problem Terms of Use (10 Mar 2001) ")]
Elapsed Time: 0.37613391876220703
Extract image urls, taking 3.
[('http://www.archive.org/images/star.png', 408), ('http://www.archive.org/images/no_star.png', 122), ('http://www.archive.org/images/logo.jpg', 118)]
Elapsed Time: 9.878446817398071

greebie commented 6 years ago

First test samples updated above. Tests 2 and 3 seem to be taking longer than expected. I am not sure whether this is a design issue or a scale issue. We will need to look at these at scale; I will upload results soon.

ianmilligan1 commented 6 years ago

This all looks good to me, Ryan, seems consistent with the Scala ones. Looking forward to your thoughts!

ianmilligan1 commented 6 years ago

Benchmarking revealed #130 – @greebie and I have been looking into fixing these serious performance issues.

greebie commented 6 years ago

Closing for now, as this depends on a resolution to #130. Once #130 is done, we can redo the numbers and see what changes.

eromoe commented 4 years ago

Did the Python version use pyarrow and pandas_udf? I have seen many articles saying that PySpark gets a performance boost with that setup.

archivesunleashed / aut

Benchmarking Scala vs Python #121