Closed ianmilligan1 closed 6 years ago
import io.archivesunleashed.spark.matchbox._
import io.archivesunleashed.spark.rdd.RecordRDD._
import StringUtils._
import java.time.Instant
def timed(f: => Unit) = {
  val start = System.currentTimeMillis()
  f
  val end = System.currentTimeMillis()
  println("Elapsed Time: " + (end - start))
}

timed {
  println("Get urls and count, taking 3.")
  val r = RecordLoader.loadArchives("/shared/au/example.arc.gz", sc)
    .map(r => ExtractDomain(r.getUrl))
    .countItems()
  println(r.take(3).deep.mkString("\n"))
}

timed {
  println("Get Hyperlinks from text and site and count, filtering out counts < 5, taking 3.")
  val links = RecordLoader.loadArchives("/shared/au/example.arc.gz", sc)
    .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
    .map(r => (ExtractDomain(r._1).removePrefixWWW(), ExtractDomain(r._2).removePrefixWWW()))
    .filter(r => r._1 != "" && r._2 != "")
    .countItems()
    .filter(r => r._2 > 5)
  println(links.take(3).deep.mkString("\n"))
}

timed {
  println("Get links from text and site, group by date and count, filtering out counts < 5, taking 3.")
  val crawlDateGroup = RecordLoader.loadArchives("/shared/au/example.arc.gz", sc)
    .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
    .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
    .filter(r => r._2 != "" && r._3 != "")
    .countItems()
    .filter(r => r._2 > 5)
  println(crawlDateGroup.take(3).deep.mkString("\n"))
}

timed {
  println("Extract text, taking 3 examples.")
  val text = RecordLoader.loadArchives("/shared/au/example.arc.gz", sc)
    .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  println(text.take(3).deep.mkString("\n"))
}

timed {
  println("Extract image urls, taking 3.")
  val images = RecordLoader.loadArchives("/shared/au/example.arc.gz", sc)
    .flatMap(r => ExtractImageLinks(r.getUrl, r.getContentString))
    .countItems()
  println(images.take(3).deep.mkString("\n"))
}
@ianmilligan1 @greebie can you walk me through your plan for this testing? Bulleted or numbered instructions are good. I see 6 things here in the script above. These are the Scala scripts, I presume. Then we're doing the same again with the Python equivalents?
My understanding from my conversations with Ryan is the following:
Then
Ian has it right. The only thing I would add is that we want to do it for small, medium and perhaps a large script, to see how they scale. (The large script may take too long and use up too many resources for what we have now.)
These three are probably the best to test for small, medium, and large.
2G
/shared/au/university-of-alberta-libraries-401/ottawa-shooting-october-2014-4957
90G
/shared/au/university-of-alberta-libraries-401/fort-mcmurray-wildfire-2016-7368
828G
/shared/au/university-of-alberta-libraries-401/prairie-provinces-2402
If you can provide me with some sample output of the Scala script above, and of the Python script when you're ready, that'd be super helpful.
Excellent, thanks @ruebot.
@greebie – can you run the Scala scripts on Ottawa and share the output here (via Gist if it's overly long).
I'm working on the Python scripts right now.
Results for YMM Fires and Example. All other aspects of the script check out (same output in each case).
EXAMPLE trial (times in ms) | Url count, Scala | Url count, Python | Hyperlinks Structure, Scala | Hyperlinks Structure, Python | Dated Hyperlinks Structure, Scala | Dated Hyperlinks Structure, Python |
---|---|---|---|---|---|---|
trial 1 | 4214 | 8367 | 4313 | 15341 | 4579 | 12760 |
2 | 657 | 5006 | 1279 | 12594 | 1353 | 12487 |
3 | 618 | 3722 | 1164 | 12979 | 1128 | 12838 |
4 | 509 | 1145 | 1133 | 12415 | 1145 | 14471 |
5 | 577 | 2021 | 1190 | 13387 | 1547 | 12466 |
EXAMPLE trial (times in ms) | Extract Text, Scala | Extract Text, Python | Image Urls, Scala | Image Urls, Python |
---|---|---|---|---|
trial 1 | 520 | 533 | 846 | 10474 |
2 | 238 | 3721 | 1288 | 12829 |
3 | 129 | 336 | 688 | 9620 |
4 | 127 | 314 | 666 | 9806 |
5 | 232 | 287 | 870 | 9639 |
OTTAWA SHOOTING trial (times in ms) | Url count, Scala | Url count, Python | Hyperlinks Structure, Scala | Hyperlinks Structure, Python | Dated Hyperlinks Structure, Scala | Dated Hyperlinks Structure, Python |
---|---|---|---|---|---|---|
trial 1 | 99439 | 116523 | 184099 | 2223305 | 4579 | 2237966 |
2 | 96016 | 116078 | 184658 | 2152270 | 187146 | 2196058 |
3 | 96875 | 113088 | 185923 | 2181362 | 188128 | 2221024 |
4 | 95334 | 116071 | 183133 | 2184187 | 181261 | 2251721 |
5 | 97436 | 109904 | 180737 | 2142623 | 178208 | 2223439 |
OTTAWA SHOOTING trial (times in ms) | Extract Text, Scala | Extract Text, Python | Image Urls, Scala | Image Urls, Python |
---|---|---|---|---|
trial 1 | 722 | 715 | 152187 | 1716.842 |
2 | 522 | 704 | 148542 | 1752192 |
3 | 315 | 576 | 148809 | 1745742 |
4 | 532 | 643 | 152651 | 1800866 |
5 | 302 | 530 | 143359 | 1745503 |
YMM FIRES trial (times in ms) | Url count, Scala | Url count, Python | Hyperlinks Structure, Scala | Hyperlinks Structure, Python | Dated Hyperlinks Structure, Scala | Dated Hyperlinks Structure, Python |
---|---|---|---|---|---|---|
trial 1 | 268930 | 289017 | 276206 | 1114322 | 281291 | 1091865 |
2 | 222069 | 268609 | 283345 | 1115210 | 306255 | 1099671 |
3 | 214435 | 227713 | 293025 | 1107478 | 287533 | 1107078 |
4 | 205357 | 246269 | 279142 | 1106769 | 279851 | 1088578 |
5 | 233336 | 210771 | 266573 | 1109853 | 286223 | 1115048 |
YMM FIRES trial (times in ms) | Extract Text, Scala | Extract Text, Python | Image Urls, Scala | Image Urls, Python |
---|---|---|---|---|
trial 1 | 262 | 570 | 254171 | 948225 |
2 | 240 | 884 | 247527 | 939791 |
3 | 197 | 431 | 250581 | 958162 |
4 | 238 | 394 | 251905 | 968368 |
5 | 285 | 452 | 303458 | 952961 |
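A minimal sketch of how the five trials could be summarized, assuming the figures are elapsed milliseconds as printed by the timed helpers. It uses the EXAMPLE "Url count" columns from the tables above; the summarize helper is a hypothetical name, not part of the toolkit. The median is used because trial 1 is much slower than the later trials in several columns.

```python
from statistics import median

# EXAMPLE "Url count" timings copied from the tables above (milliseconds).
scala_ms = [4214, 657, 618, 509, 577]
python_ms = [8367, 5006, 3722, 1145, 2021]


def summarize(scala, python):
    """Return (median Scala, median Python, Python/Scala ratio)."""
    s, p = median(scala), median(python)
    return s, p, p / s


s_med, p_med, ratio = summarize(scala_ms, python_ms)
print(f"Scala median: {s_med} ms, Python median: {p_med} ms, ratio: {ratio:.1f}x")
```

The same helper could be applied to each column of each collection to see whether the Python/Scala ratio changes with collection size.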
@greebie I was looking for the raw output of the script.
Output here: https://gist.github.com/greebie/e93dae5ba0869de43ef1d635c5bad0ce
If you are looking for a --verbose output let me know. I am running a big job right now unfortunately, but will get one out ASAP.
Python instructions are at https://gist.github.com/ianmilligan1/1436be06a5d2293bf3b6447493c962c3.
Encountered this lovely problem: https://www.thoughtvector.io/blog/python-3-on-spark-return-of-the-pythonhashseed/
I am going to try and avoid using dictionary-like processes for now. I think we should remove dictionary-like processes like .countByValue() from the docs as well.
The solution is to run
export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0
This is a problem with Spark 2.1.1 and will be fixed in Spark 2.2.
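For context, a small sketch of what PYTHONHASHSEED controls. Python 3 randomizes string hashes per interpreter process by default, and Spark's Python workers are separate processes, so operations that group keys by hash need every worker to agree. The example below does not use Spark at all; it only starts fresh interpreters with the variable set and compares the hash of one string. The string_hash helper and the 'example.com' key are illustrative assumptions.

```python
import os
import subprocess
import sys


def string_hash(seed):
    """Hash the same string in a fresh interpreter started with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('example.com'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)


# With a fixed seed, separate processes agree on the hash value.
fixed = {string_hash(0) for _ in range(3)}
print("distinct hashes with PYTHONHASHSEED=0:", len(fixed))
```

Passing "random" instead of 0 restores the default randomized behaviour, in which case the set would usually contain more than one value.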
Would appreciate a review of the following code to test against the above code before I start running the scripts at scale. Thanks!
import re
import time

import RecordLoader
from DFTransformations import *
from ExtractDomain import ExtractDomain
from ExtractLinks import ExtractLinks
from ExtractDate import DateComponent
from ExtractImageLinks import ExtractImageLinks
from RemoveHTML import RemoveHTML
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# replace with your own path to archive file
path = "/shared/au/example.arc.gz"
spark = SparkSession.builder.appName("filterByDate").getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("WARN")


def timed(func):
    # Run func once and print the elapsed wall-clock time in milliseconds.
    start = time.time()
    func()
    end = time.time()
    print("Elapsed Time: " + str((end - start) * 1000))


def test1():
    print("Get urls and count, taking 3.")
    df = RecordLoader.loadArchivesAsDF(path, sc, spark)
    df.groupBy("domain").count().sort(desc("count")).show(n=3)


def test2():
    print("Get Hyperlinks from text and site and count, filtering out counts < 5, taking 3.")
    rdd = RecordLoader.loadArchivesAsRDD(path, sc, spark)\
        .flatMap(lambda r: ExtractLinks(r.url, r.contentString))\
        .map(lambda r: (ExtractDomain(r[0]), ExtractDomain(r[1])))\
        .filter(lambda r: r[0] is not None and r[0] != "" and r[1] is not None and r[1] != "")
    print(countItems(rdd).filter(lambda r: r[1] > 5).take(3))


def test3():
    print("Get links from text and site, group by date and count, filtering out counts < 5, taking 3.")
    df = RecordLoader.loadArchivesAsDF(path, sc, spark)
    fdf = df.select(df['crawlDate'], df['url'], df['contentString'])
    rdd = fdf.rdd
    # Drop rows with a missing domain, then strip a leading "www." using an
    # anchored, escaped pattern equivalent to the Scala "^\\s*www\\.".
    rddx = rdd.map(lambda r: (r.crawlDate, ExtractLinks(r.url, r.contentString)))\
        .flatMap(lambda r: map(lambda f: (r[0], ExtractDomain(f[0]), ExtractDomain(f[1])), r[1]))\
        .filter(lambda r: r[1] is not None and r[2] is not None)\
        .map(lambda r: (r[0], re.sub(r'^\s*www\.', '', r[1]), re.sub(r'^\s*www\.', '', r[2])))
    # Apply the same count filter as the Scala script and test2.
    print(countItems(rddx).filter(lambda r: r[1] > 5).take(3))


def test4():
    print("Extract text, taking 3 examples.")
    text = RecordLoader.loadArchivesAsRDD(path, sc, spark)\
        .map(lambda r: (r.crawlDate, r.domain, r.url, RemoveHTML(r.contentString)))
    print(text.take(3))


def test5():
    print("Extract image urls, taking 3.")
    images = RecordLoader.loadArchivesAsRDD(path, sc, spark)\
        .flatMap(lambda r: ExtractImageLinks(r.url, r.contentString))
    print(countItems(images).take(3))


if __name__ == '__main__':
    # Five trials of each test, matching the tables above.
    for _ in range(5):
        timed(test1)
        timed(test2)
        timed(test3)
        timed(test4)
        timed(test5)
The script produces:
Get urls and count, taking 3.
+------------------+-----+
| domain|count|
+------------------+-----+
| www.archive.org| 132|
| deadlists.com| 2|
|www.hideout.com.br| 1|
+------------------+-----+
Elapsed Time: 8.73497486114502
Get Hyperlinks from text and site and count, filtering out counts < 5, taking 3.
[(('www.archive.org', 'www.archive.org'), 304), (('www.archive.org', 'wiki.etree.org'), 21), (('www.archive.org', 'creativecommons.org'), 12)]
Elapsed Time: 15.355172872543335
Get links from text and site, group by date and count, filtering out counts < 5, taking 3.
[(('20080430', 'archive.org', 'archive.org'), 316), (('20080430', 'archive.org', 'wiki.etree.org'), 21), (('20080430', 'archive.org', 'creativecommons.org'), 12)]
Elapsed Time: 12.715660572052002
Extract text, taking 3 examples.
[('20080430', 'www.archive.org', 'http://www.archive.org/', ' document.location="http://www.archive.org/index.php"; Please visit our website at: http://www.archive.org '), ('20080430', 'www.archive.org', 'http://www.archive.org/index.php', ' Internet Archive Web | Moving Images | Texts | Audio | Software | Education | Patron Info | About IA Forums | FAQs | Contributions | Jobs | Donate Search: All Media Types \xa0\xa0Wayback Machine \xa0\xa0Moving Images \xa0\xa0\xa0\xa0Animation & Cartoons \xa0\xa0\xa0\xa0Arts & Music \xa0\xa0\xa0\xa0Computers & Technology \xa0\xa0\xa0\xa0Cultural & Academic Films \xa0\xa0\xa0\xa0Ephemeral Films \xa0\xa0\xa0\xa0Movies \xa0\xa0\xa0\xa0News & Public Affairs \xa0\xa0\xa0\xa0Non-English Videos \xa0\xa0\xa0\xa0Open Source Movies \xa0\xa0\xa0\xa0Prelinger Archives \xa0\xa0\xa0\xa0Sports Videos \xa0\xa0\xa0\xa0Video Games \xa0\xa0\xa0\xa0Vlogs \xa0\xa0\xa0\xa0Youth Media \xa0\xa0Texts \xa0\xa0\xa0\xa0American Libraries \xa0\xa0\xa0\xa0Canadian Libraries \xa0\xa0\xa0\xa0Open Source Books \xa0\xa0\xa0\xa0Project Gutenberg \xa0\xa0\xa0\xa0Biodiversity Heritage Library \xa0\xa0\xa0\xa0Children\'s Library \xa0\xa0\xa0\xa0Additional Collections \xa0\xa0Audio \xa0\xa0\xa0\xa0Audio Books & Poetry \xa0\xa0\xa0\xa0Computers & Technology \xa0\xa0\xa0\xa0Grateful Dead \xa0\xa0\xa0\xa0Live Music Archive \xa0\xa0\xa0\xa0Music & Arts \xa0\xa0\xa0\xa0Netlabels \xa0\xa0\xa0\xa0News & Public Affairs \xa0\xa0\xa0\xa0Non-English Audio \xa0\xa0\xa0\xa0Open Source Audio \xa0\xa0\xa0\xa0Podcasts \xa0\xa0\xa0\xa0Radio Programs \xa0\xa0\xa0\xa0Spirituality & Religion \xa0\xa0Software \xa0\xa0\xa0\xa0CLASP \xa0\xa0\xa0\xa0Tucows Software Library \xa0\xa0Education Forums FAQs UploadAnonymous User (login or join us)\xa0 \xa0\xa0 Announcements (more) Free Ultra High-Speed Internet to Public Housing Rise of the HighTech Non-Profits Zotero and Internet Archive join forces \xa0\xa0 Web85 billion pages Advanced Search \xa0\xa0 Welcome to the Archive The Internet Archive 
is building a digital library of Internet sites and other cultural artifacts in digital form. Like a paper library, we provide free access to researchers, historians, scholars, and the general public. \xa0\xa0 \xa0\xa0 Moving Images\xa0115,646 movies Browse\xa0\xa0 (by keyword) \xa0\xa0 Live Music Archive\xa048,893 concerts Browse\xa0\xa0 (by band) \xa0\xa0 Audio\xa0250,854 recordings Browse\xa0\xa0 (by keyword) \xa0\xa0 Texts\xa0395,004 texts Browse\xa0\xa0 (by keyword) \xa0\xa0 \xa0\xa0 Curator\'s Choice (more) A Few Good G-MenRandall Glass, the maker of "Warthog Jump," re-creates in "A Few Good G-Men" an entire scene from... \xa0\xa0 Curator\'s Choice (more) Grateful Dead Live at Nashville Municipal...Set 1 Sugaree Beat It On Down The Line Candyman Me And My Uncle -> Big River Stagger Lee Looks Like... \xa0\xa0 Curator\'s Choice (more) Zanstones - Slaakhuis: Live in Rotterdam, HollandZanstones confuses the dutch masses with this live display of wacked rhythms, whacked vocals, and... \xa0\xa0 Curator\'s Choice (more) Secret armies; the new technique of Nazi warfare \xa0\xa0 \xa0\xa0 Recent Reviews Code4Lib 2008: Can Resource Description become Rigorous Data?Average rating: Madonna adopts African baby.Average rating: \xa0\xa0 Recent Reviews Carolina Chocolate Drops Live at MerleFest on 2007-04-27Average rating: Grateful Dead Live at Oakland-Alameda County Coliseum on 1988-12-28Average rating: \xa0\xa0 Recent Reviews No ThoroughfareAverage rating: JAHTARI RIDDIM FORCE - Farmer In The Sky / Depth ChargeAverage rating: \xa0\xa0 Recent Reviews A manual of chemical analysis, qualitative and quantitativeAverage rating: Chemical lecture experiments; non-metallic elementsAverage rating: \xa0\xa0 \xa0\xa0 Most recent posts (write a post by going to a forum) more... Subject Poster Forum RepliesViewsDate Re: Making a mix for a chick I know... 
William Tell GratefulDead 0 6 20 minutes ago Re: Bob\'s shorts not going into archives BobsShortShorts GratefulDead 0 9 26 minutes ago Re: Thanks to All airgarcia416 GratefulDead 0 5 26 minutes ago Re: Bob\'s shorts not going into archives sydthecat2 GratefulDead 0 8 36 minutes ago Re: What is the worst-reviewed feature film on IA? RipJarvis feature_films 0 9 50 minutes ago Re: Playin\' In The Band...all day and all night sydthecat2 GratefulDead 0 11 58 minutes ago Re: Playin\' In The Band...all day and all night rastamon GratefulDead 0 16 1 hour ago Re: Making a mix for a chick I know... caspersvapors GratefulDead 1 11 1 hour ago Re: Bob\'s shorts not going into archives rastamon GratefulDead 0 11 1 hour ago Re: Bob\'s shorts not going into archives bluedevil GratefulDead 1 13 1 hour ago \xa0\xa0 \xa0Institutional Support Alexa Internet HP Computer The Kahle/Austin Foundation Prelinger Archives National Science Foundation Library of Congress LizardTech Sloan Foundation Individual contributors \xa0 Skin: classic | columns | custom! Terms of Use (10 Mar 2001) '), ('20080430', 'www.archive.org', 'http://www.archive.org/details/DrinkingWithBob-MadonnaAdoptsAfricanBaby887', " Internet Archive: Details: Madonna adopts African baby. 
Web | Moving Images | Texts | Audio | Software | Education | Patron Info | About IA Home Animation & Cartoons | Arts & Music | Computers & Technology | Cultural & Academic Films | Ephemeral Films | Movies | News & Public Affairs | Non-English Videos | Open Source Movies | Prelinger Archives | Sports Videos | Video Games | Vlogs | Youth Media Search: All Media Types \xa0\xa0Wayback Machine \xa0\xa0Moving Images \xa0\xa0\xa0\xa0Animation & Cartoons \xa0\xa0\xa0\xa0Arts & Music \xa0\xa0\xa0\xa0Computers & Technology \xa0\xa0\xa0\xa0Cultural & Academic Films \xa0\xa0\xa0\xa0Ephemeral Films \xa0\xa0\xa0\xa0Movies \xa0\xa0\xa0\xa0News & Public Affairs \xa0\xa0\xa0\xa0Non-English Videos \xa0\xa0\xa0\xa0Open Source Movies \xa0\xa0\xa0\xa0Prelinger Archives \xa0\xa0\xa0\xa0Sports Videos \xa0\xa0\xa0\xa0Video Games \xa0\xa0\xa0\xa0Vlogs \xa0\xa0\xa0\xa0Youth Media \xa0\xa0Texts \xa0\xa0\xa0\xa0American Libraries \xa0\xa0\xa0\xa0Canadian Libraries \xa0\xa0\xa0\xa0Open Source Books \xa0\xa0\xa0\xa0Project Gutenberg \xa0\xa0\xa0\xa0Biodiversity Heritage Library \xa0\xa0\xa0\xa0Children's Library \xa0\xa0\xa0\xa0Additional Collections \xa0\xa0Audio \xa0\xa0\xa0\xa0Audio Books & Poetry \xa0\xa0\xa0\xa0Computers & Technology \xa0\xa0\xa0\xa0Grateful Dead \xa0\xa0\xa0\xa0Live Music Archive \xa0\xa0\xa0\xa0Music & Arts \xa0\xa0\xa0\xa0Netlabels \xa0\xa0\xa0\xa0News & Public Affairs \xa0\xa0\xa0\xa0Non-English Audio \xa0\xa0\xa0\xa0Open Source Audio \xa0\xa0\xa0\xa0Podcasts \xa0\xa0\xa0\xa0Radio Programs \xa0\xa0\xa0\xa0Spirituality & Religion \xa0\xa0Software \xa0\xa0\xa0\xa0CLASP \xa0\xa0\xa0\xa0Tucows Software Library \xa0\xa0Education Forums FAQs Advanced Search UploadAnonymous User (login or join us)\xa0 .grayBack { background-color: #D8DEDE; } .lightBack { background-color: #339933; } .lightBorder { border: 2px solid #339933; } .darkBack { background-color: #115500; } .darkBorder { border: 2px solid #115500; } .darkFore { color: #115500; } h1 { background-color: #115500; } h3 { 
background-color: #339933; } h2 { background-color: #D8DEDE; color: #115500; } View movie View thumbnails Run time: 00:01:37Play / Download (help) Quicktime (1.3 MB) All files: FTP HTTPResources Bookmark Report errorsMadonna adopts African baby. Internet Archive's in-browser video player requires JavaScript to be enabled. It appears your browser does not have it turned on. Please see your browser settings for this feature. embedding and helpMadonna is an arrogant, publicity hungry, piece of trash!!!This item is part of the collection: blip.tv Write a review Reviews Downloaded 61 times Average Rating: Reviewer: _sprout - - April 27, 2008Subject: Madonna is a washed up hag trying to keep her name in the papers+5 stars because I agree with your general statement that these 'exotic' kids are like exotic pets for rich people and celebs to show off.-2 stars because this sort of thing is better suited to Youtube.Reviewer: XXXmoan - - April 25, 2008Subject: are You freakin SeriousWhat the fuck! who cares if she goes to adopt an african, thats none of your business. you need to chill like that other bitch who claims that she doesn't care that madonna fell off a horse. so my question is.............................................................................what the fuck your problem Terms of Use (10 Mar 2001) ")]
Elapsed Time: 0.37613391876220703
Extract image urls, taking 3.
[('http://www.archive.org/images/star.png', 408), ('http://www.archive.org/images/no_star.png', 122), ('http://www.archive.org/images/logo.jpg', 118)]
Elapsed Time: 9.878446817398071
First test samples updated above. Tests 2 and 3 seem to be taking longer than expected. I am not sure whether this is a design issue or a scale issue. We will need to look at these at scale; I will upload results soon.
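For the runs at scale, one option is a timing helper that returns the per-trial figures instead of only printing them, so they can be pasted into a table or summarized later. This is a sketch only: timed_ms and fake_job are hypothetical names, and fake_job stands in for the real test functions, which need a Spark session.

```python
import time


def timed_ms(func, trials=5):
    """Run func `trials` times and return the elapsed time of each run in ms."""
    results = []
    for _ in range(trials):
        start = time.perf_counter()
        func()
        results.append((time.perf_counter() - start) * 1000)
    return results


def fake_job():
    # Stand-in workload; a real run would call test1 ... test5 instead.
    sum(range(100_000))


timings = timed_ms(fake_job, trials=3)
print(["%.2f" % t for t in timings])
```

time.perf_counter is used here rather than time.time because it is a monotonic clock intended for measuring intervals.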
This all looks good to me, Ryan, seems consistent with the Scala ones. Looking forward to your thoughts!
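One detail worth checking when comparing against Scala is the pattern used to strip the www. prefix. The sketch below contrasts a loose pattern such as ^.*www. with an anchored, escaped pattern that mirrors the Scala "^\\s*www\\.". The strip_www helper and the static.www.example.com host are made up for illustration.

```python
import re

LOOSE = r'^.*www.'      # greedy; "." matches any character
STRICT = r'^\s*www\.'   # mirrors the Scala pattern "^\\s*www\\."


def strip_www(domain, pattern):
    """Remove the www prefix from a domain using the given pattern."""
    return re.sub(pattern, '', domain)


for d in ['www.archive.org', 'archive.org', 'static.www.example.com']:
    print(d, '->', strip_www(d, LOOSE), '|', strip_www(d, STRICT))
```

Both patterns agree on www.archive.org, but the loose one also removes everything before an inner www, so the two scripts could count different domain pairs on larger collections.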
Benchmarking revealed #130 – @greebie and I have been looking into fixing these serious performance issues.
Closing for now as dependent on a resolution to #130. Once #130's done, we can redo the numbers and see what changes.
Did the Python version include pyarrow and pandas_udf? I have seen many articles saying that PySpark gets a performance boost with that setup.
We are comparing the timings of Scala vs Python to help us make informed decisions on the migration.