watchthatpage
is a command line program that may trigger mail notifications when some changes are detected on remote web pages. Monitored URLs are customizable (as many as you want) - in addition, some selectors may be configured in order to extract, if needed, only some relevant part. Notification mail are templated and are also configurable. Optionaly, some screenshots of website may be taken (with an external dependency, see examples).
WatchThatPage is a command line program used to trigger notifications when some HTML page contents is modified
Usage:
watchthatpage [command]
Available Commands:
clean Clean cached content
grab Grab pages
help Help about any command
version Print the version number of watchthatpage
Flags:
--configuration string Configuration file name. Default is binary name + .json (e.g. 'watchthatpage.json'), in the same folder than the binary itself (default "watchthatpage.json")
-h, --help help for watchthatpage
Use "watchthatpage [command] --help" for more information about a command.
Example of output :
Configuration file found under [watchthatpage.json], now loading content
Configuration loaded with [3] urls, gzip [false], minify [true], auto backup [true], generate screenshots [true], sections to skip [script footer meta style map img nav select form noscript]
Now parsing URL [https://www.bostonglobe.com/news/bigpicture]
Now parsing URL [https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Accueil_principal]
Now parsing URL [https://news.google.com/news/?ned=fr&gl=FR&hl=fr]
Results :
- [DIFF] URL [https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Accueil_principal], analysis took [11.884853454s], cached content [/mnt/internal/sata/downloads/downloaded/workspaces/go/watchthatpage/bin/linux/cache//3f7b7021cd0f50958448273113c2ea1e]
- [DIFF] URL [https://news.google.com/news/?ned=fr&gl=FR&hl=fr], analysis took [18.044830746s], cached content [/mnt/internal/sata/downloads/downloaded/workspaces/go/watchthatpage/bin/linux/cache//f3f1bce5db55b5e02b1479de568b0128]
- [DIFF] URL [https://www.bostonglobe.com/news/bigpicture], analysis took [22.10838874s], cached content [/mnt/internal/sata/downloads/downloaded/workspaces/go/watchthatpage/bin/linux/cache//ee545bc3b80dc0cd684453067b12527a]
Total execution time [22.10883423s], analyzed urls [3], errors [0], diffs [3]
Notifications to [......@gmail.com], from [webmaster@domain.tld], server [smtp.gmail.com:587], template [.../templates/multi-columns.tmpl]
(...)
Example of generated mail :
Example configuration :
{
"Urls" : [
"https://asuswrt.lostrealm.ca/download",
"http://tim.blog/gear/"
],
"Selectors" : {
"http://www.fiio.me/forum.php?mod=viewthread&tid=39932" : { "Selector" : "td[id=postmessage_105396]", "SelectorsToSkip" : [ "ignore_js_op" ] }
},
"LogLevel" : "INFO",
"Gzip" : false,
"MinifyHTML" : true,
"GenerateScreenshots" : true,
"ScreenshotCommand" : "/usr/bin/docker run --rm -v ${cache}:/images kevinsimper/wkhtmltoimage --quality 75 --crop-h 720 --format jpg ${url} /images/${filename}.jpg",
"NotificationMail" : {
"template" : "templates/multi-columns.tmpl",
"to" : "<recipient>@<domain.tld>",
"from" : "<sender>@<domain.tld>",
"subject" : "WatchThatPage results : {{ .NbDiff }} page(s) changed",
"smtp-hostname" : "smtp.gmail.com",
"smtp-tls" : true,
"smtp-port" : 587,
"smtp-login" : "<login>@<domain.tld>",
"smtp-password" : "<password>"
}
,
"SectionsToSkip" : [
"script",
"footer",
"meta",
"style",
"map",
"img",
"form",
"noscript"
]
}
CacheDirectory (optional, default is cache/
) : defines where previous pages content will be stored
Gzip (optional, false by default) : defines if cached content has to be gzipped in order to reduce space usage
MinifyHTML (optional, false by default) : defines if retrieved HTML has to be minified (removal of whitespaces, ...)
SectionsToSkip (optional, empty by default) : list of HTML tags that needs to be removed (should include tags that may be different at each page load) : script
, footer
, meta
, form
, noscript
, style
are good values to exclude
LogLevel : not implemented at this time
NotificationMail : which parameters to use for the mail notification
AutoBackup (optional, false by default) : automatically creates a backup of the previous cache content (allows to manually review differences)
Urls : URLs to grab (without additional customizations)
Selectors : URLs to grab, but with additional selectors (to include one HTML entry to grab and/or to add exclusions)
div[id=article_show_head_right]
) GenerateScreenshots : (optional, false by default) should screenshots be generated or not - next parameter must then be configured
ScreenshotCommand (optional, blank by default) : system command that will be executed in order to take screenshot of remote web pages. Variables like ${url}
, ${cache}
, ${filename}
will be injected at running time and needs thus to be entered exactly like this in the JSON configuration. Examples may be :
/usr/bin/docker run --rm --name wkhtmltoimage -v ${cache}:/images kevinsimper/wkhtmltoimage --quality 75 --crop-h 720 --format jpg ${url} /images/${filename}.jpg
(may be used exactly like that)wkhtmltoimage --quality 75 --crop-h 720 --format jpg ${url} <absolute-path-to-cache-directory>/${filename}.jpg
(cache path has to be customized)Available items are defined in the results.go
file
And for each result (in the result.go
file) :
Example of a basic template (has to be configured in the json
configuration file) :
<html>
<body>
<p>
On {{ .Date }}, {{ .NbUrls }} URLs have been analyzed - {{ .NbErrors }} error(s), {{ .NbDiff }} difference(s), execution time {{ .ExecutionTime }}.<br />
</p>
<p>
List of found differences :
</p>
<ul>
{{ range .Results }}
{{ if .HasDifferences }}
<li><a href="https://github.com/SR-G/watchthatpage/blob/master/{{ .Url }}" target="_blank">{{ .Url }}</a></li>
{{ end }}
{{ end }}
</ul>
</body>
</html>
In order to have this process running each day, just put in the system crontab something like :
30 06 * * * /home/applications/watchthatpage/watchthatpage grab > /var/log/cron-watchthatpage.log 2>&1
Folder structure is dependant of the configuration (default configuration file in same folder - otherwise to be specified through the --configuration flag), templates path defined in JSON configuration.
cache
templates
watchthatpage
watchthatpage.json
Init and build from host :
docker pull golang:alpine
docker run --rm -it -v $(pwd):/go golang go get -d ./...
docker run --rm -it -v $(pwd):/go golang go install tensin.org/watchthatpage
Work from inside a container :
docker run --rm -it -v $(pwd):/go golang /bin/bash
Build from alpine docker image (parameters are used to generate static and reduced binaries) :
go install -ldflags "-d -s -w -X tensin.org/watchthatpage/core.Build=`git rev-parse HEAD`" -a -tags netgo -installsuffix netgo tensin.org/watchthatpage
Cross-compile :
GOARCH=amd64 GOOS=windows go install ...
github.com/PuerkitoBio/goquery
github.com/fatih/color
github.com/spf13/cobra
github.com/tdewolff/minify
github.com/tdewolff/minify/css
github.com/tdewolff/minify/html
github.com/tdewolff/minify/js
golang.org/x/net/html
grep -R -h "github" *|sort -u
grep -R -h "golang" *|sort -u
Keywords :
watch
watch that page
explore
analyze
scrap
parse
diff
differences
delta
extract
read