SR-G / watchthatpage

Command line utility to monitor remote web pages and trigger mail notifications when modifications are detected.
Apache License 2.0

watchthatpage

watchthatpage is a command line program that sends mail notifications when changes are detected on remote web pages. Any number of URLs can be monitored; in addition, selectors may be configured to extract only the relevant part of a page. Notification mails are templated and configurable. Optionally, screenshots of the websites may be taken (with an external dependency, see examples).
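
The detection step described above can be pictured with the following minimal sketch. It is not the project's actual code: the helper names (cacheKey, hasChanged, demo), the choice of hashing the URL with MD5 to name the cache file, and the rule that a missing cache entry counts as a change are all assumptions made for illustration.

```go
package main

import (
	"bytes"
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// cacheKey derives a cache file name from a URL.
// Assumption: hashing the URL with MD5 is only illustrative.
func cacheKey(url string) string {
	sum := md5.Sum([]byte(url))
	return hex.EncodeToString(sum[:])
}

// hasChanged reports whether content differs from the copy cached for url,
// and refreshes the cache when it does. A missing cache entry counts as a change.
func hasChanged(cacheDir, url string, content []byte) (bool, error) {
	path := filepath.Join(cacheDir, cacheKey(url))
	previous, err := os.ReadFile(path)
	if err != nil && !os.IsNotExist(err) {
		return false, err
	}
	if err == nil && bytes.Equal(previous, content) {
		return false, nil
	}
	return true, os.WriteFile(path, content, 0o644)
}

// demo checks three successive versions of one page against a throwaway cache.
func demo() ([]bool, error) {
	dir, err := os.MkdirTemp("", "wtp-cache")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	var changes []bool
	for _, body := range []string{"<p>v1</p>", "<p>v1</p>", "<p>v2</p>"} {
		changed, err := hasChanged(dir, "https://example.org/", []byte(body))
		if err != nil {
			return nil, err
		}
		changes = append(changes, changed)
	}
	return changes, nil
}

func main() {
	changes, err := demo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(changes) // prints [true false true]
}
```

In this sketch the whole page body is compared; a real run would first apply the configured selectors and skipped sections, so that only the relevant part of the page takes part in the comparison.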

Examples

Usage

WatchThatPage is a command line program used to trigger notifications when some HTML page contents is modified

Usage:
  watchthatpage [command]

Available Commands:
  clean       Clean cached content
  grab        Grab pages
  help        Help about any command
  version     Print the version number of watchthatpage

Flags:
      --configuration string   Configuration file name. Default is binary name + .json (e.g. 'watchthatpage.json'), in the same folder than the binary itself (default "watchthatpage.json")
  -h, --help                   help for watchthatpage

Use "watchthatpage [command] --help" for more information about a command.

Example of output :

Configuration file found under [watchthatpage.json], now loading content
Configuration loaded with [3] urls, gzip [false], minify [true], auto backup [true], generate screenshots [true], sections to skip [script footer meta style map img nav select form noscript]
Now parsing URL [https://www.bostonglobe.com/news/bigpicture]
Now parsing URL [https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Accueil_principal]
Now parsing URL [https://news.google.com/news/?ned=fr&gl=FR&hl=fr]
Results : 
  - [DIFF]  URL [https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Accueil_principal], analysis took [11.884853454s], cached content [/mnt/internal/sata/downloads/downloaded/workspaces/go/watchthatpage/bin/linux/cache//3f7b7021cd0f50958448273113c2ea1e]
  - [DIFF]  URL [https://news.google.com/news/?ned=fr&gl=FR&hl=fr], analysis took [18.044830746s], cached content [/mnt/internal/sata/downloads/downloaded/workspaces/go/watchthatpage/bin/linux/cache//f3f1bce5db55b5e02b1479de568b0128]
  - [DIFF]  URL [https://www.bostonglobe.com/news/bigpicture], analysis took [22.10838874s], cached content [/mnt/internal/sata/downloads/downloaded/workspaces/go/watchthatpage/bin/linux/cache//ee545bc3b80dc0cd684453067b12527a]
Total execution time [22.10883423s], analyzed urls [3], errors [0], diffs [3]
Notifications to [......@gmail.com], from [webmaster@domain.tld], server [smtp.gmail.com:587], template [.../templates/multi-columns.tmpl]
(...)

Example of generated mail :

[Screenshot: generated mail]

Configuration

Example configuration :

{
  "Urls" : [ 
    "https://asuswrt.lostrealm.ca/download",
    "http://tim.blog/gear/"
  ],
  "Selectors" : {
    "http://www.fiio.me/forum.php?mod=viewthread&tid=39932" : { "Selector" : "td[id=postmessage_105396]", "SelectorsToSkip" : [ "ignore_js_op" ] }
  },
  "LogLevel" : "INFO",
  "Gzip" : false,
  "MinifyHTML" : true,
  "GenerateScreenshots" : true,
  "ScreenshotCommand" : "/usr/bin/docker run --rm -v ${cache}:/images kevinsimper/wkhtmltoimage --quality 75 --crop-h 720 --format jpg ${url} /images/${filename}.jpg",
  "NotificationMail" : { 
    "template" : "templates/multi-columns.tmpl",
    "to" : "<recipient>@<domain.tld>",
    "from" : "<sender>@<domain.tld>",
    "subject" : "WatchThatPage results : {{ .NbDiff }} page(s) changed",
    "smtp-hostname" : "smtp.gmail.com",
    "smtp-tls" : true,
    "smtp-port" : 587,
    "smtp-login" : "<login>@<domain.tld>",
    "smtp-password" : "<password>"
  },
  "SectionsToSkip" : [ 
    "script", 
    "footer", 
    "meta", 
    "style", 
    "map",
    "img",
    "form",
    "noscript"
  ]
}
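
As an illustration of how such a file could be read, here is a minimal Go sketch based on encoding/json. The types (Config, Selector, Mail) and the helpers (parseConfig, expandCommand) are hypothetical and only mirror the keys of the example above; they are not the project's actual definitions. The meaning given to the ${cache}, ${url} and ${filename} placeholders (cache folder, page URL, image base name) is likewise an assumption drawn from the example command, and the sample document is shortened.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Selector mirrors one entry of the "Selectors" object (hypothetical type).
type Selector struct {
	Selector        string
	SelectorsToSkip []string
}

// Mail mirrors the "NotificationMail" object; tags map the dashed JSON keys.
type Mail struct {
	Template string `json:"template"`
	To       string `json:"to"`
	From     string `json:"from"`
	Subject  string `json:"subject"`
	Hostname string `json:"smtp-hostname"`
	TLS      bool   `json:"smtp-tls"`
	Port     int    `json:"smtp-port"`
	Login    string `json:"smtp-login"`
	Password string `json:"smtp-password"`
}

// Config mirrors the top-level keys of the example configuration.
type Config struct {
	Urls                []string
	Selectors           map[string]Selector
	LogLevel            string
	Gzip                bool
	MinifyHTML          bool
	GenerateScreenshots bool
	ScreenshotCommand   string
	NotificationMail    Mail
	SectionsToSkip      []string
}

// parseConfig decodes a JSON document into a Config.
func parseConfig(data []byte) (Config, error) {
	var cfg Config
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

// expandCommand substitutes the ${cache}, ${url} and ${filename} placeholders.
func expandCommand(command, cache, url, filename string) string {
	return strings.NewReplacer(
		"${cache}", cache, "${url}", url, "${filename}", filename,
	).Replace(command)
}

// sample is a shortened configuration with a simplified screenshot command.
const sample = `{
  "Urls": ["https://asuswrt.lostrealm.ca/download", "http://tim.blog/gear/"],
  "Selectors": {
    "http://www.fiio.me/forum.php?mod=viewthread&tid=39932": {
      "Selector": "td[id=postmessage_105396]",
      "SelectorsToSkip": ["ignore_js_op"]
    }
  },
  "MinifyHTML": true,
  "ScreenshotCommand": "wkhtmltoimage ${url} ${cache}/${filename}.jpg",
  "NotificationMail": {"smtp-hostname": "smtp.gmail.com", "smtp-port": 587, "smtp-tls": true}
}`

func main() {
	cfg, err := parseConfig([]byte(sample))
	if err != nil {
		fmt.Println("invalid configuration:", err)
		return
	}
	fmt.Println(len(cfg.Urls), "urls, smtp port", cfg.NotificationMail.Port)
	fmt.Println(expandCommand(cfg.ScreenshotCommand, "/tmp/cache", cfg.Urls[0], "page"))
}
```

Keys absent from the file simply keep their zero value here (false, empty string, nil slice); whether the real program applies other defaults is not covered by this sketch.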

Template

Available items are defined in the results.go file and, for each result, in the result.go file.

Example of a basic template (its path has to be set in the JSON configuration file, under NotificationMail.template) :

<html>
<body>
<p>
On {{ .Date }}, {{ .NbUrls }} URLs have been analyzed - {{ .NbErrors }} error(s), {{ .NbDiff }} difference(s), execution time {{ .ExecutionTime }}.<br />
</p>

<p>
List of found differences :
</p>
<ul>
    {{ range .Results }}  
        {{ if .HasDifferences }}
            <li><a href="{{ .Url }}" target="_blank">{{ .Url }}</a></li>
        {{ end }}
    {{ end }}
</ul>

</body>
</html>
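
To show how such a template is filled in, here is a minimal sketch using Go's html/template package. The Results and Result types below are hypothetical: they only carry the field names used in the example template and in the configured mail subject, and the real definitions live in the project's own source files. Rendering the subject with html/template as well is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"html/template"
	"strings"
)

// Result is a hypothetical per-URL outcome, limited to the fields the template uses.
type Result struct {
	Url            string
	HasDifferences bool
}

// Results is a hypothetical summary of one run.
type Results struct {
	Date          string
	NbUrls        int
	NbErrors      int
	NbDiff        int
	ExecutionTime string
	Results       []Result
}

// subject is the mail subject from the example configuration.
const subject = `WatchThatPage results : {{ .NbDiff }} page(s) changed`

// list is a condensed version of the <ul> part of the basic template.
const list = `<ul>{{ range .Results }}{{ if .HasDifferences }}<li><a href="{{ .Url }}">{{ .Url }}</a></li>{{ end }}{{ end }}</ul>`

// sampleResults is made-up data: one changed page, one unchanged page.
var sampleResults = Results{
	Date:          "2024-01-01",
	NbUrls:        2,
	NbDiff:        1,
	ExecutionTime: "1.5s",
	Results: []Result{
		{Url: "https://example.org/a", HasDifferences: true},
		{Url: "https://example.org/b", HasDifferences: false},
	},
}

// render parses text as a template and executes it against data.
func render(text string, data Results) (string, error) {
	t, err := template.New("mail").Parse(text)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := t.Execute(&out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	for _, text := range []string{subject, list} {
		out, err := render(text, sampleResults)
		if err != nil {
			fmt.Println("template error:", err)
			return
		}
		fmt.Println(out)
	}
}
```

Only the page flagged with HasDifferences ends up in the list, which is the filtering done by the if block of the basic template.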

Crontab

To run this process every day, add a line like the following to a crontab (in /etc/crontab, a user name is also required before the command) :

30  06  *   *   *       /home/applications/watchthatpage/watchthatpage grab > /var/log/cron-watchthatpage.log 2>&1

The folder structure depends on the configuration: by default the configuration file sits in the same folder as the binary (otherwise specify it through the --configuration flag), and the templates path is defined in the JSON configuration.

cache
templates
watchthatpage
watchthatpage.json

Links

Development

Build

Init and build from host :

docker pull golang:alpine
docker run --rm -it -v $(pwd):/go golang go get -d ./...
docker run --rm -it -v $(pwd):/go golang go install tensin.org/watchthatpage

Work from inside a container :

docker run --rm -it -v $(pwd):/go golang /bin/bash

Build from alpine docker image (parameters are used to generate static and reduced binaries) :

go install -ldflags "-d -s -w -X tensin.org/watchthatpage/core.Build=`git rev-parse HEAD`" -a -tags netgo -installsuffix netgo tensin.org/watchthatpage 

Cross-compile :

GOARCH=amd64 GOOS=windows go install ...

Dependencies

github.com/PuerkitoBio/goquery
github.com/fatih/color
github.com/spf13/cobra
github.com/tdewolff/minify
github.com/tdewolff/minify/css
github.com/tdewolff/minify/html
github.com/tdewolff/minify/js
golang.org/x/net/html

The list can be rebuilt from the sources with :

grep -R -h "github" *|sort -u 
grep -R -h "golang" *|sort -u 

TODO

To add

Find a proper name

Keywords :

watch
watch that page
explore
analyze
scrape
parse
diff
differences
delta
extract
read