Installation | Syntax | Citation guidelines | Examples | Feedback | Change log
(22 Sep 2024)
This package allows users to draw Sankey plots in Stata. It is based on the Sankey Guide published on the Stata Guide on Medium in October 2021.
The package can be installed via SSC or GitHub. The GitHub version might be more recent due to bug fixes, feature updates, etc., and may contain syntax improvements and changes in default values. See version numbers below. The GitHub version is eventually published on SSC.
SSC (v1.74):
ssc install sankey, replace
GitHub (v1.8):
net install sankey, from("https://raw.githubusercontent.com/asjadnaqvi/stata-sankey/main/installation/") replace
The palettes and colrspace packages are required to run this command:
ssc install palettes, replace
ssc install colrspace, replace
Even if you have these packages installed, please check for updates: ado update, update.
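To see which versions are currently installed, before or after updating, Stata's built-in which command can be used. This is a minimal check, assuming the packages were installed as shown above; colorpalette is the command that ships with the palettes package:

```stata
* Show the file location and version header of the installed commands
which sankey
which colorpalette
```

The version reported for sankey can then be compared against the SSC and GitHub version numbers listed above.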
If you want to make a clean figure, then it is advisable to load a clean scheme. There are several available, and I personally use the following:
ssc install schemepack, replace
set scheme white_tableau
You can also push the scheme directly into the graph using the scheme(schemename) option. See the help file for details or the example below.
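As a small sketch of the scheme() option, the call below applies a scheme to a single graph without changing the session default. It assumes that schemepack is installed and uses the example data and variable names from the Examples section of this README:

```stata
* Load the example data used in the Examples section
import excel using "https://github.com/asjadnaqvi/stata-sankey/blob/main/data/sankey_example2.xlsx?raw=true", clear first

* Apply the scheme to this graph only; the default scheme stays unchanged
sankey value, from(source) to(destination) by(layer) scheme(white_tableau)
```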
I also prefer narrow fonts in figures with long labels. You can change this as follows:
graph set window fontface "Arial Narrow"
The syntax for the latest version is as follows:
sankey value [if] [in], from(var) to(var) [ by(var) palette(str) colorby(layer|level) colorvar(var) stock colorvarmiss(str) colorboxmiss(str) smooth(1-8) gap(num)
recenter(mid|bot|top) ctitles(list) ctgap(num) ctsize(num) ctposition(bot|top)
ctcolor(str) labangle(str) labsize(str) labposition(str) labgap(str) showtotal labprop labscale(num) valsize(str) valcondition(num) format(str) valgap(str)
novalues valprop valscale(num) novalright novalleft nolabels
sort1(value|name[, reverse]) sort2(value|order[, reverse]) align fill lwidth(str) lcolor(str) alpha(num) offset(num) boxwidth(str) percent wrap(num) * ]
See the help file (help sankey) for details.
The most basic use is as follows:
sankey value, from(var1) to(var2) [by(level)]
where var1 and var2 are the source and destination variables, respectively, against which the value variable is plotted. The by() variable defines the levels and has been optional since v1.72.
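To make the basic call concrete, here is a minimal sketch on a small made-up dataset. The variable names origin, target, and flow, as well as the numbers, are hypothetical and only serve as an illustration; by() is left out, which should work from v1.72 onwards:

```stata
* Toy data: four flows from two origins to two targets (made-up values)
clear
input str8 origin str8 target flow
"A" "X" 10
"A" "Y" 5
"B" "X" 7
"B" "Y" 3
end

* by() is omitted, so a single layer is assumed (a warning is displayed)
sankey flow, from(origin) to(target)
```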
Software packages take countless hours of programming, testing, and bug fixing. If you use this package, then a citation would be highly appreciated. Suggested citations:
in BibTeX
@software{sankey,
author = {Naqvi, Asjad},
title = {Stata package ``sankey''},
url = {https://github.com/asjadnaqvi/stata-sankey},
version = {1.8},
date = {2024-09-22}
}
or simple text
Naqvi, A. (2024). Stata package "sankey" version 1.8. Release date 22 September 2024. https://github.com/asjadnaqvi/stata-sankey.
or see SSC citation (updated once a new version is submitted)
Get the example data from GitHub:
import excel using "https://github.com/asjadnaqvi/stata-sankey/blob/main/data/sankey_example2.xlsx?raw=true", clear first
Let's test the sankey command:
sankey value, from(source) to(destination) by(layer)
sankey value, from(source) to(destination) by(layer) smooth(2)
sankey value, from(source) to(destination) by(layer) smooth(8)
sankey value, from(source) to(destination) by(layer) recenter(bot)
sankey value, from(source) to(destination) by(layer) recenter(top)
sankey value, from(source) to(destination) by(layer) gap(0)
sankey value, from(source) to(destination) by(layer) gap(20)
sankey value, from(source) to(destination) by(layer) noval showtot
sankey value, from(source) to(destination) by(layer) sort1(name)
sankey value, from(source) to(destination) by(layer) sort1(value)
sankey value, from(source) to(destination) by(layer) sort1(value) sort2(value)
sankey value, from(source) to(destination) by(layer) sort1(name, reverse) sort2(value)
sankey value, from(source) to(destination) by(layer) sort1(name, reverse) sort2(value, reverse)
sankey value, from(source) to(destination) by(layer) sort1(name, reverse) sort2(order)
sankey value, from(source) to(destination) by(layer) sort1(name, reverse) sort2(order, reverse)
Custom sorting on a value:
gen source2 = .
gen destination2 = .
foreach x in source destination {
    replace `x'2 = 1 if `x'=="Blog"
    replace `x'2 = 2 if `x'=="LinkedIn"
    replace `x'2 = 3 if `x'=="Twitter"
    replace `x'2 = 4 if `x'=="Direct"
    replace `x'2 = 5 if `x'=="App"
    replace `x'2 = 6 if `x'=="Medium"
    replace `x'2 = 7 if `x'=="Website"
    replace `x'2 = 8 if `x'=="Homepage"
    replace `x'2 = 9 if `x'=="Total"
    replace `x'2 = 10 if `x'=="Google"
    replace `x'2 = 11 if `x'=="Facebook"
}
lab de labels 1 "Blog" 2 "LinkedIn" 3 "Twitter" 4 "Direct" 5 "App" 6 "Medium" 7 "Website" 8 "Homepage" 9 "Total" 10 "Google" 11 "Facebook", replace
lab val source2 labels
lab val destination2 labels
sankey value, from(source2) to(destination2) by(layer)
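The same ordering can probably be obtained with fewer lines using Stata's encode command, which reuses the codes of an existing value label when one is passed through its label() option. The sketch below is an alternative under that assumption, not the package author's method; the names order_lbl, source3, and destination3 are made up for illustration:

```stata
* One shared value label that fixes the drawing order
label define order_lbl 1 "Blog" 2 "LinkedIn" 3 "Twitter" 4 "Direct" 5 "App" 6 "Medium" 7 "Website" 8 "Homepage" 9 "Total" 10 "Google" 11 "Facebook", replace

* encode maps each string to the code already defined in order_lbl
cap drop source3
cap drop destination3
encode source, generate(source3) label(order_lbl)
encode destination, generate(destination3) label(order_lbl)

sankey value, from(source3) to(destination3) by(layer)
```

Strings that are not listed in the value label are appended by encode with new codes, so it is worth checking the label with label list order_lbl afterwards.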
sankey value, from(source) to(destination) by(layer) boxwid(5)
sankey value, from(source) to(destination) by(layer) valcond(200)
sankey value, from(source) to(destination) by(layer) valcond(300)
sankey value, from(source) to(destination) by(layer) palette(CET C6)
sankey value, from(source) to(destination) by(layer) colorby(level)
gen trace1 = 1 if source=="App"
sankey value, from(source) to(destination) by(layer) colorvar(trace1)
cap drop trace2
gen trace2 = .
replace trace2 = 1 if source=="App" & destination=="App" & layer==0
replace trace2 = 2 if source=="App" & destination=="App" & layer==1
replace trace2 = 3 if source=="App" & destination=="App" & layer==2
replace trace2 = 4 if source=="App" & destination=="Total" & layer==3
sankey value, from(source) to(destination) by(layer) colorvar(trace2)
sankey value, from(source) to(destination) by(layer) colorvar(trace2) palette(Oranges)
sankey value, from(source) to(destination) by(layer) colorvar(trace2) palette(Blues) ///
colorvarmiss(gs13) colorboxmiss(gs13)
sankey value, from(source) to(destination) by(layer) colorvar(trace2) ///
palette(blue*0.1 blue*0.3 blue*0.5 blue*0.7) colorvarmiss(gs13) colorboxmiss(gs13)
sankey value, from(source) to(destination) by(layer) ctitles(Cat1 Cat2 Cat3 Cat4 Cat5)
sankey value, from(source) to(destination) by(layer) ctitles(Cat1 Cat2 Cat3 Cat4 Cat5) ctg(-100)
sankey value, from(source) to(destination) by(layer) ctitles("Cat 1" "Cat 2" "Cat 3" "Cat 4" "Cat 5") ctg(-100)
sankey value, from(source) to(destination) by(layer) ctitles("Cat 1" "Cat 2" "Cat 3" "Cat 4" "Cat 5") ctpos(top) ctg(100) recenter(top)
sankey value, from(source) to(destination) by(layer) noval showtot palette(CET C6) ///
laba(0) labpos(3) labg(-1) offset(10)
sankey value, from(source) to(destination) by(layer) novalleft
sankey value, from(source) to(destination) by(layer) novalright
sankey value, from(source) to(destination) by(layer) noval
sankey value, from(source) to(destination) by(layer) nolabels
sankey value, from(source) to(destination) by(layer) valprop vals(2)
sankey value, from(source) to(destination) by(layer) labprop labs(2)
sankey value, from(source) to(destination) by(layer) stock
sankey value, from(source) to(destination) by(layer) palette(CET C6) alpha(60) ///
labs(2.5) laba(0) labpos(3) labg(-1) offset(5) noval showtot ///
ctitles("Cat 1" "Cat 2" "Cat 3" "Cat 4" "Cat 5") ctg(-100) cts(3) ///
title("My sankey plot", size(6)) note("Made with the #sankey package.", size(2.2)) ///
xsize(2) ysize(1)
Please open an issue to report errors, feature enhancements, and/or other requests.
v1.8 (22 Sep 2024)
align: aligns flows. Works only if there is just one parent (still beta).
fill: extrapolates missing flows. Works only if there is just one parent (still beta).
n(): allows users to increase the number of points for generating the arcs. Default is 30.
v1.74 (11 Jun 2024)
wrap(): option for wrapping labels.
v1.73 (16 Mar 2024)
If the from() and to() variables have value labels, then the order of the value labels is respected. This gives users full control over the drawing order of the layers through value labels (requested by Katie Naylor + others).
from() and to() cannot have different format types. Both have to be either string or numeric variables. This was necessary in order to implement the above change.
v1.72 (12 Feb 2024)
labprop: wrong calculation of the label sizes.
valcond() now passes on to box labels. Was removed but has been put back in.
by() changed to optional. Assumes one layer if not specified. This is mostly a quality-of-life improvement. A warning message is displayed to ensure that by() is not left out by mistake.
ctsize() converted to string to allow size names.
ctcolor() added.
v1.71 (15 Jan 2024)
from() and to() variables with value labels were messing up the labels in the final figure (reported by Ian White).
v1.7 (06 Nov 2023)
valcond() dropping bar values.
ctitles() getting random colors. It now defaults to black.
ctpos(): option to change column title position.
percent: option, still beta, that converts flows to percent values.
v1.61 (22 Jul 2023)
saving() option added (requested by Anirban Basu).
v1.6 (11 Jun 2023)
sortby() split into sort1() and sort2() for clarity.
stock added to collapse own flows (source = destination) to box heights (requested by Oras Alabas).
v1.51 (25 May 2023)
from() and to() variable. This ensures that the code runs regardless of the variable types. Ideally both should be strings.
v1.5 (30 Apr 2023)
labprop, titleprop, and labscale() for scaling values and labels.
novalright, novalleft, nolabels options.
sortby(., reverse) option.
v1.4 (23 Apr 2023)
v1.31 (04 Apr 2023)
v1.3 (26 Feb 2023)
sortby() added that allows alphabetical sorting (sortby(name)) or numerical sorting (sortby(value)). Thanks to Fabian Unterlass for detailed feedback.
boxwidth() added to allow adjusting the width of node boxes.
v1.21 (15 Feb 2023)
valcond() fixed.
v1.2 (02 Feb 2023)
v1.1 (13 Dec 2022)
valformat() renamed to just format. This aligns it with standard Stata usage.
offset() added to displace the x-axis on the right-hand side. Offset is given in percentage share of the x-axis range. This allows rotated labels to be displaced properly.
v1.0 (08 Dec 2022)