asjadnaqvi / stata-sankey

A Stata package for Sankey diagrams
MIT License

Error with labeled/numerical variables -- allow sort on numerical variable #16

Closed: jackmagi closed this issue 1 year ago

jackmagi commented 1 year ago

Asjad, thank you very much for this amazing package!

I'm running into an issue with labeled/numerical variables. Here's an example.

If I do

    import excel using "https://github.com/asjadnaqvi/stata-sankey/blob/main/data/sankey_example2.xlsx?raw=true", clear first
    sankey value, from(source) to(destination) by(layer)

it works well.

But if I try to reproduce a similar graph with labeled variables

    import excel using "https://github.com/asjadnaqvi/stata-sankey/blob/main/data/sankey_example2.xlsx?raw=true", clear first
    sankey value, from(source) to(destination) by(layer)
    encode source, g(source_n)
    encode destination, g(destination_n)
    sankey value, from(source_n) to(destination_n) by(layer)

I get the error:

    not possible with numeric variable
    r(107);

According to the sankey help file, I thought the above should have worked. The help says: "The command requires a numeric variable. Both from() and to() can contain numeric, labeled or string variables."

Could you please advise on what is going wrong?

Thank you, Giacomo

asjadnaqvi commented 1 year ago

The error is in the help file. from() and to() should only be strings. I will also add checks for this.

Encoding can break the mapping between the two variables: each variable is encoded separately, so the same string can be assigned a different numeric code in from() and in to() whenever the two variables do not contain the same set of values. Since the layout is generated iteratively, mismatched codes lead to an incorrect mapping.

jackmagi commented 1 year ago

Thanks! However, there are cases in which being able to pass numerical/labeled variables to from() and to() would be useful. For example, I am dealing with a situation in which I'd like a precise display order for the values of my from() and to() string variables, and that order is not alphabetical. Specifically, the values are "Support", "Neutral", "Oppose", and I would like them to be sorted in exactly this order. Labeled variables would do the job.
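To make the request concrete, here is a minimal sketch of how such an ordering is usually stored in a labeled variable in plain Stata. The rows are made-up toy data, and the label name stance and the variables source_n and destination_n are hypothetical names chosen for illustration. It only shows the encoding step; it does not assume that sankey accepts the resulting numeric variables.

```stata
* Toy data, invented for illustration only
clear
input str7 source str7 destination value layer
"Support" "Neutral" 10 1
"Neutral" "Oppose"   5 1
"Oppose"  "Support"  3 1
end

* Define the desired (non-alphabetical) order once
label define stance 1 "Support" 2 "Neutral" 3 "Oppose"

* Encode both variables against the same label.
* noextend makes encode stop with an error if a string is missing from the label.
encode source, gen(source_n) label(stance) noextend
encode destination, gen(destination_n) label(stance) noextend

* Sorting on the numeric codes now follows Support, Neutral, Oppose
sort source_n destination_n
list source source_n destination destination_n, nolabel
```

Because both variables share one predefined label, "Neutral" is code 2 in either column, which is the property the display order depends on.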

asjadnaqvi commented 1 year ago

Using value labels is highly risky as mentioned in the comment above.

If you encode the from() and to() variables to make them numeric, then there is a good chance it will mess up the order. Here is an example:

from to value layer
A B 100 1
A C 50 1
C A 20 2

would encode into

from to value layer
1 2 100 1
1 3 50 1
2 1 20 2

Here we can see that the original flow is A -> C -> A, but the encoded version reads 1 (A) -> 2 (B as a destination, C as a source) -> 1 (A). Because code 2 means B in to() but C in from(), this would end up generating a wrong figure. I can of course allow this option and leave it to users to make sure the figures are generated correctly.
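The mismatch can be reproduced in plain Stata with the three rows from the example. This is a sketch for illustration: the variable names source and destination stand in for from and to, and the label name nodes is hypothetical. The second half shows one way to keep the codes consistent, by encoding both variables against a single shared value label; it does not imply that sankey accepts numeric inputs.

```stata
* The three rows from the example above
clear
input str1 source str1 destination value layer
"A" "B" 100 1
"A" "C"  50 1
"C" "A"  20 2
end

* Separate encoding: each variable gets its own alphabetical numbering
encode source, gen(source_n)            // A = 1, C = 2
encode destination, gen(destination_n)  // A = 1, B = 2, C = 3
list source source_n destination destination_n, nolabel

* Shared label: the same string gets the same code in both variables
label define nodes 1 "A" 2 "B" 3 "C"
encode source, gen(source_s) label(nodes)
encode destination, gen(destination_s) label(nodes)
list source source_s destination destination_s, nolabel
```

In the first listing, code 2 stands for C in source_n but for B in destination_n. In the second, C is code 3 in both columns.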

What you are requesting is a custom ordering option. This is already in the works.

jackmagi commented 1 year ago

Absolutely. Custom ordering would be ideal in my situation. Looking forward to the new version with that option. Thank you!