corybrunson / ggalluvial

ggplot2 extension for alluvial plots
http://corybrunson.github.io/ggalluvial/
GNU General Public License v3.0

Question: freq in vaccinations dataset #70

Closed aaronzhangSema4 closed 3 years ago

aaronzhangSema4 commented 3 years ago

Why does a subject have a freq column?

Thanks for this great package. I am reading the tutorial:

https://cran.r-project.org/web/packages/ggalluvial/vignettes/ggalluvial.html

The last example got me really confused. What does the value of frequency mean for subject 1 at survey "ms153_NSA"?

corybrunson commented 3 years ago

Hi @aaronzhangSema4 and thank you for the praise. This is a good question, and one I've elided in the vignette.

The answer is contained in the documentation for the vaccinations data set; run help(vaccinations) to read it, or see the page on the website. The "freq" column is the number of survey respondents who had the same set of responses to all three surveys. This means that the "subject" column is a bit of a misnomer; it identifies the cohort of subjects with common responses, not the individual subject. Does that clarify it for you?
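As a rough illustration of that aggregation, here is a minimal sketch in base R using made-up respondents rather than the actual survey data. The names `individuals`, `cohorts`, `lodes`, `s1`, `s2`, and `s3`, and all values, are hypothetical; this is not necessarily how the vaccinations data set was prepared.

```r
# Hypothetical individual-level responses: one row per respondent,
# one column per survey (all values are invented for illustration).
individuals <- data.frame(
  s1 = c("Always", "Always", "Always", "Never", "Never"),
  s2 = c("Always", "Always", "Sometimes", "Never", "Never"),
  s3 = c("Always", "Always", "Sometimes", "Never", "Never"),
  stringsAsFactors = FALSE
)

# Count the respondents who share the same responses to all three surveys.
individuals$freq <- 1
cohorts <- aggregate(freq ~ s1 + s2 + s3, data = individuals, FUN = sum)

# "subject" here labels a cohort of respondents, not an individual.
cohorts$subject <- seq_len(nrow(cohorts))

# Reshape to long form: one row per cohort per survey.
# The cohort's freq is repeated on each of its rows.
lodes <- reshape(
  cohorts,
  direction = "long",
  varying   = c("s1", "s2", "s3"),
  v.names   = "response",
  timevar   = "survey",
  times     = c("s1", "s2", "s3"),
  idvar     = "subject"
)

print(lodes[order(lodes$subject, lodes$survey), ])
```

In this toy example, five respondents collapse into three cohorts, and each cohort appears once per survey with the same `freq` on every row.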

I think your confusion is reasonable, so I'll add a note about this to the vignette for the upcoming release. Thank you for raising the issue!

aaronzhangSema4 commented 3 years ago

Thank you @corybrunson for the quick response. I agree that it makes more sense if each "subject" actually represents a "cohort".

However, it still does not make sense if the "response" column represents the same responses to all three surveys. I confirmed that there are three surveys in the dataset:

> vaccinations %>% select(survey, start_date, end_date) %>% distinct()
     survey start_date   end_date
1 ms153_NSA 2010-09-22 2010-10-25
2 ms432_NSA 2015-06-04 2015-10-05
3 ms460_NSA 2016-09-27 2016-10-25

Then each cohort should have the same freq at each axis of the alluvial plot. For example, if 50 respondents gave "Always" to each of the three surveys, then it should be 50 at axis 1, axis 2, and axis 3. I think I am missing something, or the data documentation is lacking something...

corybrunson commented 3 years ago

Yes, there are only three surveys, but the total number of participants who responded, for example, "Always" is not the same at each one. Rather, what should be constant is the number of participants in each cohort at each survey, i.e. the value of "freq" should be the same in every row with a given value of "subject". That holds up when I inspect the data. Does it make sense?
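One way to see the distinction is the following sketch in base R, run on a small invented data frame (`toy` and its values are hypothetical) laid out with the `subject`, `survey`, `response`, and `freq` columns discussed in this thread.

```r
# Hypothetical cohort-level data: one row per cohort ("subject") per survey.
# All values are invented for illustration.
toy <- data.frame(
  subject  = rep(1:3, times = 3),
  survey   = rep(c("s1", "s2", "s3"), each = 3),
  response = c("Always", "Always",    "Never",
               "Always", "Sometimes", "Never",
               "Always", "Sometimes", "Never"),
  freq     = rep(c(2, 1, 2), times = 3)
)

# TRUE if every subject (cohort) carries a single freq value across surveys.
freq_constant <- all(
  tapply(toy$freq, toy$subject, function(x) length(unique(x))) == 1
)

# Total participants per survey: the same at every survey.
per_survey <- tapply(toy$freq, toy$survey, sum)

# Participants per response within each survey: these totals can differ,
# because cohort 2 switches from "Always" to "Sometimes".
per_response <- xtabs(freq ~ response + survey, data = toy)

print(freq_constant)
print(per_survey)
print(per_response)
```

To try the same checks on the real data, `toy` could be replaced with `ggalluvial::vaccinations`, assuming its columns carry the names used here.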

corybrunson commented 3 years ago

The explanation was added in commit 3413a84d94a977ffc73b06de31c5f60fd114ca5f.

aaronzhangSema4 commented 3 years ago

Thanks. I misunderstood part of the data processing. Now it makes sense.