Ideas for SPSS vs R comparison series

bobaekang commented 4 years ago

Summary

Please leave a comment to share ideas and interesting data to use for the SPSS vs R comparison series.

Main

To make the R User Group meetings more useful and engaging for more people, we are starting a new series of meetings in 2020 focusing on comparing SPSS and R.

The basic idea is to select a set of tasks and challenges involving various aspects of data analysis (e.g. data transformation, visualization, statistical modeling) and try the solutions in both SPSS and R. This way, we will be able to see how solutions using one tool can be translated into the other, and improve our understanding of both tools!

To make this idea really work, we need your input. What data analysis tasks do you think would be relevant and informative? Do you know of any interesting datasets to try out for these comparisons? How do you think these comparisons should be formatted and carried out?

👉 Please leave a comment below to share your thoughts. 👈 Feel free to comment on others' ideas or ask questions if you need more clarification on this series. At the January 2020 meeting, we will make an action plan for the series based on the discussion in this thread.

Remember, every idea/comment counts. Let's get started 🚀

hdotto commented 4 years ago

An SPSS feature I've found useful in the past (and have no idea how to do in R) is the duplicate cases identifier. In SPSS, it's a pretty simple interface where the user can drag and drop the variables they want to match and sort on, but in R I'm not sure what the code would be.

For example, I've found this useful when it comes to identifying the first exit from IDOC for an individual case in a database where there can be multiple admissions and exits for the same person. In SPSS, you would place the identifier variable (i.e., DOC number) into the "Define matching cases by:" box and the exit date variable in the "Sort within matching groups by:" box. You then check whether you want to create a new variable that returns a 1 for someone's first or last exit. So, if you want to flag their first exit, you would get a variable with a 1 for the first exit and a 0 for all subsequent.

I know this isn't the most analytical/complex example but it's something I would like to know in R as I've found it surprisingly useful in SPSS for CJ purposes.

bobaekang commented 4 years ago

Thank you @hdotto for your suggestion! Handling duplicate rows in a table sounds like a great case study for comparing SPSS and R.

Your example, by the way, points to something slightly more elaborate: sorting and filtering a table by group to find a representative case (e.g. the minimum exit date) for each value of a grouping variable (e.g. DOC number). This is certainly more than simply discarding complete duplicates and keeping only distinct/unique rows.
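For what it's worth, here is a rough base R sketch of both cases. It assumes a made-up table with hypothetical columns doc_number and exit_date (not real IDOC data), and it is only one of several ways to do this in R.

```r
# Hypothetical example data: one row per exit, several exits per person.
exits <- data.frame(
  doc_number = c("A1", "A1", "B2", "B2", "B2", "C3"),
  exit_date  = as.Date(c("2015-06-01", "2012-03-15", "2018-01-10",
                         "2016-07-04", "2018-01-10", "2019-09-30")),
  stringsAsFactors = FALSE
)

# Case 1: drop rows that are complete duplicates of an earlier row.
distinct_exits <- unique(exits)

# Case 2: flag the first exit per person.
# Sort by person, then by exit date, so the earliest exit comes first in each group.
sorted <- distinct_exits[order(distinct_exits$doc_number, distinct_exits$exit_date), ]

# duplicated() is FALSE the first time a doc_number appears and TRUE afterwards,
# so negating it gives 1 for the first exit and 0 for all subsequent ones.
sorted$first_exit <- as.integer(!duplicated(sorted$doc_number))

# Keep only the first exit for each person.
first_exits <- sorted[sorted$first_exit == 1, ]

print(sorted)
```

The sort-then-flag order mirrors the SPSS dialog: the "Define matching cases by:" variable becomes the argument to duplicated(), and the "Sort within matching groups by:" variable becomes the second key in order().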

In fact, both are good case studies for the comparison series. And, in my view, simple, well-defined, yet commonly found examples like yours are much preferable to highly specialized ones for our meetings.

mwpowers commented 4 years ago

Possibly pivot tables, and what types of summary statistics can be produced with them. I am curious whether SPSS has something like pivot tables in Excel or in the rpivotTable package that doesn't require an SPSS add-on licence or the SPSS Python extension.

bobaekang commented 4 years ago

Thank you @mwpowers for your suggestion! I just looked up the rpivotTable package and skimmed through this introductory vignette. The package seems like a cool project.

In my view, however, the limitation of rpivotTable is that it only provides a graphical interface that is more or less divorced from the main data analysis workflow. While rpivotTable provides a means to explore a dataset visually and interactively, as in Excel, I cannot find a feature to export the resulting transformed table to either a file or code.

I don't use Excel extensively, but my understanding is that, at its core, a pivot table serves as an intuitive interface to grouped operations and aggregations on tabular data. From this perspective, most pivot table operations should be reproducible in R and probably in SPSS, too. In that case, I think it would be a very fun and useful exercise to identify a (sub)set of common operations handled via Excel pivot tables and implement them using SPSS or R.
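To illustrate what I mean, here is a minimal base R sketch of one such operation (rows = one variable, columns = another, values = a sum). The data and the column names county, offense, and count are made up for illustration.

```r
# Hypothetical data: made-up counts by county and offense type.
arrests <- data.frame(
  county  = c("Cook", "Cook", "Cook", "Lake", "Lake", "Will"),
  offense = c("drug", "property", "drug", "drug", "property", "property"),
  count   = c(10, 5, 7, 3, 4, 6),
  stringsAsFactors = FALSE
)

# "Rows = county, Columns = offense, Values = sum of count" as a cross-tabulation.
pivot <- xtabs(count ~ county + offense, data = arrests)

# Add row and column totals, similar to the Grand Total in an Excel pivot table.
pivot_with_totals <- addmargins(pivot)

# Long-format equivalent: one row per county/offense combination present in the data.
long_summary <- aggregate(count ~ county + offense, data = arrests, FUN = sum)

print(pivot_with_totals)
```

Because the result is an ordinary R object produced by code, it can be saved, rerun, or passed on to later steps, which is the part I could not find in the interactive interface.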

What do you think?

mwpowers commented 4 years ago

I guess it depends on what is considered workflow. In SPAC, we often get follow-up questions without much notice that involve slicing up data further and checking some numbers, even if the result does not go into a final product. For CHRI data in particular, it's sometimes nice to have the pivot table saved in an rmarkdown file if you need to slice the data a little more and still maintain a consistent total, whereas you may not be able to reproduce it if you are querying the data after an update.

bobaekang commented 4 years ago

@mwpowers, that is a fascinating use case! Regardless of the comparison series, I would love to have you showcase how you integrate interactive pivot tables and how they serve your real-world needs. If you don't mind, please consider opening a new Issue post introducing rpivotTable or presenting your use case in one of the future User Group meetings 👍

As for data analysis "workflow", I was thinking more narrowly, in terms of building reproducible steps for transforming data objects (e.g. tables) and generating stable data products (numerical summaries, figures, statistical models, etc.) as results. And for the comparison series, trying implementations of common pivot table operations sounds like a good idea to me 😃

lgleich3 commented 4 years ago

I agree with H.Doug :) Finding duplicate cases would be useful, for example when someone has been admitted to IDOC or IDJJ twice in the same year (or fiscal year). Or for CHRI, how to get one row for each individual without duplicates. This is essentially what one rec was/would be intended to do, but I'm unsure when or if that will ever come to fruition.

I also struggle with just basic data cleaning in R. I tend to do it in Excel or SPSS and then bring the result into R.

Also, if there is anything for qualitative data analysis, that would be interesting.

justinilla commented 4 years ago

I agree with @hdotto too. Working with data where multiple rows refer to one individual (in a longitudinal sense), or where identical rows have somehow found their way into the data, is a task I find myself doing regularly. This was especially true in our most recent project, when choosing among multiple prison admissions and exits, parole admissions, and arrests for one individual. I could contribute some of that perspective (and an amateur R script to be improved) come discussion time. There are some real roadblocks we encountered in the data and had to work around in R.

Re: pivot tables, I think their real strength in Excel is that they feature a graphical user interface (GUI) for non-destructively manipulating data, allowing exploratory analyses to be done in a visual way that quickly produces a visual depiction of the data sliced however the user chooses (similar to Tableau). To me, R is not really for this purpose. Its data transformation capabilities are very powerful, and so are its visualization abilities, but these mostly require two separate, very intentional efforts that take time. I too find myself using pivot tables in Excel often to answer quick questions for smaller sets of data and am not sure if I'd ever switch. This is either an exploratory data analysis and visualization topic, or maybe there should be a decision tree on whether R is warranted for a certain type of task that takes into account the context (e.g., available time, current expertise, explainability, whether files will be shared with others, etc.). We all love decision trees, right? Just a thought.

Related to this, a potential additional topic could be aggregating and merging data in R. It is often relatively straightforward with the tidyverse, but if you're just getting started in R it can get complicated quickly when aggregating rows based on certain conditions (e.g., the number of individuals with a felony drug charge by county) or when merging variables into individual-level records from multiple data sources with different matching criteria (e.g. matching across IDOC data and CHRI data). Mostly this requires duplicate data to have already been addressed (or to be addressed after) and might fit in nicely as a sequel (or prequel) to the duplicate data discussion.
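As a starting point for that discussion, here is a rough sketch of both steps. It uses base R rather than the tidyverse purely to avoid package dependencies, and every table, column name, and value in it (charges, admissions, person_id, doc_id, and so on) is hypothetical.

```r
# Hypothetical individual-level charges (made-up values).
charges <- data.frame(
  person_id   = c(1, 1, 2, 3, 4, 4),
  county      = c("Cook", "Cook", "Cook", "Lake", "Lake", "Lake"),
  charge_type = c("drug", "property", "drug", "property", "drug", "drug"),
  felony      = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Keep felony drug charges, then reduce to one row per person and county
# so that a person with several qualifying charges is counted once.
felony_drug <- charges[charges$felony & charges$charge_type == "drug", ]
people <- unique(felony_drug[, c("person_id", "county")])

# Count individuals by county.
by_county <- aggregate(person_id ~ county, data = people, FUN = length)
names(by_county)[2] <- "n_individuals"

# Hypothetical second source that uses a differently named identifier.
admissions <- data.frame(
  doc_id     = c(1, 2, 5),
  admit_year = c(2017, 2018, 2019)
)

# Left join: keep every person from `people` and add admission data where it matches.
merged <- merge(people, admissions,
                by.x = "person_id", by.y = "doc_id", all.x = TRUE)

print(by_county)
print(merged)
```

Note that the deduplication step (unique()) has to happen before counting, which is why this topic pairs naturally with the duplicate data discussion.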

bobaekang commented 4 years ago

Hey @lgleich3, can you give us some of those "basic data cleaning" operations for which you found yourself turning to SPSS/Excel instead of R? They could be potential comparison cases for this series!

As for the qualitative data analysis, R does have packages that facilitate text analysis. For example, take a look at this online book.

There is also a project/package called RQDA that provides a graphical user interface for qualitative data analysis, although I can't say much other than that it exists. If you're interested, here is a tutorial on using RQDA that is fairly recent and comprehensive (with 26 YouTube clips!). Of course, I have no idea what the possibilities are like on the SPSS side.

agenko2 commented 4 years ago

With respect to basic cleaning, I want to ask about displaying data labels from an SPSS dataset. The variable name appears with no issues, but the labels come back NULL. I would love some help with this because the variable names are not helpful in this case. The labels appear in the "view," but I haven't found a functional way to display them for the purposes of running descriptive statistics.

kgruschow commented 4 years ago

The package sjlabelled has the functions, with what I consider user-hostile function names:

sjlabelled::get_label() - returns the variable label, i.e. the column label

sjlabelled::get_labels() - returns the value labels, i.e. the recodes

As you know I gave up before finding this at least twice, and have to give a shout to my favorite:

remove_all_labels() - returns the object with all SPSS label data structures removed, because I have had a horrible time with consistently accessing SPSS labelled numeric columns as numeric.
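If it helps, the labels can also be inspected without any extra package. The sketch below assumes the file was imported with something like haven::read_sav(), which, as far as I understand, stores the variable label in a "label" attribute and the value labels in a "labels" attribute on each column. To keep it runnable without an actual .sav file, the column is simulated with structure(); the variable q1 and its labels are made up.

```r
# Simulate a column as it might look after importing an SPSS file:
# a numeric vector carrying "label" (variable label) and "labels" (value labels).
survey <- data.frame(id = 1:4)
survey$q1 <- structure(
  c(1, 2, 2, 1),
  label  = "Ever admitted to a state facility",
  labels = c(No = 1, Yes = 2)
)

# Variable label for one column.
attr(survey$q1, "label")

# Variable labels for every column; columns without a label come back NULL.
var_labels <- lapply(survey, attr, which = "label")

# Value labels (the SPSS codes and their meanings).
value_labels <- attr(survey$q1, "labels")

# Turn the coded column into a factor using the value labels,
# which makes frequency tables readable.
survey$q1_factor <- factor(survey$q1,
                           levels = value_labels,
                           labels = names(value_labels))
table(survey$q1_factor)
```

If the attributes themselves are missing, the labels were probably dropped at import or by a later transformation, so it is worth checking attributes() on the column right after reading the file.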

bobaekang commented 4 years ago

@agenko2, incidentally, I shared an article on this exact issue a while ago. The article linked in that Issue post discusses the sjlabelled package @kgruschow mentions as well as some other possible solutions. I recommend giving it a read.

More broadly, the difficulties with SPSS labels in R are part of the cost of switching tools for data analysis work. Going from SPSS to R and vice versa requires a bit of rethinking about data analysis problems and solutions. And working through such differences to gain a deeper understanding of various data analysis tasks is definitely what the SPSS vs R series is about!

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ICJIA / r-user-group

Ideas for SPSS vs R comparison series #35
