Roche / pyreadstat

Python package to read sas, spss and stata files into pandas data frames. It is a wrapper for the C library readstat.

Wrong alignment for SAV (SPSS) files and enhance to read roles #246

Open lodonnel opened 10 months ago

lodonnel commented 10 months ago

Task: To read in the file specification of an SPSS data file, where FILE_PATH is the path to the attached file.

df, data = pyreadstat.read_sav(FILE_PATH, metadataonly=False)

Description

3 Problems

Bug: 1. Attributes missing_ranges and missing_user_values remain at {} even though the file does have missing value entries in Variable View in SPSS. ...

From PyCharm:

missing_ranges = {dict: 0} {}
missing_user_values = {dict: 0} {}

Bug: 2. variable_alignment is a set of key/value pairs where the value is always 'unknown':

variable_alignment = {dict: 12} {'ID': 'unknown', 'Responded': 'unknown', 'Previous': 'unknown', 'Controlpackage': 'unknown', 'Age': 'unknown', 'Income': 'unknown', 'Education': 'unknown', 'Reside': 'unknown', 'Gender': 'unknown', 'Married': 'unknown', 'Children': 'unknown', 'Region': 'unknown'}

C22326161.zip

Enhancement: 3. role is currently not included in the attributes of the data block.


To Reproduce

df, data = pyreadstat.read_sav(FILE_PATH, metadataonly=False)

The data block holds the attributes mentioned above.

File example Attached

Expected behavior

  1. Both missing_ranges and missing_user_values should be populated correctly.
  2. variable_alignment should be populated with {Left, Right, Center} instead of 'unknown'.
  3. role should be added as an attribute with possible values {Input, Target, Both, None, Partition, Split}.

Setup Information:

How did you install pyreadstat? (pip, conda, directly from repo): using pip
Platform (windows, macOS, linux, 32 or 64 bit): windows 10
Python Version: 3.9
Python Distribution (System, plain python, Anaconda):
Using Virtualenv or condaenv?: venv

ofajardo commented 9 months ago

Thanks for the reproducible issue.

Regarding missing_ranges, please use the argument user_missing=True to get them. Please take the time to read the section about missing values in the README, where this is explained. missing_user_values is only for SAS and Stata, as described in the [module documentation](https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html), and therefore will always be empty for SPSS files.
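The suggestion above can be sketched roughly as follows. This is a minimal illustration, not code from the thread: `read_with_missing_ranges` and `is_user_missing` are hypothetical helper names, the sample dictionary is made up, and the `lo`/`hi` list-of-dicts layout for `missing_ranges` is assumed from the pyreadstat README.

```python
def read_with_missing_ranges(path):
    """Read a SAV file and return (df, missing_ranges). Hypothetical helper."""
    import pyreadstat  # imported here so the pure helper below works without it

    # user_missing=True keeps user-defined missing values in the data
    # (instead of converting them to NaN) and populates meta.missing_ranges.
    df, meta = pyreadstat.read_sav(path, user_missing=True)
    return df, meta.missing_ranges


def is_user_missing(value, ranges):
    """Return True if value lies inside any {'lo': ..., 'hi': ...} range.

    Assumed layout: {'var': [{'lo': 1, 'hi': 7}, {'lo': 99, 'hi': 99}]},
    where a discrete missing value is stored with lo == hi.
    """
    return any(r["lo"] <= value <= r["hi"] for r in ranges)


# Made-up example data, not taken from the attached file.
example = {"Age": [{"lo": 1, "hi": 7}, {"lo": 99, "hi": 99}]}
print(is_user_missing(99, example["Age"]))
print(is_user_missing(50, example["Age"]))
```

On a real file, the second element returned by `read_with_missing_ranges(FILE_PATH)` would be passed to `is_user_missing` one variable at a time.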

ofajardo commented 9 months ago

For alignment, it seems that the underlying C library ReadStat is not reading it correctly. The same library does not currently define any function to extract role, so it cannot be obtained in pyreadstat.

If you can, please report these issues to ReadStat directly with your example. Otherwise I may do it at a later point in time. Be aware that issues in ReadStat typically take very long to be solved, so a solution to those two is not likely to appear in the near future.
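Until this is addressed upstream, a small check like the one below could flag the affected variables in a given file. It is only a sketch: `unread_alignments` is a hypothetical helper, and the dictionary is a subset of the values reported in this issue. In practice one would pass `meta.variable_alignment` from `pyreadstat.read_sav`.

```python
def unread_alignments(variable_alignment):
    """Return the variable names whose alignment came back as 'unknown'."""
    return [name for name, value in variable_alignment.items() if value == "unknown"]


# Subset of the values reported in this issue.
reported = {"ID": "unknown", "Responded": "unknown", "Age": "unknown"}
print(unread_alignments(reported))
```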