ddionrails / collect_stata

Accumulate data from stata files and write it into an open format
BSD 3-Clause "New" or "Revised" License

Change write_json to accept a different data structure for the metadata argument #56

Closed hansendx closed 5 years ago

hansendx commented 5 years ago

#54 adds a method to extract the metadata of a .dta file in a different data structure.

This data structure is based on the current output of write_json. write_json should take this structure as input and add all further data (statistics etc.) to it. The result can then be written directly to a file.
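
The flow described here, taking the list-based metadata, attaching further data, and writing the result, could look roughly like the following. This is a minimal sketch and not the repository's implementation: `enrich_metadata`, its parameters, the plain dict standing in for the dataset, and the rule that `None` counts as an invalid value are all assumptions made for illustration.

```python
import json
from typing import Any, Dict, List, Optional


def enrich_metadata(
    metadata: List[Dict[str, Any]],
    data: Dict[str, List[Optional[Any]]],
    study: str,
) -> List[Dict[str, Any]]:
    """Return a copy of the metadata list with study and basic counts added.

    Hypothetical helper: ``data`` maps a variable name to its column of
    values, and ``None`` is treated as an invalid (missing) value.
    """
    enriched = []
    for variable in metadata:
        values = data.get(variable["name"], [])
        valid = sum(value is not None for value in values)
        entry = dict(variable)  # shallow copy; the input list is left untouched
        entry["study"] = study
        entry["statistics"] = {"valid": valid, "invalid": len(values) - valid}
        enriched.append(entry)
    return enriched


metadata = [{"name": "HM04", "label": "Miete in Euro", "type": "int", "scale": "number"}]
data = {"HM04": [300, None, 850, 1025]}
result = enrich_metadata(metadata, data, study="soep-test")

# The enriched list is already in its final shape and can be serialized directly.
print(json.dumps(result, indent=2, ensure_ascii=False))
```

Because the input is already a list of per-variable dicts, no restructuring is needed before serialization; the function only adds keys.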

todo[bot] commented 5 years ago

This function should be able to handle a list as metadata input.

https://github.com/ddionrails/collect_stata/blob/7f1ca90293585356dc345f6e811a6438673725bd/collect_stata/write_json.py#L269-L285


This comment was generated by todo based on a TODO comment in ab516b9e3a814debfbc7e496367f1d5ec5f8bd7f in #56. cc @ddionrails.

mpahl commented 5 years ago

Test metadata:

metadata_new = [
    {
        "name": "HKIND",
        "label": "Kinder",
        "type": "category",
        "scale": "cat",
        "categories": {
            "values": [-1, 1, 2],
            "labels": ["keine Antwort", "Ja", "Nein"],
            "missings": [True, False, False],
        },
    },
    {
        "name": "HM04",
        "label": "Miete in Euro",
        "type": "int",
        "scale": "number",
    },
    {
        "name": "HKGEBA",
        "label": "test for string var",
        "type": "str",
        "scale": "string",
    },
]

Output:

[
  {
    "name": "HKIND",
    "label": "Kinder",
    "type": "category",
    "scale": "cat",
    "categories": {
      "values": [
        -1,
        1,
        2
      ],
      "labels": [
        "keine Antwort",
        "Ja",
        "Nein"
      ],
      "missings": [
        true,
        false,
        false
      ],
      "frequencies": [
        0,
        6,
        5
      ]
    },
    "study": "soep-test",
    "statistics": {
      "valid": 11,
      "invalid": 1
    }
  },
  {
    "name": "HM04",
    "label": "Miete in Euro",
    "type": "int",
    "scale": "number",
    "study": "soep-test",
    "statistics": {
      "Min.": 300.0,
      "1st Qu.": 522.5,
      "Median": 850.0,
      "Mean": 746.5,
      "3rd Qu.": 972.5,
      "Max.": 1025.0,
      "valid": 10,
      "invalid": 2
    }
  },
  {
    "name": "HKGEBA",
    "label": "test for string var",
    "type": "str",
    "scale": "string",
    "study": "soep-test",
    "statistics": {
      "valid": 6,
      "invalid": 6
    }
  }
]
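
The statistics keys of the numeric variable in the output above resemble a five-number summary plus the mean. As a rough illustration of how such a block might be computed with the standard library alone, here is a sketch. `numeric_summary` is a hypothetical name, treating `None` as invalid is an assumption, and the quartile method (linear interpolation via `method="inclusive"`) is chosen for illustration only, so it need not match what collect_stata does.

```python
import statistics
from typing import Dict, List, Optional


def numeric_summary(values: List[Optional[float]]) -> Dict[str, float]:
    """Summarize a numeric column; None marks an invalid (missing) value."""
    valid = sorted(float(v) for v in values if v is not None)
    summary: Dict[str, float] = {}
    if len(valid) >= 2:  # statistics.quantiles needs at least two data points
        q1, median, q3 = statistics.quantiles(valid, n=4, method="inclusive")
        summary = {
            "Min.": valid[0],
            "1st Qu.": q1,
            "Median": median,
            "Mean": statistics.fmean(valid),
            "3rd Qu.": q3,
            "Max.": valid[-1],
        }
    # Counts are always reported, even when no summary can be computed.
    summary["valid"] = len(valid)
    summary["invalid"] = len(values) - len(valid)
    return summary


example = numeric_summary([1, 2, 3, 4, None])
print(example)
```

Keeping the valid and invalid counts outside the length check means string or empty columns would still get those two keys, as the string variable in the output does.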

todo[bot] commented 5 years ago

Setting dataset for every variable creates a lot of redundancy.

https://github.com/ddionrails/collect_stata/blob/c622b08f44e3ec14fc5b18659f0409a109a91c27/collect_stata/read_stata.py#L72-L77


This comment was generated by todo based on a TODO comment in c622b08f44e3ec14fc5b18659f0409a109a91c27 in #56. cc @ddionrails.