DistrictDataLabs / cultivar

Multidimensional data explorer and visualization tool.
http://trinket.districtdatalabs.com
Apache License 2.0
52 stars 18 forks

Continuous or Categorical #23

Open DataFighter opened 8 years ago

DataFighter commented 8 years ago

Trinket should have some feature to determine if data is continuous or categorical.

The system should make an initial guess on behalf of the user. Ultimately, however, the user should have control.
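A minimal sketch of what "guess, but let the user decide" could look like. Nothing here is Trinket code: `guess_kind`, `column_kinds`, the `overrides` argument, and the `max_unique` threshold are all hypothetical names and values chosen for illustration, and the distinct-value cutoff is just one possible heuristic.

```python
from numbers import Number


def guess_kind(values, max_unique=10):
    """Guess 'categorical' or 'continuous' for one column.

    Hypothetical heuristic: non-numeric columns are categorical, and
    numeric columns with at most `max_unique` distinct values are
    treated as coded categories. The threshold is arbitrary.
    """
    present = [v for v in values if v is not None]  # ignore missing cells
    if not present:
        return "categorical"
    # Strings and booleans are never treated as continuous here.
    if not all(isinstance(v, Number) and not isinstance(v, bool) for v in present):
        return "categorical"
    if len(set(present)) <= max_unique:
        return "categorical"
    return "continuous"


def column_kinds(columns, overrides=None):
    """Guess a kind for every column; a user-supplied override always wins."""
    overrides = overrides or {}
    return {
        name: overrides.get(name, guess_kind(values))
        for name, values in columns.items()
    }


columns = {
    "height": [150.0 + 0.5 * i for i in range(40)],  # 40 distinct floats
    "rating": [1, 2, 3, 4, 5] * 8,                   # 5 distinct integers
    "color": ["red", "blue", None, "red"],
}
guessed = column_kinds(columns)
corrected = column_kinds(columns, overrides={"rating": "continuous"})
```

Here the guess for `rating` is categorical, and the second call shows the user overriding it without touching the other columns.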

rebeccabilbro commented 8 years ago

@doctorf72 is working on this in a fork during the sprints

doctorf72 commented 8 years ago

This is taken from the Messy Tables documentation. The type_guess function in the types module tries to guess each column's type by counting the number of successful conversions. Unfortunately, Messy Tables defines no Categorical data type. The most suitable candidate is the String type:

types.type_guess(rows, types=[<class 'messytables.types.StringType'>, <class 'messytables.types.DecimalType'>, <class 'messytables.types.IntegerType'>, <class 'messytables.types.DateType'>, <class 'messytables.types.BoolType'>], strict=False)

The type guesser aggregates the number of successful conversions of each column to each type, weights them by a fixed type priority, and selects the most probable type for each column based on that figure. It returns a list of CellType. Empty cells are ignored.

Strict means that a type will not be guessed if parsing fails for a single cell in the column.
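To make the description above concrete, here is a rough standard-library sketch of the same idea: count successful conversions per column, weight them by type priority, and pick the top score. It does not use or reproduce Messy Tables itself. `guess_types`, the `CANDIDATES` table, the converters, and the weights are assumptions made up for this illustration, and the real library's weights and parsing rules may differ.

```python
from collections import defaultdict
from datetime import datetime
from decimal import Decimal, InvalidOperation


def _to_bool(text):
    # Assumed set of boolean spellings, for illustration only.
    if text.strip().lower() in {"true", "false", "yes", "no"}:
        return True
    raise ValueError(text)


def _to_date(text):
    # Only ISO dates are recognised in this sketch.
    return datetime.strptime(text.strip(), "%Y-%m-%d")


# (type name, converter, weight). The weights are invented priorities.
CANDIDATES = [
    ("String", str, 1),
    ("Decimal", Decimal, 2),
    ("Integer", int, 3),
    ("Date", _to_date, 4),
    ("Bool", _to_bool, 5),
]


def guess_types(rows, strict=False):
    """Return one guessed type name per column."""
    scores = defaultdict(lambda: defaultdict(int))
    failed = defaultdict(set)
    n_cols = 0
    for row in rows:
        n_cols = max(n_cols, len(row))
        for col, cell in enumerate(row):
            if cell is None or not str(cell).strip():
                continue  # empty cells are ignored
            for name, convert, weight in CANDIDATES:
                try:
                    convert(str(cell))
                except (ValueError, InvalidOperation):
                    failed[col].add(name)
                else:
                    scores[col][name] += weight
    guesses = []
    for col in range(n_cols):
        # In strict mode a single failed cell disqualifies a type.
        usable = {
            name: score
            for name, score in scores[col].items()
            if not (strict and name in failed[col])
        }
        guesses.append(max(usable, key=usable.get) if usable else "String")
    return guesses


rows = [
    ["1", "1.5", "2016-01-02", "red"],
    ["2", "2", "2016-02-03", ""],
    ["3", "x", "2016-03-04", "blue"],
]
loose = guess_types(rows)                 # second column is guessed as Decimal
strict = guess_types(rows, strict=True)   # the "x" cell forces String instead
```

The second column shows the effect of the strict flag: without it, the two numeric cells outweigh the single unparseable one. Note that free-text columns still come back as String, so a categorical check would have to be layered on top.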

Next up: Pandas.
