Open snowgy opened 9 months ago
To store whether an attribute is temporal or spatial, along with its granularity, I assume we should add new attributes to Profile.java? Should temporal/spatial be considered its own dataType, or should (for example) the attribute date still be considered a String, with another attribute in Profile.java to indicate that it is temporal?
There are two options. 1. Extend dataType from a String to a class. This class should be compatible with existing data types such as "text" or "number" and able to handle spatio-temporal data types with granularity information. 2. Keep dataType as a String. When you label spatio-temporal attributes, assign dataType as "temporal" or "spatial", and use another field to store the granularity information. You can choose whichever you feel is easier to implement.
Okay, thank you Yue. Now that I think about it, I think it's best to keep the dataType as is, since we may still care (for example) whether a zip code is stored as a string or a number. I will add two new attributes: spatialTemporalType, to indicate whether it's spatial or temporal, and an attribute to store its granularity.
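To make that concrete, here is a rough sketch of what the two extra attributes could look like. The class name AttributeProfile, the enum, and the constructor are hypothetical and only for illustration; the real Profile.java may be organized quite differently.

```java
// Hypothetical sketch: names are illustrative, not the actual Profile.java in ver.
enum SpatialTemporalType { NONE, SPATIAL, TEMPORAL }

class AttributeProfile {
    final String columnName;
    final String dataType;                          // storage type stays as before, e.g. "text" or "number"
    final SpatialTemporalType spatialTemporalType;  // NONE for ordinary attributes
    final String granularity;                       // e.g. "zipcode" or "day"; null when type is NONE

    AttributeProfile(String columnName, String dataType,
                     SpatialTemporalType spatialTemporalType, String granularity) {
        // A granularity only makes sense for spatial or temporal attributes.
        if (spatialTemporalType == SpatialTemporalType.NONE && granularity != null) {
            throw new IllegalArgumentException("granularity requires a spatial or temporal type");
        }
        this.columnName = columnName;
        this.dataType = dataType;
        this.spatialTemporalType = spatialTemporalType;
        this.granularity = granularity;
    }
}
```

With this shape, a zip code column can still report dataType "number" while being labeled SPATIAL with granularity "zipcode".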
Based on the discussion in email, the plan is to modify LabelAnalyzer.java to recognize spatial and temporal attributes (along with their granularities) with regex.
But, is there a defined list of granularities for each type that will be supported? (For now, based on the example above, we have [geo-coordinate, zipcode] for spatial and [day] for temporal.) Or, should we just add what we could identify and improve it later?
For spatial attributes, you could browse the Chicago open data portal and find a list of common granularities to support. For temporal attributes, it is more straightforward. We want to support granularities along the temporal hierarchy, such as second, minute, hour, day, month, and year.
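As a starting point for the temporal hierarchy, a regex-based detector might look roughly like the sketch below. It assumes ISO-8601-style values only (for example 2021-03-15 or 2021-03-15T08:30:00); the class name TemporalGranularity and the method detect are made up for illustration and are not part of LabelAnalyzer.java.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: maps a single value to a temporal granularity, assuming ISO-8601-like formats.
class TemporalGranularity {
    // Listed from finest to coarsest. Each pattern must match the whole value,
    // so at most one pattern can match a given string.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        String date = "\\d{4}-\\d{2}-\\d{2}";
        PATTERNS.put("second", Pattern.compile(date + "[T ]\\d{2}:\\d{2}:\\d{2}"));
        PATTERNS.put("minute", Pattern.compile(date + "[T ]\\d{2}:\\d{2}"));
        PATTERNS.put("hour", Pattern.compile(date + "[T ]\\d{2}"));
        PATTERNS.put("day", Pattern.compile(date));
        PATTERNS.put("month", Pattern.compile("\\d{4}-\\d{2}"));
        PATTERNS.put("year", Pattern.compile("\\d{4}"));
    }

    // Returns the granularity name, or null if the value matches no known pattern.
    static String detect(String value) {
        String trimmed = value.trim();
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            if (entry.getValue().matcher(trimmed).matches()) {
                return entry.getKey();
            }
        }
        return null;
    }
}
```

Note that the year pattern accepts any four-digit number, so in practice it would probably need to be combined with a hint from the column name or a range check.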
Alright, thanks for the info. I will handle the location part, and I will report my progress soon!
I created a gist listing the location granularities I have found in the Chicago Data Portal: here. As soon as I find another pattern, I will add it there.
Also, I will create a separate PR to include the regex patterns. Maybe it could be reviewed after David's PR (?)
Given a dataset, label attributes that are either spatial or temporal and their granularities.
For example, in this dataset, 4jy7-7m68.csv:
- attribute `date` is a temporal attribute, and its granularity is day
- attribute `zip_code` is a spatial attribute, and its granularity is zipcode
- attribute `location` is a spatial attribute, and its granularity is geo-coordinate

This feature needs to be integrated into the ddprofiler of Ver. ddprofiler profiles each attribute in the dataset and labels its data type. You can look at https://github.com/TheDataStation/ver/blob/main/quick_start_cli.md to see how to build and run ddprofiler.
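For the two spatial granularities in the example, a minimal sketch of value-level labeling could look like this. The patterns are assumptions (US-style five-digit ZIP codes and "(lat, lon)" decimal pairs), and the class name SpatialGranularity is hypothetical rather than anything in ddprofiler.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: labels a single value as zipcode or geo-coordinate, or returns null.
class SpatialGranularity {
    // Assumed US ZIP format: five digits, optionally followed by a four-digit extension.
    private static final Pattern ZIPCODE = Pattern.compile("\\d{5}(-\\d{4})?");
    // Assumed "(lat, lon)" pair of signed decimals; the parentheses are optional.
    private static final Pattern GEO_COORDINATE = Pattern.compile(
            "\\(?\\s*-?\\d{1,2}(\\.\\d+)?\\s*,\\s*-?\\d{1,3}(\\.\\d+)?\\s*\\)?");

    static String detect(String value) {
        String trimmed = value.trim();
        if (ZIPCODE.matcher(trimmed).matches()) {
            return "zipcode";
        }
        if (GEO_COORDINATE.matcher(trimmed).matches()) {
            return "geo-coordinate";
        }
        return null;
    }
}
```

A column-level label would then come from applying detect to a sample of values and keeping the granularity most of them agree on. The patterns only check the shape of the value, not whether the latitude and longitude fall in valid ranges.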