TheDataStation / ver

Data Discovery Tools and Systems
MIT License

Label spatial temporal attributes and their granularities #70

Open snowgy opened 9 months ago

snowgy commented 9 months ago

Given a dataset, label attributes that are either spatial or temporal and their granularities.

For example, in this dataset 4jy7-7m68.csv

attribute date is a temporal attribute, and its granularity is day

attribute zip_code is a spatial attribute, and its granularity is zipcode

attribute location is a spatial attribute, and its granularity is geo-coordinate.

This feature needs to be integrated into the ddprofiler of Ver. ddprofiler profiles each attribute in the dataset and labels its data type. You can look at https://github.com/TheDataStation/ver/blob/main/quick_start_cli.md to see how to build and run ddprofiler.

ogiorgil commented 9 months ago

To store whether an attribute is temporal/spatial and its granularity, I assume we should add new attributes to Profile.java? Should temporal/spatial be considered its own dataType, or should (for example) the attribute date still be considered a String, with another attribute in Profile.java indicating that it is temporal?

snowgy commented 9 months ago

There are two options. 1. Extend dataType from a String to a class. This class should be compatible with existing data types such as "text" or "number" and should handle spatio-temporal data types with granularity information. 2. Keep dataType as a String. When you label spatio-temporal attributes, assign dataType as "temporal" or "spatial", and use another field to store the granularity information. You can choose whichever you feel is easier to implement.
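To make option 1 concrete, here is a minimal sketch of what a class-typed dataType could look like. The class name `DataTypeSketch` and its methods are hypothetical and not taken from the Ver codebase; the only point is that plain types such as "text" still render as the same string, while spatio-temporal types carry a granularity.

```java
import java.util.Optional;

// Hypothetical sketch of option 1: dataType becomes a small class instead of a String.
// None of these names come from Ver; they are illustrative only.
final class DataTypeSketch {
    private final String name;        // e.g. "text", "number", "temporal", "spatial"
    private final String granularity; // e.g. "day"; null for plain types

    private DataTypeSketch(String name, String granularity) {
        this.name = name;
        this.granularity = granularity;
    }

    // Existing plain types are created without a granularity.
    static DataTypeSketch plain(String name) {
        return new DataTypeSketch(name, null);
    }

    // Spatio-temporal types carry their granularity with them.
    static DataTypeSketch withGranularity(String name, String granularity) {
        return new DataTypeSketch(name, granularity);
    }

    String name() {
        return name;
    }

    Optional<String> granularity() {
        return Optional.ofNullable(granularity);
    }

    // Plain types print as the bare name, so code that compares against
    // "text" or "number" strings would see the same value as before.
    @Override
    public String toString() {
        return granularity == null ? name : name + "(" + granularity + ")";
    }
}
```

With this shape, `DataTypeSketch.plain("text").toString()` is still `text`, while `DataTypeSketch.withGranularity("temporal", "day")` prints as `temporal(day)`.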

ogiorgil commented 9 months ago

Okay, thank you Yue. Now that I think about it, I think it's best to keep the dataType, since we may still care (for example) whether a zip code is stored as a string or a number. I will add two new attributes: spatialTemporalType, to indicate whether it's spatial or temporal, and an attribute to store its granularity.
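A rough sketch of that plan, for illustration only. `ProfileSketch` is a hypothetical stand-in for the relevant part of Profile.java, not the real class; `spatialTemporalType` is the name proposed above, while the `granularity` field name and the enum values are assumptions.

```java
// Hypothetical stand-in for the relevant part of Profile.java.
// The real class has other fields; only the proposed additions are shown.
final class ProfileSketch {
    // Assumed enum; the thread only says the field marks spatial vs. temporal.
    enum SpatialTemporalType { NONE, SPATIAL, TEMPORAL }

    private final String columnName;
    private final String dataType; // unchanged: still the storage type
    private SpatialTemporalType spatialTemporalType = SpatialTemporalType.NONE;
    private String granularity = ""; // assumed field name; empty when not applicable

    ProfileSketch(String columnName, String dataType) {
        this.columnName = columnName;
        this.dataType = dataType;
    }

    // Records the label without touching dataType.
    void label(SpatialTemporalType type, String granularity) {
        this.spatialTemporalType = type;
        this.granularity = granularity;
    }

    String getColumnName() { return columnName; }
    String getDataType() { return dataType; }
    SpatialTemporalType getSpatialTemporalType() { return spatialTemporalType; }
    String getGranularity() { return granularity; }
}
```

For example, a zip_code column stored as a number would keep `dataType` as it is and additionally be labeled `SPATIAL` with granularity `zipcode`.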

luthfibalaka commented 9 months ago

Based on the discussion in email, the plan is to modify LabelAnalyzer.java to recognize spatial and temporal attributes (along with their granularities) with regex.

But, is there a defined list of granularities for each type that will be supported? (For now, based on the example above, we have [geo-coordinate, zipcode] for spatial and [day] for temporal.) Or, should we just add what we could identify and improve it later?
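As a starting point, the regex approach for the three examples from the issue description could look roughly like the sketch below. This is not the actual LabelAnalyzer.java: the class name, the label strings, the patterns, and the threshold parameter are all assumptions, and the value formats in 4jy7-7m68.csv may differ from the ones matched here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical regex-based labeler; patterns and label names are assumptions.
final class RegexLabelSketch {
    // Labels use the form "type:granularity". Insertion order decides ties.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        // Assumed day formats: 2021-03-01 or 3/1/2021.
        PATTERNS.put("temporal:day",
                Pattern.compile("\\d{4}-\\d{2}-\\d{2}|\\d{1,2}/\\d{1,2}/\\d{4}"));
        // US ZIP or ZIP+4.
        PATTERNS.put("spatial:zipcode",
                Pattern.compile("\\d{5}(-\\d{4})?"));
        // Decimal "lat, lon" pair, optionally wrapped in parentheses.
        PATTERNS.put("spatial:geo-coordinate",
                Pattern.compile("\\(?\\s*-?\\d{1,3}\\.\\d+\\s*,\\s*-?\\d{1,3}\\.\\d+\\s*\\)?"));
    }

    // Returns the first label whose pattern fully matches at least
    // `threshold` (0..1) of the non-empty sample values, or "none".
    static String label(List<String> values, double threshold) {
        List<String> nonEmpty = values.stream()
                .map(String::trim)
                .filter(v -> !v.isEmpty())
                .collect(Collectors.toList());
        if (nonEmpty.isEmpty()) {
            return "none";
        }
        for (Map.Entry<String, Pattern> e : PATTERNS.entrySet()) {
            long hits = nonEmpty.stream()
                    .filter(v -> e.getValue().matcher(v).matches())
                    .count();
            if ((double) hits / nonEmpty.size() >= threshold) {
                return e.getKey();
            }
        }
        return "none";
    }
}
```

Matching on values alone can mislabel columns (any five-digit number looks like a zip code), so a real implementation would probably also want to consult the column name or other profile information.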

snowgy commented 9 months ago

For spatial attributes, you could browse the Chicago open data portal and find a list of common granularities to support. For temporal attributes, it is more straightforward. We want to support granularities along the temporal hierarchy, such as second, minute, hour, day, month, and year.
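The temporal hierarchy could be sketched as an enum plus one pattern per level, as below. This is only an illustration under assumptions: the class and enum names are hypothetical, and it only recognizes ISO-like strings (for example `2021-03-01 12:30`), inferring granularity from how much of the timestamp is written out.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical sketch: infer temporal granularity from the written format
// of a single value. Only ISO-like formats are handled (an assumption).
final class TemporalGranularitySketch {
    // Ordered from coarsest to finest along the temporal hierarchy.
    enum Granularity { YEAR, MONTH, DAY, HOUR, MINUTE, SECOND }

    private static final Map<Granularity, Pattern> FORMATS = new EnumMap<>(Granularity.class);
    static {
        String date = "\\d{4}-\\d{2}-\\d{2}";
        String sep = "[T ]"; // date and time separated by 'T' or a space
        FORMATS.put(Granularity.YEAR, Pattern.compile("\\d{4}"));
        FORMATS.put(Granularity.MONTH, Pattern.compile("\\d{4}-\\d{2}"));
        FORMATS.put(Granularity.DAY, Pattern.compile(date));
        FORMATS.put(Granularity.HOUR, Pattern.compile(date + sep + "\\d{2}"));
        FORMATS.put(Granularity.MINUTE, Pattern.compile(date + sep + "\\d{2}:\\d{2}"));
        FORMATS.put(Granularity.SECOND, Pattern.compile(date + sep + "\\d{2}:\\d{2}:\\d{2}"));
    }

    // Returns the granularity whose pattern matches the whole trimmed value,
    // or empty if the value is not in one of the assumed formats.
    static Optional<Granularity> of(String value) {
        String v = value.trim();
        for (Map.Entry<Granularity, Pattern> e : FORMATS.entrySet()) {
            if (e.getValue().matcher(v).matches()) {
                return Optional.of(e.getKey());
            }
        }
        return Optional.empty();
    }
}
```

Note that this looks only at the format, so a column of timestamps that are all at midnight would still be reported as second-level. Whether that should be labeled as day is a design decision this sketch does not address.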

luthfibalaka commented 9 months ago

Alright, thanks for the info. I will handle the location part, and I will report my progress soon!

luthfibalaka commented 9 months ago

I created a gist listing the location granularities I have found in the Chicago Data Portal: here. As soon as I find another pattern, I will add it there.

Also, I will create a separate PR to include the regex patterns. Maybe it could be reviewed after David's PR (?)