MODA-NYC / db-recovery-data-partnership

Data pipelines for datasets that are part of the Recovery Data Partnership project
https://www1.nyc.gov/site/analytics/initiatives/recovery-data-partnership.page
12 stars 8 forks source link

BetaNYC: normalization of attribute values #8

Closed AmandaDoyle closed 4 years ago

AmandaDoyle commented 4 years ago

Below captures some categories that might apply to BetaNYC data

Input value uptowngrandcentral sunnyside east-village north brooklyn Output value
Open Open Opened Open Open
Closed Closed Closed Closed
Follow-up Unknown Follow-up ; No Answer NULL
Input value uptowngrandcentral sunnyside east-village north brooklyn Output value
true Yes; Yes (Call ahead) true Yes ; Yes (Call ahead) yes
false - false No no

NULL / blank is mapped to NULL

Input value uptowngrandcentral east-village north brooklyn Output value
true Yes; Yes (Call for delivery) ; Yes - limited zone true Yes ; Yes (Call for delivery); Yes (Call ahead); Yes - only in special cases; Yes - limited zone yes
false No false No no

NULL / blank is mapped to NULL

Input value uptowngrandcentral sunnyside east-village north brooklyn Output value
true true true - yes

NULL / blank is mapped to NULL

Input value uptowngrandcentral sunnyside east-village north brooklyn Output value
true true true - yes

NULL / blank is mapped to NULL

SPTKL commented 4 years ago

instead of overwriting category and subcategory, should we look at them together and derive a dcp_category? e.g. we can have category = resturant and bars but subcategory as cafe, ideally this should be categorized as resturant and cafe, but the current recoding will convert it to resturant and bars

AmandaDoyle commented 4 years ago

instead of overwriting category and subcategory, should we look at them together and derive a dcp_category? e.g. we can have category = resturant and bars but subcategory as cafe, ideally this should be categorized as resturant and cafe, but the current recoding will convert it to resturant and bars

I like this idea. Would you like to draft a proposal? Or I can work on a proposal after standup

SPTKL commented 4 years ago

WIP: category

select 
    distinct 
    (CASE 
        WHEN betanyc_category ~* 'retail|bike|Bicycle|store|Pet Supplies|Pharmacy' 
            THEN 'Dry retail' 
        WHEN betanyc_category ~*  'service|free food|Laundromat|Pantry|Wellness|'||
                            'Soup Kitchen|Health|Doctor|Deliveries' 
            THEN 'Services'
        WHEN betanyc_category IS NULL THEN NULL
        ELSE 'Food and beverage'
    END) as category, 
    betanyc_category, 
    betanyc_subcategory
from betanyc.latest
order by category, betanyc_category

Looking at betanyc_subcategory:

select 
    distinct 
    (CASE 
        WHEN betanyc_subcategory ~* 'wine|beer|tea|market|food|bakery|bar|restaurant|grocery|cafe' 
        OR betanyc_category ~* 'bakery|bar|grocer|restaurant|liquor|dessert|coffee|pantry|soup kitchen|food' 
            THEN 'Food and beverage'
        WHEN betanyc_subcategory ~* 'clothing|fashion|shoe|goods|electronics|hardware|books|home|bicycle|supplies|pet'
        OR betanyc_category ~* 'retail|bike|bicycle|store|pet|pharmacy' 
            THEN 'Dry retail' 
        WHEN betanyc_subcategory ~* 'bank|finance|government|repair|shipping|fitness|copies|law|flor|'
        OR betanyc_category ~*  'laundromat|wellness|health|doctor|community service|shipping' 
            THEN 'Services'
        ELSE 'Other'
    END) as category, 
    betanyc_category, 
    betanyc_subcategory
from betanyc.latest
order by category, betanyc_category
mgraber commented 4 years ago

WIP: Categories with subcategories

WITH cat AS (SELECT 
    DISTINCT 
    (CASE 
        WHEN betanyc_subcategory ~* 'wine|beer|tea|market|food|bakery|bar|restaurant|grocery|cafe' 
        OR betanyc_category ~* 'bakery|bar|grocer|restaurant|liquor|dessert|coffee|pantry|soup kitchen|food' 
            THEN 'Food and beverage'
        WHEN betanyc_subcategory ~* 'clothing|fashion|shoe|goods|electronics|hardware|books|home|bicycle|supplies|pet|stationar|building|dollar store|appliance|pharma'
        OR betanyc_category ~* 'retail|bike|bicycle|store|pet|pharmacy|art|yarn' 
            THEN 'Dry retail' 
        WHEN betanyc_subcategory ~* 'bank|finance|government|repair|shipping|fitness|copies|law|flor'
        OR betanyc_category ~*  'laundromat|wellness|health|doctor|community service|shipping' 
            THEN 'Services'
        ELSE 'Other'
    END) AS category, 
    betanyc_category, 
    betanyc_subcategory
FROM betanyc.latest
ORDER BY category, betanyc_category)

SELECT
    *, 
    (CASE 
        WHEN category='Food and beverage'
        THEN CASE
            WHEN betanyc_subcategory ~* 'diner|meal*takaway|meal*delivery|restaurant'
                OR betanyc_category ~* 'restaurant' 
            THEN 'Restaurant'
            WHEN betanyc_subcategory ~* 'bakery|cafe|coffee|cream|tea|cake|bagel|sandwich'
                OR betanyc_category ~* 'bakery|dessert|coffee' 
            THEN 'Bakeries cafes and desserts'
            WHEN betanyc_subcategory ~* 'grocer|market|convenience|spice|butcher'
                OR betanyc_category ~* 'grocer' 
            THEN 'Grocery and market'
            WHEN betanyc_subcategory ~* 'wine|bar|beer|liquor|night*club' 
                OR betanyc_category ~* 'bar|liquor'
            THEN 'Alcohol and bars'
            WHEN betanyc_subcategory ~* 'pantry|soup kitchen|free' 
            THEN 'Free food'
            ELSE 'Other food and beverage'
        END
        WHEN category='Dry retail'
        THEN CASE
            WHEN betanyc_subcategory ~* 'bike|bicycle'
                OR betanyc_category ~* 'bike|bicycle' 
            THEN 'Bike'
            WHEN betanyc_subcategory ~* 'hardware|garden|home|furniature|flor|appliance|dollar'
                OR betanyc_category ~* 'hardware' 
            THEN 'Hardware and home goods'          
            WHEN betanyc_subcategory ~* 'pet'
                OR betanyc_category ~* 'pet' 
            THEN 'Pet supplies'
            WHEN betanyc_subcategory ~* 'pharma|health|drug'
                OR betanyc_category ~* 'pharma' 
            THEN 'Health and beauty'    
            WHEN betanyc_subcategory ~* 'cloth|fashion|shoe|jewel' 
            THEN 'Apparel'
            WHEN betanyc_subcategory ~* 'electronic|computer'
            THEN 'Electronics'
            WHEN betanyc_subcategory ~* 'book|stationery|gifts|art'
            THEN 'Books art and gifts'
            ELSE 'Other retail'
        END
        WHEN category='Services'
        THEN CASE
            WHEN betanyc_subcategory ~* 'fitness|gym|pilates|yoga|martial arts'
            THEN 'Fitness'
            WHEN betanyc_subcategory ~* 'doctor|dentist|health|medic'
            THEN 'Healthcare'
            WHEN betanyc_subcategory ~* 'financ|bank|tax|account|insurance|law'
            THEN 'Finance and legal'
            WHEN betanyc_subcategory ~* 'hair|beauty|salon'
            THEN 'Beauty'
            WHEN betanyc_subcategory ~* 'laundr'
            THEN 'Laundry'
            WHEN betanyc_subcategory ~* 'copies|copy|shipping|post|photo'
            THEN 'Copies and shipping'
            WHEN betanyc_subcategory ~* 'repair'
            THEN 'Repair'
            WHEN betanyc_subcategory ~* 'veter'
            THEN 'Veterinary'
            WHEN betanyc_subcategory ~* 'flor'
            THEN 'Floral'
            ELSE 'Other service'
        END
    ELSE 'Other'
    END) as subcategory
FROM cat
ORDER BY category, subcategory, betanyc_category, betanyc_subcategory;