OSMPH / dpwh_bridges

challenge repo for DPWH's Road and Bridges Inventory dataset

Data Pre-processing #2

Open govvin opened 2 years ago

govvin commented 2 years ago

Formatting and Style Guide

See https://github.com/OSMPH/dpwh_bridges/wiki/Processing/_edit#formating-and-style-guidelines-aka-style-manual

govvin commented 2 years ago

Province-level clean-up tasks:

To do

Tasks:

mdgabriel1 commented 2 years ago

Hi @govvin, I completed checking the following provinces:

thiscaspar commented 1 year ago

I used a script to do some mass clean-up (see my fork).

The result is in a Google Sheet for easy review: https://docs.google.com/spreadsheets/d/1tPG7NJx7EuEXjY7HCg8oY9Cyqwz1H0JrXsSf5dyE_k0/edit#gid=893538581

Barangays and municipalities are mostly fixed automatically. To give you an idea, this is the code:

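// Expand abbreviations in a bridge name and add the "№" numbering.
// Note: replace() with a string pattern only changes the first occurrence.
function cleanName(str) {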
    return str
        .replaceAll('Br.', 'Bridge')
        .replace('(NB)', ' (Northbound)')
        .replace(' NB)', ' Northbound)')
        .replace('NB)', ' Northbound)')
        .replace(' NB ', ' Northbound ')
        .replace('(SB)', '(Southbound)')
        .replace(' SB ', ' Southbound ')
        .replace('(WB)', '(Westbound)')
        .replace(' WB ', ' Westbound ')
        .replace('(EB)', '(Eastbound)')
        .replace(' EB ', ' Eastbound ')
        .replace('Gov. ', 'Governor ')
        .replace('Arch Reyes', 'Archbishop Reyes')
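        // Prefix the last space-separated number with "№" ("Bridge 2" -> "Bridge № 2")
        .replace(/(.*)( \d)/g, "$1 №$2")
        // Collapse any runs of spaces left behind by the replacements above
        .replace(/ {2,}/g, ' ')
        .replace('( ', '(')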
}

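// Normalize a municipality name: tidy whitespace, title-case, then apply known fixes.
function cleanMunicipality(str) {
    return str
        // Collapse whitespace runs and trim (this also strips any leading space;
        // the earlier /$\s(.*)/ step could never match and has been dropped)
        .replace(/\s+/g, ' ').trim()
        // Title-case each space-separated word. A word that starts with a comma
        // (",lanao") stays lowercase, which is why some patterns below are lowercase.
        .toLowerCase()
        .split(' ')
        .map(word => word.charAt(0).toUpperCase() + word.substring(1))
        .join(' ')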
        .replace('Sta.', 'Santa')
        .replace('Sta ', 'Santa ')
        .replace('Zambonga City', 'Zamboanga City')
        .replace("Brookes Point", "Brooke's Point")
        .replace("Brook's Point", "Brooke's Point")
        .replace("Busuanga, Palawan", "Busuanga")
        .replace(", Cebu", "")
        .replace(", Sorsogon City", "")
        .replace(", Ilocos Norte", "")
        .replace(", Rizal", "")
        .replace(",lanao Del Norte", "")
        .replace(", Lanao Del Norte", "")
        .replace(", Agusan Del Sur", "")
        .replace(", Province Of Dinagat Islands", "")
        .replace(", Leyte", "")
        .replace(", N. Samar", "")
        .replace(", Quezon", "")
        .replace(",zds.", "")
        .replace(", Cam. Sur", "")
        .replace(", N Samar", "")
        .replace(",capiz", "")
        .replace(", Ilocos Sur", "")
        .replace(",zamboanga Del Sur", "")
        .replace(", Zamboanga Del Sur", "")
        .replace(" ,tarlac", "")
        .replace(", Albay", "")
        .replace(", Palawan", "")
        .replace(", Northern Samar", "")
        .replace(",tarlac", "")
        .replace(", Zds.", "")
        .replace("Sergio Osmena, Sr.", "Sergio Osmeña")
}

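// Normalize a barangay name: tidy whitespace, title-case, then apply known fixes.
function cleanBarangay(str) {
    return str
        // Collapse whitespace runs and trim (this also strips any leading space;
        // the earlier /$\s(.*)/ step could never match and has been dropped)
        .replace(/\s+/g, ' ').trim()
        // Title-case each space-separated word. The part after a hyphen stays
        // lowercase ("Herero-perez"), so some patterns below match that form.
        .toLowerCase()
        .split(' ')
        .map(word => word.charAt(0).toUpperCase() + word.substring(1))
        .join(' ')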
        .replace('Brgy. ', '')
        .replace('Bgy. ', '')
        .replace('Barangay ', '')
        .replace('Sta.', 'Santa')
        .replace('Sta ', 'Santa ')
        .replace('Sto.', 'Santo')
        .replace('Sto ', 'Santo ')
        .replaceAll('Brgys.', 'Barangays')
        .replace('Pob.', 'Poblacion')
        .replace('Pobl;acion', 'Poblacion')
        .replace('Herero-perez', 'Herrero-Perez')
        .replace('New Bususnga', 'New Busuanga')
}
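
One caveat with the chains above: `replace` with a string pattern only changes the first occurrence, so a value that contains the same abbreviation twice is only partly fixed. A minimal table-driven sketch using `replaceAll` (as `cleanName` already does for 'Br.') could look like the following. `BARANGAY_EXPANSIONS` and `expandAll` are hypothetical names that are not part of the fork, and the sample input is made up:

```javascript
// Abbreviation -> expansion pairs copied from cleanBarangay above.
// BARANGAY_EXPANSIONS and expandAll are illustrative names, not from the fork.
const BARANGAY_EXPANSIONS = [
    ['Sta.', 'Santa'],
    ['Sto.', 'Santo'],
    ['Brgys.', 'Barangays'],
    ['Pob.', 'Poblacion'],
];

// Apply every pair in order, replacing all occurrences rather than just the first.
function expandAll(str, pairs) {
    return pairs.reduce((acc, [from, to]) => acc.replaceAll(from, to), str);
}

// Made-up input with the same abbreviation twice.
console.log(expandAll('Sta. Cruz And Sta. Ana', BARANGAY_EXPANSIONS));
// prints "Santa Cruz And Santa Ana"
```

Keeping the pairs in a list also makes it easier to review or extend them without touching the function body.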

There are still 69 missing barangays and 19 missing municipalities. Some entries are not "clean" yet (they still include the region, or have messy formatting for multiple barangays).
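
To list the rows that still need a manual pass, something along these lines could work. This is only a sketch: the row shape (`name`, `municipality`, `barangay`), the `needsReview` helper, and the sample rows are assumptions for illustration, not the actual sheet columns or data.

```javascript
// Made-up sample rows; the real sheet's column names may differ.
const rows = [
    { name: 'Sample Bridge A', municipality: 'Sampletown', barangay: 'Poblacion' },
    { name: 'Sample Bridge B', municipality: '', barangay: 'Poblacion' },
    { name: 'Sample Bridge C', municipality: 'Sampletown, Region X', barangay: 'San Jose / San Juan' },
];

// Return a list of reasons why a row still needs manual review (empty if none).
function needsReview(row) {
    const issues = [];
    for (const field of ['municipality', 'barangay']) {
        const value = (row[field] || '').trim();
        if (value === '') {
            issues.push(`${field} missing`);
        } else if (/[,;\/&]/.test(value)) {
            // A separator hints at a leftover region/province or several names in one cell.
            issues.push(`${field} may hold extra parts`);
        }
    }
    return issues;
}

// Keep only the rows that have at least one issue.
const flagged = rows
    .map(row => ({ name: row.name, issues: needsReview(row) }))
    .filter(entry => entry.issues.length > 0);

console.log(flagged);
```

The separator check is deliberately loose, so it will also flag legitimate names that contain a comma; it is meant to shorten the review list, not to replace it.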

I stuck to using " № X" for bridge numbering, since it seems the cleanest.
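
To illustrate the numbering convention, here is a minimal sketch that reuses the regex from `cleanName` above. The `numberBridge` name and the sample bridge names are made up for illustration, and the early return is an added assumption so that running the clean-up again over already cleaned names does not double the sign:

```javascript
// Hypothetical helper; only the regex itself comes from cleanName above.
function numberBridge(name) {
    // Assumption: a name that already contains "№" has been numbered, so leave it alone.
    if (name.includes('№')) return name;
    // The greedy (.*) makes the match land on the LAST " <digit>" in the name.
    return name.replace(/(.*)( \d)/, '$1 №$2');
}

console.log(numberBridge('Sample Bridge 2'));              // "Sample Bridge № 2"
console.log(numberBridge('Sample Bridge 2 (Northbound)')); // "Sample Bridge № 2 (Northbound)"
console.log(numberBridge('Sample Bridge № 2'));            // unchanged
```

Because only the last space-plus-digit is prefixed, a name with an earlier number in it (for example a kilometre marker) keeps that number as it is.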

Hope this helps, let me know if anything needs adjustment.