Carceral-Ecologies / Carceral-OSHA-Data

GNU General Public License v3.0
2 stars 3 forks source link

Cleaning establishment names and addresses #11

Closed savannahmhunter closed 3 years ago

savannahmhunter commented 4 years ago

I used openrefine to clean up these columns: city and address. I still need to clean the establishment name. Doing so will allow us to look at violations by each unique facility.

savannahmhunter commented 3 years ago

Notes for text cleaning of facility names. Remove: # - & (replace with AND) . We also need to come up with some kind of function that will assign a unique ID # or name to a facility that has the same number (CA DEPT of CORRECTIONS) but has a different address or city or zip code.

savannahmhunter commented 3 years ago

Create a new data cleaning file and do the following on facility names in the prison_insp_viol_2010_op.csv dataset.

  1. Remove commas "," replace with space
  2. Remove all periods and replace with no space (see number 5 under "other issues")
  3. Remove - and replace with space
  4. Remove parentheses "(" and ")" and replace with no space
  5. Remove apostrophe " ' " and replace with no space
  6. Trim leading and trailing space
  7. Remove "/" and replace with space
  8. Remove "#" and replace with no space
  9. Remove "&" and replace with "and"
  10. Replace "CALIFORNIA" with "CA"
  11. Replace "DEPTT" with "DEPT"
  12. Replace "DEPARTMENT" with "DEPT"
  13. Replace "REHABILITATION" with "REHAB"
  14. Replace "CALIF" with "CA"
  15. Replace "FORT" with "FT"

Also need to clean addresses:

  1. Remove commas "," replace with space
  2. Remove all periods and replace with no space
  3. Replace "STREET" with "ST"
  4. Replace "AVENUE" with "AVE"
  5. Replace "DRIVE" with "DR"
  6. Replace "ROAD' with "RD"
  7. Replace " WEST " with " W " (the spaces might be important if some street is called Westover Lane or something)
  8. Replace " SOUTH " with " S " (spaces are important) - check this address "31625 HIGHWAY 101 SOUTH"
  9. Replace " EAST " with " E " (spaces are important)
  10. Replace " NORTH " with " N " (spaces are important)
  11. Remove - and replace with space
  12. Trim leading and trailing space
  13. Remove "#" and replace with no space
  14. Remove "&" and replace with "and"
  15. Remove parentheses "(" and ")" and replace with no space
  16. Remove "ROAD" with "RD"
  17. Remove "BOULEVARD" with "BLVD"
  18. Remove "'" with no space
  19. Remove "WSETERN" and replace with " WESTERN"
  20. Replace "HIGHWAY" with "HWY"

Other Issues:

  1. We need to look at all addresses that have the same street name but different street numbers to check if they are actually unique facilities. Example: activity_nrs 343566097 and 343787164. OR 341926327 and 342607322 (this looks like the same company to me).
  2. We need to look at addresses that have the same street number but different addresses. Example: Has S Central and Central avenue 316345941 and 316346022. Are these the same facility? They also have slightly different names.
  3. We may need to check for any facilities with the same name but different addresses or cities or zipcodes. I can see federal bureau of prisons repeated but they are different facilities because they are in different states.
  4. In NY we have an issue where a facility has very similar names and addresses but they are slightly different. This will need to be double checked and probably fixed through a documented change in the code or in OpenRefine. See activity numbers 310860705, 310862602.
  5. This facility name is going to become a problem when we remove the period and replace with no space: activity_nr = 301375960 name = V.I.DEPARTMENT OF JUSTICE/BUREAU OF CORRECTIONS. We will have to fix this ourselves in the code or using OpenRefine.
  6. Puerto Rico is also in the data set. It would be great if a native speaker could look at the facility names and addresses and let us know if there are any additional issues there. Particularly abbreviations that we will have dealt with in English (ex: change department to dept) but will not be fixed in Spanish.

After we do all of this should open the dataset in Openrefine again and do text faceting to catch any little things.

shapironick commented 3 years ago

This looks great!

In terms of other issues 1, this can be tricky as there are often huge complexes of different prisons/jails/detention centers in the same compound. At some points it feels very arbitrary what is the same facility and what are multiple facilities. We defered to the HIFLD data I sent in a different issue for how the lines were drawn between facilities. I also don't know how OSHA defines a facility. EPA defines it in a somewhat maddening way where a single prison could have different facilities (water treatment plant, sewage treatment, air polluting boilers etc).

On Tue, Oct 13, 2020 at 7:07 PM Savannah Hunter notifications@github.com wrote:

Create a new data cleaning file and do the following on facility names in the prison_insp_viol_2010_op.csv dataset.

  1. Remove commas "," replace with space
  2. Remove all periods and replace with no space
  3. Remove - and replace with space
  4. Remove parentheses "(" and ")" and replace with no space
  5. Remove apostrophe " ' " and replace with no space
  6. Trim leading and trailing space
  7. Remove "/" and replace with space
  8. Remove "#" and replace with no space
  9. Remove "&" and replace with "and"
  10. Replace "CALIFORNIA" with "CA"
  11. Replace "DEPTT" with "DEPT"
  12. Replace "DEPARTMENT" with "DEPT"
  13. Replace "REHABILITATION" with "REHAB"
  14. Replace "CALIF" with "CA"

Also need to clean addresses:

  1. Remove commas "," replace with space
  2. Remove all periods and replace with no space
  3. Replace "STREET" with "ST"
  4. Replace "AVENUE" with "AVE"
  5. Replace "DRIVE" with "DR"
  6. Replace "ROAD' with "RD"
  7. Replace " WEST " with " W " (the spaces might be important if some street is called Westover Lane or something)
  8. Replace " SOUTH " with " S " (spaces are important)
  9. Replace " EAST " with " E " (spaces are important)
  10. Replace " NORTH " with " N " (spaces are important)
  11. Remove - and replace with space
  12. Trim leading and trailing space
  13. Remove "#" and replace with no space
  14. Remove "&" and replace with "and"
  15. Remove parentheses "(" and ")" and replace with no space

Other Issues:

  1. We need to look at all addresses that have the same street name but different street numbers to check if they are actually unique facilities. Example: activity_nrs 343566097 and 343787164. OR 341926327 and 342607322 (this looks like the same company to me).
  2. We need to look at addresses that have the same street number but different addresses. Example: Has S Central and Central avenue 316345941 and 316346022. Are these the same facility? They also have slightly different names.
  3. We may need to check for any facilities with the same name but different addresses or cities or zipcodes. I can see federal bureau of prisons repeated but they are different facilities because they are in different states.

After we do all of this should open the dataset in Openrefine again and do text faceting to catch any little things.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Carceral-Ecologies/Caceral-OSHA-Data/issues/11#issuecomment-708109365, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZH3345FXLS2H4K4GXQLY3SKUBVLANCNFSM4NLTTUJQ .

-- Nicholas Shapiro Assistant Professor UCLA Institute for Society and Genetics Office: (310) 206-2366

shapironick commented 3 years ago

We also saw lots of small typos so sounds great to have another pass of refine for the typos that aren't caught by your great list of cleaning.

On Wed, Oct 14, 2020 at 1:52 PM Nicholas Shapiro nickshapiro@ucla.edu wrote:

This looks great!

In terms of other issues 1, this can be tricky as there are often huge complexes of different prisons/jails/detention centers in the same compound. At some points it feels very arbitrary what is the same facility and what are multiple facilities. We defered to the HIFLD data I sent in a different issue for how the lines were drawn between facilities. I also don't know how OSHA defines a facility. EPA defines it in a somewhat maddening way where a single prison could have different facilities (water treatment plant, sewage treatment, air polluting boilers etc).

On Tue, Oct 13, 2020 at 7:07 PM Savannah Hunter notifications@github.com wrote:

Create a new data cleaning file and do the following on facility names in the prison_insp_viol_2010_op.csv dataset.

  1. Remove commas "," replace with space
  2. Remove all periods and replace with no space
  3. Remove - and replace with space
  4. Remove parentheses "(" and ")" and replace with no space
  5. Remove apostrophe " ' " and replace with no space
  6. Trim leading and trailing space
  7. Remove "/" and replace with space
  8. Remove "#" and replace with no space
  9. Remove "&" and replace with "and"
  10. Replace "CALIFORNIA" with "CA"
  11. Replace "DEPTT" with "DEPT"
  12. Replace "DEPARTMENT" with "DEPT"
  13. Replace "REHABILITATION" with "REHAB"
  14. Replace "CALIF" with "CA"

Also need to clean addresses:

  1. Remove commas "," replace with space
  2. Remove all periods and replace with no space
  3. Replace "STREET" with "ST"
  4. Replace "AVENUE" with "AVE"
  5. Replace "DRIVE" with "DR"
  6. Replace "ROAD' with "RD"
  7. Replace " WEST " with " W " (the spaces might be important if some street is called Westover Lane or something)
  8. Replace " SOUTH " with " S " (spaces are important)
  9. Replace " EAST " with " E " (spaces are important)
  10. Replace " NORTH " with " N " (spaces are important)
  11. Remove - and replace with space
  12. Trim leading and trailing space
  13. Remove "#" and replace with no space
  14. Remove "&" and replace with "and"
  15. Remove parentheses "(" and ")" and replace with no space

Other Issues:

  1. We need to look at all addresses that have the same street name but different street numbers to check if they are actually unique facilities. Example: activity_nrs 343566097 and 343787164. OR 341926327 and 342607322 (this looks like the same company to me).
  2. We need to look at addresses that have the same street number but different addresses. Example: Has S Central and Central avenue 316345941 and 316346022. Are these the same facility? They also have slightly different names.
  3. We may need to check for any facilities with the same name but different addresses or cities or zipcodes. I can see federal bureau of prisons repeated but they are different facilities because they are in different states.

After we do all of this should open the dataset in Openrefine again and do text faceting to catch any little things.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Carceral-Ecologies/Caceral-OSHA-Data/issues/11#issuecomment-708109365, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZH3345FXLS2H4K4GXQLY3SKUBVLANCNFSM4NLTTUJQ .

-- Nicholas Shapiro Assistant Professor UCLA Institute for Society and Genetics Office: (310) 206-2366

-- Nicholas Shapiro Assistant Professor UCLA Institute for Society and Genetics Office: (310) 206-2366

savannahmhunter commented 3 years ago

Nick that's a great point about how OSHA defines facilities. I found this definitions page. I will save this for us. It looks like they define establishment in a similar way to the EPA.

The term establishment means a single physical location where business is conducted or where services or operations are performed. Where distinctly separate activities are performed at a single physical location, each activity shall be treated as a separate establishment. Typically, an establishment as used in this part refers to a field activity, regional office, area office, installation, or facility.

savannahmhunter commented 3 years ago

This has been finished for California. As we find other facility name and address issues we can post them here.

shapironick commented 3 years ago

Great find and progress!

savannahmhunter commented 3 years ago

Our work is never done! lol I realized today in reshaping the dataset by facility name that I need to double check that generic prison names like STATE OF CA DEPT OF CORRECTIONS are actually only facility and are not the name of several facilities. Spot checking the datafile I can see facilities with generic names (like the one above) have several different cities attached. Also there are names for prisons in Avenal and Corcoran which have slightly different names but the same addresses. We will need to make a decision about how to treat those.

shapironick commented 3 years ago

Yes it’s so tricky!

Here is our protocol (you have to click through once more) where it outlines how we made similar decisions.

https://github.com/Carceral-Ecologies/lab_book#manual-coding-prisons-jails-and-detention-centers-with-epa-id-protocol

Sent on the move

On Nov 10, 2020, at 7:49 PM, Savannah Hunter notifications@github.com wrote:

 Our work is never done! lol I realized today in reshaping the dataset by facility name that I need to double check that generic prison names like STATE OF CA DEPT OF CORRECTIONS are actually only facility and are not the name of several facilities. Spot checking the datafile I can see facilities with generic names (like the one above) have several different cities attached. Also there are names for prisons in Avenal and Corcoran which have slightly different names but the same addresses. We will need to make a decision about how to treat those.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

savannahmhunter commented 3 years ago

I believe I have finished cleaning the facility names and addresses! The clean data file is in the google drive and is called CA_OSHA_prison_insp_viol_2010_2019_cleaned_long.

shapironick commented 3 years ago

Incredible!! big congrats, Savannah!

On Tue, Jan 26, 2021 at 12:31 PM Savannah Hunter notifications@github.com wrote:

I believe I have finished cleaning the facility names and addresses! The clean data file is in the google drive and is called CA_OSHA_prison_insp_viol_2010_2019_cleaned_long.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Carceral-Ecologies/Caceral-OSHA-Data/issues/11#issuecomment-767809009, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZH33ZDW46XBU23XSGBWFTS34Q75ANCNFSM4NLTTUJQ .

-- Nicholas Shapiro Assistant Professor UCLA Institute for Society and Genetics Office: (310) 206-2366 Book my office hours here https://calendly.com/n-s/office-hours-monday-1-30-3-30?

savannahmhunter commented 3 years ago

I am closing this issue. If people find issues with the facility names or addresses they can open a new issue and request changes.