course-dprep / consumer-rating-insights-Yelp

course-dprep-classroom-fall-2024-team-project-no-vs-code-template-team-project created by GitHub Classroom
MIT License

Extract Attributes from data set #42

Closed lucia-ramos-dominguez closed 1 day ago

lucia-ramos-dominguez commented 2 days ago

This is about the issue I mentioned in class. You can look at the data exploration file for this. Currently the attributes column is stored as a dictionary (I think that's what you mentioned). Every attribute has a true or false value linked to it.

"i want to parse the attribute column such that for each hotel i can see the attribute it has.. so use the dictionary.. so for eg. if the attribute RestaurantsDelivery is mentioned then create a new column and make it binary so if it is true for the hotel then 1 else 0 " this is part of the prompt that you asked chatgpt.

Using the code that ChatGPT gave me, I still get an error. It doesn't separate the attribute name from the true or false value, and it doesn't create the binary column we need for a dummy variable.

lucia-ramos-dominguez commented 2 days ago

image

Original code for the attributes

lucia-ramos-dominguez commented 2 days ago

image

The code ChatGPT gave to try to solve this

lucia-ramos-dominguez commented 2 days ago

image

The error that I keep getting. I have pasted this error into the chat multiple times, and every new version of the code it gave me still ends in an error.

lucia-ramos-dominguez commented 2 days ago

We believe that the issue might be due to some of the attributes having a / and some of them not having it. I don't think that all of them follow the same structure.

lucia-ramos-dominguez commented 2 days ago

We found out that the attributes are not all encoded the same way. It is not only true or false: other values such as "quiet" and "average" are there too.

image

image
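
To check how a single attribute is encoded before deciding how to turn it into dummies, it can help to tabulate its raw values. A minimal sketch in R, assuming the values look like the made-up `noise` vector below (the vector name and its contents are illustrative, not taken from the data set):

```r
# Hypothetical raw values for one attribute; the real column may differ
noise <- c("u'quiet'", "'average'", "u'average'", "None", NA)

# Strip an optional leading u prefix and the surrounding single quotes
cleaned <- gsub("^u?'|'$", "", noise)

# Count each distinct value, including missing ones
table(cleaned, useNA = "ifany")
```

An attribute that only shows True/False (plus None or NA) can become a single 0/1 column. One with several levels, like the one above, would need one dummy per level.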

asaarloos commented 2 days ago

@srosh2000 we already tried this kind of code for data prep:

cleaned_data <- business %>% mutate(attributes = gsub("'", '"', attributes))

cleaned_data <- business %>% mutate(attributes = gsub('u"', '"', attributes))

cleaned_data <-business %>% mutate(attributes = gsub('\\"', '"',attributes))

parsed_json <- fromJSON(cleaned_data$attributes)

cleaned_data <- cleaned_data %>% mutate(attributes = lapply(attributes, fromJSON))

cleaned_data <- business %>% mutate(is_valid_json = validate(attributes))

invalid_json_rows <- cleaned_data %>% filter(is_valid_json == FALSE)

cleaned_data <- business %>%
  mutate(attributes = gsub("\\", "\\\\", attributes)) %>% # Handle escaped characters
  mutate(attributes = gsub("'", '"', attributes)) %>% # Replace single quotes
  mutate(attributes = gsub("\n", "\n", attributes)) %>% # Escape newlines
  mutate(attributes = lapply(attributes, fromJSON))

business_attribute <- cleaned_data %>%
  rowwise() %>%
  mutate(attributes = list(fromJSON(attributes))) %>%
  unnest_wider(parsed_json)

business$attributes[2]
?unnest_wider

But we get this error:

Error in mutate():
ℹ In argument: attributes = list(fromJSON(attributes)).
ℹ In row 2.
Caused by error:
! lexical error: invalid char in json text.
  alse", "BusinessParking": "{"garage": False, "street": True,
  (right here) ------^
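
The error message points at a nested dictionary: after swapping single quotes for double quotes, the inner dictionary is still wrapped in quotes and still uses Python-style `False`/`True`, which is not valid JSON. A minimal sketch that reproduces this and shows one possible fix, assuming the raw string looks like the hypothetical `raw` value below (it is modelled on the error text, not copied from the data):

```r
library(jsonlite)

# Hypothetical attribute string modelled on the error message above
raw <- "{'RestaurantsDelivery': 'False', 'BusinessParking': \"{'garage': False, 'street': True}\"}"

# Naive quote swap: the nested dict stays quoted, so the JSON is invalid
naive <- gsub("'", '"', raw)
naive_ok <- isTRUE(validate(naive))

# One possible repair (an assumption, not the only way):
fixed <- gsub('"\\{', '{', naive)                        # drop the quote opening a nested dict
fixed <- gsub('\\}"', '}', fixed)                        # drop the quote closing it
fixed <- gsub("\\bTrue\\b", "true", fixed, perl = TRUE)  # Python booleans to JSON
fixed <- gsub("\\bFalse\\b", "false", fixed, perl = TRUE)
fixed <- gsub("\\bNone\\b", "null", fixed, perl = TRUE)  # Python None to JSON null
fixed_ok <- isTRUE(validate(fixed))

parsed <- fromJSON(fixed)
parsed$BusinessParking$garage
```

The word-boundary replacements would also rewrite the words True, False or None inside free-text values, so this only suits columns where those words appear as flags.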

lucia-ramos-dominguez commented 2 days ago

Code for the branch we created for this issue:

git fetch origin
git checkout 42-extract-attributes-from-data-set

srosh2000 commented 2 days ago

@lucia-ramos-dominguez Please try out this code snippet; it worked for me when I tried it locally:

library(dplyr)
library(tidyr)
library(stringr)
library(jsonlite)

# Function to parse the attribute string into a named list
parse_attributes <- function(attr_string) {
  # Clean the string to make it valid JSON syntax
  cleaned_string <- gsub("u'", "'", attr_string)  # Remove the 'u' prefix
  cleaned_string <- gsub("'", "\"", cleaned_string)  # Replace single quotes with double quotes
  cleaned_string <- gsub("\\\\\"", "\"", cleaned_string)  # Remove escape characters before quotes
  cleaned_string <- gsub('"(True|False)"', '\\L\\1', cleaned_string, perl = TRUE)  # Turn quoted "True"/"False" into bare JSON true/false
  cleaned_string <- gsub(": None", ": null", cleaned_string)  # Replace None with null

  # Remove any invalid escaping inside JSON strings
  cleaned_string <- gsub('(?<=:)\\s*""(.*?)""', '"\\1"', cleaned_string, perl = TRUE)

  # Ensure the string is enclosed in curly braces
  cleaned_string <- paste0("{", str_remove_all(cleaned_string, "^\\{|\\}$"), "}")

  # Attempt to parse the string into a list
  parsed_list <- tryCatch(fromJSON(cleaned_string), error = function(e) NULL)

  return(parsed_list)
}

# Apply the parse_attributes function to each row and convert to a dataframe
parsed_attributes <- data_business %>%
  mutate(attributes_list = lapply(attributes, parse_attributes)) %>%
  unnest_wider(attributes_list)

# Display the resulting dataframe
parsed_attributes
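
Once the attributes are spread into columns, the binary dummy asked for in the issue could be built along these lines. This is a rough sketch: `parsed_demo` is a hypothetical stand-in for the parsed output, and `to_dummy` is an illustrative helper name, not something from the project.

```r
library(dplyr)

# Hypothetical stand-in for a few rows of the parsed output
parsed_demo <- tibble(
  business_id = c("a", "b", "c"),
  RestaurantsDelivery = c("True", "False", NA)
)

# Illustrative helper: 1 if the attribute is true, 0 otherwise (including missing).
# as.character() + tolower() lets it accept both "True" strings and logical TRUE.
to_dummy <- function(x) {
  as.integer(!is.na(x) & tolower(as.character(x)) == "true")
}

dummies <- parsed_demo %>%
  mutate(RestaurantsDelivery_dummy = to_dummy(RestaurantsDelivery))

dummies
```

Treating a missing attribute as 0 follows the "if it is true then 1 else 0" wording of the prompt; keeping `NA` instead would be a reasonable alternative if missing and false should stay distinct.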

lucia-ramos-dominguez commented 2 days ago

It worked! Thank you