Closed lucia-ramos-dominguez closed 1 day ago
Original code for the attributes
The code ChatGPT gave to try to solve this
The error that I keep getting. I have pasted this error into the chat multiple times, and even when I tried the new code it gave me, it still ended with an error.
We believe that the issue might be due to some of the attributes having a / and some of them not having it. I don't think that all of them follow the same structure.
We found out that the attributes are not all encoded the same way: the values are not only true or false, and other values such as "quiet" and "average" also appear.
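Once the attributes sit in their own columns, a quick tally of the values per attribute shows which ones are really binary and which are categorical. This is only a minimal sketch: the `parsed` tibble below is a hypothetical stand-in for the real data, and its column names and values are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in: one column per attribute, values still stored as text
parsed <- tibble(
  RestaurantsDelivery = c("True", "False", "True"),
  NoiseLevel          = c("quiet", "average", "quiet")
)

# Stack all attribute columns and count how often each value occurs
value_counts <- parsed %>%
  pivot_longer(everything(), names_to = "attribute", values_to = "value") %>%
  count(attribute, value)

value_counts
```

Attributes whose values are only True/False can become dummy variables directly; ones like `NoiseLevel` probably need a separate encoding.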
@srosh2000 we already tried this kind of code for data prep:
cleaned_data <- business %>% mutate(attributes = gsub("'", '"', attributes))
cleaned_data <- business %>% mutate(attributes = gsub('u"', '"', attributes))
cleaned_data <-business %>% mutate(attributes = gsub('\\"', '"',attributes))
parsed_json <- fromJSON(cleaned_data$attributes)
cleaned_data <- cleaned_data %>% mutate(attributes = lapply(attributes, fromJSON))
cleaned_data <- business %>% mutate(is_valid_json = validate(attributes))
invalid_json_rows <- cleaned_data %>% filter(is_valid_json == FALSE)
cleaned_data <- business %>%
  mutate(attributes = gsub("\\", "\\\\", attributes)) %>% # Handle escaped characters
  mutate(attributes = gsub("'", '"', attributes)) %>% # Replace single quotes
  mutate(attributes = gsub("\n", "\n", attributes)) %>% # Escape newlines
  mutate(attributes = lapply(attributes, fromJSON))
business_attribute <- cleaned_data %>%
  rowwise() %>%
  mutate(attributes = list(fromJSON(attributes))) %>%
  unnest_wider(parsed_json)
business$attributes[2]
?unnest_wider
But we get this error:
Error in mutate():
ℹ In argument: attributes = list(fromJSON(attributes)).
ℹ In row 2.
Caused by error:
! lexical error: invalid char in json text.
          alse", "BusinessParking": "{"garage": False, "street": True,
                     (right here) ------^
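The fragment in the error suggests what is going wrong: the nested `BusinessParking` dictionary is itself wrapped in quotes, and `True`/`False` are Python literals rather than JSON ones, so swapping single quotes for double quotes is not enough. Here is a minimal sketch of that diagnosis. The `raw` string is a hypothetical example modelled on the error message, not a row copied from the data, and the cleanup does not cover every case (for instance the `u'...'` prefix, or a text value that happens to contain the word True).

```r
library(jsonlite)

# Hypothetical attribute string, modelled on the fragment in the error message
raw <- "{'RestaurantsDelivery': 'False', 'BusinessParking': \"{'garage': False, 'street': True}\"}"

# Swapping quotes alone leaves the nested dict wrapped in quotes: not valid JSON
naive <- gsub("'", '"', raw, fixed = TRUE)
validate(naive)

# Unwrap the nested dict, then map Python literals to their JSON equivalents
fixed <- gsub('"{', "{", naive, fixed = TRUE)
fixed <- gsub('}"', "}", fixed, fixed = TRUE)
fixed <- gsub('"(True|False|None)"', "\\1", fixed, perl = TRUE) # drop quotes around literals
fixed <- gsub("\\bTrue\\b", "true", fixed, perl = TRUE)
fixed <- gsub("\\bFalse\\b", "false", fixed, perl = TRUE)
fixed <- gsub("\\bNone\\b", "null", fixed, perl = TRUE)

parsed <- fromJSON(fixed)
str(parsed)
```

Calling `validate()` before `fromJSON()` is a cheap way to see which rows are still broken after each cleaning step.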
Code for the branch we created for this issue:
git fetch origin
git checkout 42-extract-attributes-from-data-set
@lucia-ramos-dominguez Please try out this code snippet, worked for me when I tried it locally:
library(dplyr)
library(tidyr)
library(stringr)
library(jsonlite)
# Function to parse the attribute string into a named list
parse_attributes <- function(attr_string) {
  # Clean the string to make it valid JSON syntax
  cleaned_string <- gsub("u'", "'", attr_string) # Remove the 'u' prefix
  cleaned_string <- gsub("'", "\"", cleaned_string) # Replace single quotes with double quotes
  cleaned_string <- gsub("\\\\\"", "\"", cleaned_string) # Remove escape characters before quotes
  cleaned_string <- gsub('"(True|False)"', '\\L\\1', cleaned_string, perl = TRUE) # Convert quoted "True"/"False" to unquoted lowercase true/false
  cleaned_string <- gsub(": None", ": null", cleaned_string) # Replace None with null
  # Remove any invalid escaping inside JSON strings
  cleaned_string <- gsub('(?<=:)\\s*""(.*?)""', '"\\1"', cleaned_string, perl = TRUE)
  # Ensure the string is enclosed in curly braces
  cleaned_string <- paste0("{", str_remove_all(cleaned_string, "^\\{|\\}$"), "}")
  # Attempt to parse the string into a list; return NULL if parsing fails
  parsed_list <- tryCatch(fromJSON(cleaned_string), error = function(e) NULL)
  return(parsed_list)
}
# Apply the parse_attributes function to each row and spread the results into columns
parsed_attributes <- data_business %>%
  mutate(attributes_list = lapply(attributes, parse_attributes)) %>%
  unnest_wider(attributes_list)
# Display the resulting dataframe
parsed_attributes
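One thing to keep in mind with this approach: because `parse_attributes()` returns `NULL` when parsing fails, broken rows do not raise an error and simply end up with missing values. A small sketch for counting them before unnesting might look like the following. The `attributes_list` here is a hypothetical stand-in for the real list column, with made-up values.

```r
# Hypothetical stand-in for the list column produced by lapply(attributes, parse_attributes)
attributes_list <- list(
  list(WiFi = "free"),
  NULL,                 # a row that failed to parse
  list(WiFi = "no")
)

# Flag the rows where parsing returned NULL
failed <- vapply(attributes_list, is.null, logical(1))

sum(failed)    # number of rows that could not be parsed
which(failed)  # their positions, for inspecting the raw strings
```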
It worked! Thank you
This is about the issue I mentioned in class. You can look at the data exploration file for this. Currently the attributes column is stored as a dictionary (I think that's what you called it). Every attribute has a true or false value linked to it.
"i want to parse the attribute column such that for each hotel i can see the attribute it has.. so use the dictionary.. so for eg. if the attribute RestaurantsDelivery is mentioned then create a new column and make it binary so if it is true for the hotel then 1 else 0 " This is part of the prompt that you gave ChatGPT.
Using the code that ChatGPT gave me, I still get an error: it doesn't separate the attribute name from the true or false value, and it doesn't create the binary column we need in order to build a dummy variable.
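For the dummy-variable step itself, a minimal sketch is below. It assumes the attributes have already been spread into one column per attribute; the `parsed` tibble, its column names, and its values are hypothetical. Following the prompt ("if it is true then 1 else 0"), missing values are coded as 0, which may or may not be what the analysis needs.

```r
library(dplyr)

# Hypothetical stand-in for the parsed data: one column per attribute
parsed <- tibble(
  business_id         = c("a", "b", "c"),
  RestaurantsDelivery = c(TRUE, FALSE, NA),
  RestaurantsTakeOut  = c("True", "False", "True"),
  NoiseLevel          = c("quiet", "average", NA)
)

# 1 if the value is true (logical TRUE or the text "True"), otherwise 0
to_dummy <- function(x) as.integer(tolower(as.character(x)) %in% "true")

dummies <- parsed %>%
  mutate(across(c(RestaurantsDelivery, RestaurantsTakeOut), to_dummy))

dummies
```

Non-binary attributes such as `NoiseLevel` are left untouched here and would need their own encoding.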