PythonDataScience24 / AirBnB-DataScienceProject

GNU General Public License v3.0

Data Wrangling: explain the challenges posed by the inconsistent data set #20

Closed bdravec closed 4 months ago

bdravec commented 4 months ago

Create a DATA_WRANGLING.md file and record the findings about the data set in it. This is linked to issue #10 (clean data set thoroughly).

Problem description

In the "NAME" column, only 61'281 of the 102'349 entries are distinct. This is especially interesting because NAME is a free-text field that describes the listing in several words (e.g. "1 br with a view" or "skyscraper in unique building"). This raises the question: why are there so many duplicates in a text field with only 250 missing values? About 41'000 rows repeat a name that already occurs elsewhere. At this point we don't know how to tell which of the duplicate entries is the correct one, because there is no timestamp with 'last updated' information.
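The counts above can be reproduced and broken down with a few pandas calls. The following is a minimal sketch on invented toy data; the helper name `summarize_name_duplicates` is hypothetical, and only the column name "NAME" comes from the data set.

```python
import pandas as pd


def summarize_name_duplicates(df: pd.DataFrame, column: str = "NAME") -> dict:
    """Count missing, distinct and repeated values in a text column."""
    names = df[column]
    non_missing = names.dropna()
    return {
        "rows": len(names),
        "missing": int(names.isna().sum()),
        "distinct": int(non_missing.nunique()),
        # Rows beyond the first occurrence of each name.
        "surplus_rows": int(non_missing.duplicated(keep="first").sum()),
        # Every row whose name occurs more than once.
        "rows_sharing_a_name": int(non_missing.duplicated(keep=False).sum()),
    }


# Toy data (invented for illustration, not taken from the Airbnb data set).
toy = pd.DataFrame(
    {"NAME": ["1 br with a view", "1 br with a view", "Skyscraper loft", None]}
)
summary = summarize_name_duplicates(toy)
print(summary)
```

On the real data this would be called as `summarize_name_duplicates(df_data)`. The difference between `surplus_rows` and `rows_sharing_a_name` matters when deciding how large the duplicate problem really is.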

Analysis

Proposed solution

There is an unnamed column that could be a database counter: whenever a new entry is created, the counter goes up. We cannot be sure, however, that this is the case. If it is, the entry with the highest counter would be the most up-to-date one.
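If the counter assumption holds, keeping the most recent entry per listing could look roughly like this. This is only a sketch: the column name `counter` stands in for the unnamed column, the helper `keep_latest_by_counter` is hypothetical, and treating rows with the same NAME as the same listing is itself an unverified assumption.

```python
import pandas as pd


def keep_latest_by_counter(
    df: pd.DataFrame, key_columns: list, counter_column: str
) -> pd.DataFrame:
    """Keep, per key, only the row with the highest counter value."""
    # Rows with a missing key cannot be matched to each other, so keep them all.
    has_key = df[key_columns].notna().all(axis=1)
    latest = (
        df[has_key]
        .sort_values(counter_column)
        .drop_duplicates(subset=key_columns, keep="last")
    )
    return pd.concat([latest, df[~has_key]]).sort_index()


# Toy data (invented): "counter" is a placeholder name for the unnamed column.
toy = pd.DataFrame(
    {
        "counter": [1, 2, 3, 4, 5],
        "NAME": ["Loft", "Loft", "Studio", None, None],
        "price": [100, 120, 80, 50, 60],
    }
)
cleaned = keep_latest_by_counter(toy, ["NAME"], "counter")
print(cleaned)
```

Passing more columns in `key_columns` (for example a host identifier, if one exists) would make the match stricter and reduce the risk of merging two different listings that merely share a name.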

bdravec commented 4 months ago

What is the column 'calculated host listings count'? It doesn't seem to be the number of listings each host has, and I don't know what else it could be. Here are a couple of tests that show it's not related to the number of listings:

    # Create a DataFrame where host name is Aldus
    aldus_df = df_data[df_data['host name'] == 'Aldus']
    aldus_df

--> The output is 3 listings that have Aldus as a host; however, their 'calculated host listings count' values are 3, 2, and 1 respectively.

    # Create a DataFrame where host name is Madaline
    madaline_df = df_data[df_data['host name'] == 'Madaline']
    madaline_df

--> The output is 4 listings that have Madaline as a host; however, the 'calculated host listings count' in Madaline's row is 6.
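The two spot checks above can be generalised to every host at once. One thing worth ruling out is that a host name is not unique, so several different hosts could share the name "Aldus". The sketch below compares the reported count with the actual number of rows per group; the column `host id` is an assumption (it is not mentioned in this issue), the helper `compare_host_counts` is hypothetical, and the toy data is invented.

```python
import pandas as pd

REPORTED = "calculated host listings count"


def compare_host_counts(df: pd.DataFrame, group_column: str) -> pd.DataFrame:
    """Compare the reported listings count with the actual rows per group."""
    out = df.groupby(group_column).agg(
        actual_listings=(REPORTED, "size"),
        reported_min=(REPORTED, "min"),
        reported_max=(REPORTED, "max"),
    )
    # Consistent only if all rows agree and match the real number of listings.
    out["consistent"] = (out["reported_min"] == out["reported_max"]) & (
        out["reported_max"] == out["actual_listings"]
    )
    return out


# Toy data (invented): two different hosts who are both called Aldus.
toy = pd.DataFrame(
    {
        "host id": [1, 1, 2],  # assumed column name
        "host name": ["Aldus", "Aldus", "Aldus"],
        REPORTED: [2, 2, 1],
    }
)
by_name = compare_host_counts(toy, "host name")
by_id = compare_host_counts(toy, "host id")
print(by_name)
print(by_id)
```

In the toy data the counts look wrong when grouped by name but are consistent when grouped by identifier. Whether that also explains the real Aldus and Madaline rows has not been checked here.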

bdravec commented 4 months ago

@bdravec ask Riccardo what we should do

bdravec commented 4 months ago

Email sent to Riccardo:

Dear Riccardo,

I hope this finds you well. While wrangling the Airbnb data set we’ve encountered a disturbing inconsistency:

Is this a well-known inconsistency in the data set (new listing is created when user updates a listing)? Without knowing which listings are the ‘updated’ ones, we cannot clean the data very well.

Thanks for your help and insights!

Kindest regards,
The AirBnB-Team
Barbara, Lukas, Viola, Romano