Harmonized and combined the datasets (issue #9), and implemented the correlation analyzer.
Changes
Implemented a script called harmonize_data.py, which generates the final output data
It requires several inputs:
Processed Hotel reviews at datasets/tripadvisor_output.xlsx
Crime data in a directory at datasets/metropolitan-street
No file limits are currently applied to the crime data; I included all crime data from the Metropolitan Street area, more than 3 million rows overall (some of which get dropped during processing)
The data is committed to this repo, located at datasets/final_data_avg.csv
The data is combined per postcode sector; sectors are slightly smaller than postcode districts
Each sector contains the count of hotel reviews, the count of crimes, the area (m^2), and currently the averages of the extracted features
It should be noted that we could also use other methods for combining the extracted features, such as median or sum
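The per-sector combination described above can be sketched roughly as follows. This is an illustrative sketch only: the column names (`sector`, `sentiment_positive`) and the helper `combine_per_sector` are assumptions for the example, not necessarily what harmonize_data.py actually uses.

```python
import pandas as pd

def combine_per_sector(crime: pd.DataFrame, reviews: pd.DataFrame) -> pd.DataFrame:
    """Aggregate crime and review data per postcode sector.

    Column names here are placeholders; the real inputs may differ.
    """
    crime_count = crime.groupby("sector").size().rename("crime_count")
    review_agg = reviews.groupby("sector").agg(
        review_count=("sentiment_positive", "size"),
        # Mean is used here; median or sum would be alternative choices.
        sentiment_avg=("sentiment_positive", "mean"),
    )
    return review_agg.join(crime_count, how="inner").reset_index()

# Tiny inline demo data standing in for the real crime/review inputs.
crime = pd.DataFrame({"sector": ["SW1A 1", "SW1A 1", "E1 6"]})
reviews = pd.DataFrame({
    "sector": ["SW1A 1", "E1 6", "E1 6"],
    "sentiment_positive": [0.9, 0.4, 0.6],
})
print(combine_per_sector(crime, reviews))
```

Swapping `"mean"` for `"median"` or `"sum"` in the aggregation would give the alternative combination methods mentioned above.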
The correlation analysis is done in a Jupyter Notebook file, ending with .ipynb
The Notebook can be opened and the code run within VS Code after installing the requirements from requirements.txt
The correlation analysis focuses on finding a variable that explains variable crime_density
Meaning that crime_density is plotted on the y-axis, and we try to find a good variable for the x-axis
In statistical terms, we are looking for an explanatory variable for the response variable crime_density, using NLP methods
The correlation analysis contains four steps:
Initialize data. Calculate a new column, crime_density, which is basically the crime count divided by the area (m^2)
Visualize some data: generate scatter plots of crime_density against the extracted variables (those starting with sentiment_)
Correlation analysis. First test the distribution of the data for each metric variable and decide which method to use. The results show that the data is normally distributed, so we should be good with Pearson's method. Then measure the correlation between all variables and produce a correlation matrix.
Findings and conclusions
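Steps 1 and 3 above can be sketched as below. The column names `area_m2` and `sentiment_positive` are placeholders, not necessarily those in final_data_avg.csv, and the numbers are made-up demo values; in the notebook the normality check would use a test such as scipy.stats.shapiro before settling on Pearson's method.

```python
import pandas as pd

# Made-up demo values standing in for final_data_avg.csv rows.
df = pd.DataFrame({
    "crime_count": [45, 80, 120, 300],
    "area_m2": [8.0e5, 9.0e5, 1.0e6, 1.5e6],
    "sentiment_positive": [0.61, 0.55, 0.42, 0.30],
})

# Step 1: crime density = crime count divided by area (m^2).
df["crime_density"] = df["crime_count"] / df["area_m2"]

# Step 3: Pearson correlation between one candidate explanatory
# variable and crime_density, plus the full correlation matrix.
r = df["sentiment_positive"].corr(df["crime_density"])  # Pearson by default
corr_matrix = df.corr(method="pearson")
print(r)
```

With these demo values, sentiment falls as crime density rises, so `r` comes out negative; the real data may of course behave differently.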
We could also consider searching for multiple explanatory variables, although it's probably a long shot
Things left to be done
The repository will contain (at least) the processed hotel review data, as well as the final, combined and harmonized data. We should probably review the contents of the repository and prepare some sort of instructions on how to do all relevant steps of the project (e.g. sourcing original data, generating crime heatmaps, extracting features, running the data combination and correlation analysis). Luckily, some of those steps are already documented. Do you have other ideas / thoughts?
Old task description
Implement a way to compare heatmaps (matrices)
Should give a float between 0 and 1 as output, where 1 means that the matrices are identical
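One way the old comparison task could be satisfied is sketched below. This scheme (min-max normalizing both heatmaps, then mapping the mean absolute difference into [0, 1]) is an assumption for illustration, not necessarily the method that was implemented.

```python
import numpy as np

def heatmap_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Return 1.0 for identical matrices, decreasing toward 0.0 as they diverge."""
    if a.shape != b.shape:
        raise ValueError("heatmaps must have the same shape")

    def norm(m: np.ndarray) -> np.ndarray:
        # Scale to [0, 1] so absolute magnitudes don't dominate the comparison.
        rng = m.max() - m.min()
        return (m - m.min()) / rng if rng > 0 else np.zeros_like(m, dtype=float)

    # Mean absolute difference of the normalized maps is itself in [0, 1].
    diff = np.abs(norm(a) - norm(b)).mean()
    return 1.0 - diff

a = np.array([[1.0, 2.0], [3.0, 4.0]])
print(heatmap_similarity(a, a))  # identical matrices give 1.0
```

One caveat of this sketch: because both inputs are normalized first, a heatmap and a scaled copy of it also score 1.0, which may or may not be desirable.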