Harmonized and combined the datasets (issue #9), and implemented the correlation analyzer.
Changes
Implemented a script called harmonize_data.py, which generates the final output data
It requires several inputs:
Processed Hotel reviews at datasets/tripadvisor_output.xlsx
Crime data in a directory at datasets/metropolitan-street
No file limits are currently applied to the crime data; I included all crime data from the Metropolitan Street area, more than 3 million rows overall (some of which get dropped during processing)
The data is committed to this repo, located at datasets/final_data_avg.csv
The data is combined per postcode sector; sectors are slightly smaller than postcode districts
Each sector contains the count of hotel reviews, the count of crimes, the area (m^2), and currently the averages of the extracted features
It should be noted that we could also use other methods for combining the extracted features, such as median or sum
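The per-sector combination described above can be sketched roughly as follows. This is an illustrative sketch only: the column names (`sector`, `sentiment_positive`) and the helper `combine_per_sector` are assumptions for the example, not necessarily what harmonize_data.py actually uses.

```python
import pandas as pd

def combine_per_sector(crime: pd.DataFrame, reviews: pd.DataFrame) -> pd.DataFrame:
    """Aggregate crime and review data per postcode sector.

    Column names here are placeholders; the real inputs may differ.
    """
    crime_count = crime.groupby("sector").size().rename("crime_count")
    review_agg = reviews.groupby("sector").agg(
        review_count=("sentiment_positive", "size"),
        # Mean is used here; median or sum would be alternative choices.
        sentiment_avg=("sentiment_positive", "mean"),
    )
    return review_agg.join(crime_count, how="inner").reset_index()

# Tiny inline demo data standing in for the real crime/review inputs.
crime = pd.DataFrame({"sector": ["SW1A 1", "SW1A 1", "E1 6"]})
reviews = pd.DataFrame({
    "sector": ["SW1A 1", "E1 6", "E1 6"],
    "sentiment_positive": [0.9, 0.4, 0.6],
})
print(combine_per_sector(crime, reviews))
```

Swapping `"mean"` for `"median"` or `"sum"` in the aggregation would give the alternative combination methods mentioned above.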
The correlation analysis is done in a Jupyter Notebook file, ending with .ipynb
The Notebook can be opened and the code run within VS Code after installing the requirements from requirements.txt
The correlation analysis focuses on finding a variable that explains variable crime_density
Meaning that crime_density is plotted on the y-axis, and we try to find a good variable for the x-axis
In statistical terms, we are looking for an explanatory variable for the response variable crime_density, using NLP methods
The correlation analysis contains four steps:
Initialize data. Calculate a new column, crime_density, which is basically the crime count divided by the area (m^2)
Visualize some data: generate scatter plots of crime_density against the extracted variables (those starting with sentiment_)
Correlation analysis. First test the distribution of the data for each metric variable and decide which method to use. The results show that the data is normally distributed, so we should be good with Pearson's method. Then measure the correlation between all variables and produce a correlation matrix.
Findings and conclusions
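Steps 1 and 3 above can be sketched as below. The column names `area_m2` and `sentiment_positive` are placeholders, not necessarily those in final_data_avg.csv, and the numbers are made-up demo values; in the notebook the normality check would use a test such as scipy.stats.shapiro before settling on Pearson's method.

```python
import pandas as pd

# Made-up demo values standing in for final_data_avg.csv rows.
df = pd.DataFrame({
    "crime_count": [45, 80, 120, 300],
    "area_m2": [8.0e5, 9.0e5, 1.0e6, 1.5e6],
    "sentiment_positive": [0.61, 0.55, 0.42, 0.30],
})

# Step 1: crime density = crime count divided by area (m^2).
df["crime_density"] = df["crime_count"] / df["area_m2"]

# Step 3: Pearson correlation between one candidate explanatory
# variable and crime_density, plus the full correlation matrix.
r = df["sentiment_positive"].corr(df["crime_density"])  # Pearson by default
corr_matrix = df.corr(method="pearson")
print(r)
```

With these demo values, sentiment falls as crime density rises, so `r` comes out negative; the real data may of course behave differently.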
We could also consider searching for multiple explanatory variables, although it's probably a long shot
Things left to be done
The repository will contain (at least) the processed hotel review data, as well as the final, combined and harmonized data. We should probably review the contents of the repository and prepare some sort of instructions on how to do all relevant steps of the project (e.g. sourcing original data, generating crime heatmaps, extracting features, running the data combination and correlation analysis). Luckily, some of those steps are already documented. Do you have other ideas / thoughts?
Old task description
Implement a way to compare heatmaps (matrices)
Should give a float between 0 and 1 as output, where 1 means that the matrices are identical
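One way the old comparison task could be satisfied is sketched below. This scheme (min-max normalizing both heatmaps, then mapping the mean absolute difference into [0, 1]) is an assumption for illustration, not necessarily the method that was implemented.

```python
import numpy as np

def heatmap_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Return 1.0 for identical matrices, decreasing toward 0.0 as they diverge."""
    if a.shape != b.shape:
        raise ValueError("heatmaps must have the same shape")

    def norm(m: np.ndarray) -> np.ndarray:
        # Scale to [0, 1] so absolute magnitudes don't dominate the comparison.
        rng = m.max() - m.min()
        return (m - m.min()) / rng if rng > 0 else np.zeros_like(m, dtype=float)

    # Mean absolute difference of the normalized maps is itself in [0, 1].
    diff = np.abs(norm(a) - norm(b)).mean()
    return 1.0 - diff

a = np.array([[1.0, 2.0], [3.0, 4.0]])
print(heatmap_similarity(a, a))  # identical matrices give 1.0
```

One caveat of this sketch: because both inputs are normalized first, a heatmap and a scaled copy of it also score 1.0, which may or may not be desirable.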