Questions on Clustering Data and Script

creds2 / Excess-Data-Exploration

Repo for initial exploration of data in the Excess project

GNU Lesser General Public License v3.0

Questions on Clustering Data and Script #2

Open timchatterton opened 5 years ago

timchatterton commented 5 years ago

@mem48 Hi - thought I would test this out for some questions working through your script.

Is there a reason you don't use data.table?

Age table - What are Mode1 and 2? (Most often and second most often?)

EPC table - Crr = Current and Ptn = Potential? How can potential_Mode be lower than Current_Mode, e.g. for E01000001?

"Joining factors with different levels" Error - I have bluffed my way through properly understanding factors for too long - I think I might need you to explain them for me please!
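A minimal sketch of what the factor message is about, assuming it comes from joining two tables whose key column is a factor (the thread does not say which function raised it). A factor stores integer codes plus a lookup table of levels, so the same code can mean different things in two tables. The table and column names below are hypothetical, and the values are made up for illustration.

```r
# Two hypothetical tables keyed by LSOA code, with the key stored as a factor
age <- data.frame(lsoa = factor(c("E01000001", "E01000002")), age_mode = c(3, 5))
epc <- data.frame(lsoa = factor(c("E01000002", "E01000003")), epc_mode = c(4, 2))

# A factor is a vector of integer codes plus a set of levels
levels(age$lsoa)                    # "E01000001" "E01000002"
age_codes <- as.integer(age$lsoa)   # 1 2
levels(epc$lsoa)                    # "E01000002" "E01000003"
epc_codes <- as.integer(epc$lsoa)   # 1 2 (same codes, different areas)

# Converting the keys to character before joining removes the ambiguity
age$lsoa <- as.character(age$lsoa)
epc$lsoa <- as.character(epc$lsoa)
joined <- merge(age, epc, by = "lsoa", all = TRUE)
print(joined)
```

If the message came from a dplyr join, it is normally a warning that the key has been coerced to character rather than a failure, so the join result should still be usable.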

First k-means graph - no clear elbow? Presume this is why you go on to dendrograms?
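For reference, an elbow plot of this kind is usually built from the total within-cluster sum of squares for a range of k. The sketch below uses random placeholder data rather than the project's tables, so it is only an illustration of the technique and not the script under discussion. Random data like this has no real cluster structure, which is one situation where no clear elbow appears.

```r
set.seed(1)
# Placeholder data: 200 rows, 4 numeric variables, scaled to mean 0 and sd 1
x <- scale(matrix(rnorm(200 * 4), ncol = 4))

# Total within-cluster sum of squares for k = 1..15
ks <- 1:15
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# An "elbow" is a k after which the curve flattens noticeably
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```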

Line 141 has fit in it (cutree(fit,k=13)) - but fit is not created until line 149. Even if you jump forward to 149 and run the fit<-pvclust line, you get an error going back to 142 for the groups<-cutree... Without this you can't create the groups to allocate to lsoa_house$hcluster in line 163. Any idea what needs tweaking to get those clusters allocated?
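One possible reading of the problem, sketched without having seen the script: cutree() expects an hclust tree, so the tree has to exist before it is cut, and a pvclust result is not itself an hclust object. The example below uses base R hclust on placeholder data; the Ward linkage and the data frame contents are assumptions, and only the names fit, groups and lsoa_house$hcluster are taken from the question.

```r
set.seed(1)
# Placeholder for the numeric LSOA table: 60 rows, 4 scaled variables
x <- scale(matrix(rnorm(60 * 4), ncol = 4))

# 1. Build the tree first (linkage method chosen here only for illustration)
fit <- hclust(dist(x), method = "ward.D2")

# 2. Only then cut it into 13 groups; one group label per row of x
groups <- cutree(fit, k = 13)

# 3. Attach the labels to the table (hypothetical stand-in for lsoa_house)
lsoa_house <- data.frame(id = seq_len(nrow(x)))
lsoa_house$hcluster <- groups

# If fit came from pvclust instead, the tree sits inside the result:
#   groups <- cutree(fit$hclust, k = 13)
# Note that pvclust clusters the columns of its input, so that call would
# group variables rather than rows unless the data were transposed first.
table(lsoa_house$hcluster)
```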

timchatterton commented 5 years ago

To cluster including energy usage or not?

With Energy usage certainly seems to make for more interesting results - especially with gas!

timchatterton commented 5 years ago

That second one above was without energy values, income, rooms or HHsize... This one is just without energy values - not much more elucidating...

timchatterton commented 5 years ago

I think my logic is that we need to group areas using both energy and some characteristics - in order to then explain high areas by the known characteristics - and then unknown variations which we explore by other means? I don't believe that it is possible to predict energy usage through social and structural factors alone - but once we get to identify some different groups of high usage areas, we can then identify low areas with the same (non-energy) characteristics and contemplate what the differences might be... though here we are clearly missing out a lot of key factors such as urban/rural location, employment, social profiling etc.

This was the cluster map for the top pair of bar charts - still working on interpreting it, but now shutting down. Have a good weekend.

Clusters 2, 11, 5 and 12 (12 being off-gas-grid areas, I presume)

timchatterton commented 5 years ago

I have copied over the script I used for making 13 kmeans clusters to the joint folder. I have also created a TempTables folder, and in there is a copy of the table from the clustering so that we can keep the same cluster names for now. I have also created a table with 12 clusters (using the same methodology), as it is much easier to do multiple plot outputs (2x6, 3x4) for 12 as opposed to 13!

timchatterton commented 5 years ago

These are the 13 clusters ordered according to gas + electricity (and given letters)
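A rough sketch of how clusters can be relabelled with letters in order of combined energy use. The data frame, its column names and its values are hypothetical placeholders, and ascending order (A = lowest) is an assumption; the thread does not say which direction was used.

```r
set.seed(1)
# Hypothetical stand-in for the clustered table: 300 areas in 13 clusters
dat <- data.frame(
  cluster = rep(1:13, length.out = 300),
  gas     = runif(300, 5000, 20000),
  elec    = runif(300, 2000, 6000)
)
dat$energy <- dat$gas + dat$elec

# Mean combined energy per cluster, then cluster ids from lowest to highest
means  <- tapply(dat$energy, dat$cluster, mean)
ranked <- names(sort(means))

# Map each numeric cluster id to a letter: A = lowest mean energy (assumed)
lookup <- setNames(LETTERS[seq_along(ranked)], ranked)
dat$cluster_letter <- factor(
  unname(lookup[as.character(dat$cluster)]),
  levels = LETTERS[seq_along(ranked)]
)

# Check: mean energy by letter should now be in increasing order
round(tapply(dat$energy, dat$cluster_letter, mean))
```

Saving the lookup alongside the table keeps the numeric ids and the letters traceable to each other.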

timchatterton commented 5 years ago

Here with 12 clusters - they don't look too different - so will focus mainly on 12 for now - I have also updated the tables in TempTables to include these letters

timchatterton commented 5 years ago

And here is the division of clusters by LSOA Classification (Super Groups) (annoyingly I can't work out how to neatly/quickly force ggplot to do a full axis of A to L without missing out the no-data clusters)
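On the axis question, one approach that may work is to store the cluster letter as a factor with all twelve levels and tell the discrete scale not to drop unused ones. The sketch below uses a tiny made-up vector of cluster letters; the ggplot2 part only runs if the package is installed.

```r
# Hypothetical cluster letters for one panel; most of A to L are absent
clusters <- factor(c("A", "A", "C", "F", "L"), levels = LETTERS[1:12])

# Because the levels are declared up front, empty clusters are kept as zeros
counts <- as.data.frame(table(cluster = clusters))
print(counts)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(cluster = clusters), aes(x = cluster)) +
    geom_bar() +
    scale_x_discrete(drop = FALSE)   # keep all 12 letters on the axis
  print(p)
}
```

With faceted plots the same idea applies, provided the factor levels are set on the full table before it is split.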

timchatterton commented 5 years ago

And here are the 23 groups

timchatterton commented 5 years ago

This was the script I was using... I:\Github\Excess-Data-Exploration\Tim\RScripts\Clusters\Look at 13 clusters by LSOAC.R

Robinlovelace commented 5 years ago

Good questions. Partial answer to this one:

Is there a reason you don't use data.table?

data.table is useful when speed is critical. In this situation it's not.

mem48 commented 5 years ago

@timchatterton are you using the classifications in the clustering or just describing the clusters with the classifications? Also, we know rural-urban matters, but kmeans is numeric only so I'm going to add population density to the set of input datasets.
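When a variable such as population density is added, it may be worth scaling the inputs first, since k-means works on Euclidean distance and a variable with a large range can otherwise dominate. A minimal sketch with made-up data follows; the column names, distributions and the choice of 12 centres are placeholders, not the project's actual inputs.

```r
set.seed(1)
n <- 200
# Hypothetical inputs on very different scales
inputs <- data.frame(
  gas         = rnorm(n, mean = 13000, sd = 3000),
  rooms       = rnorm(n, mean = 5, sd = 1),
  pop_density = rlnorm(n, meanlog = 3, sdlog = 1)
)

# Centre and scale each column so no single variable dominates the distance
inputs_scaled <- scale(inputs)

# k-means on the scaled numeric matrix; nstart reduces sensitivity to the
# random starting centres
km <- kmeans(inputs_scaled, centers = 12, nstart = 25, iter.max = 50)
table(km$cluster)
```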