desialex / corpora_stats


German Über Alles! #13

Open desialex opened 5 years ago

desialex commented 5 years ago

Did you see how the vectors cluster? If you haven't, don't hold your breath, it's not pretty. Here's what scipy plots with method='ward' (none of the other linkage methods is much better):

[dendrogram]

German is a strong outlier, as well as the largest corpus.
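For reference, roughly what I ran to get that plot (the toy vectors below are made up; the real input is the per-language feature matrix from data.txt, and one language is shifted to stand in for the German outlier):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical stand-in for the real corpus vectors.
rng = np.random.default_rng(0)
langs = ["German", "French", "Spanish", "Finnish", "Estonian"]
X = rng.normal(size=(len(langs), 10))
X[0] += 5.0  # push "German" away from everything else

# Same linkage method as in the plot above; pass Z to
# scipy.cluster.hierarchy.dendrogram(Z, labels=langs) to draw it.
Z = linkage(X, method="ward")

# With an outlier, the final merge happens at a much larger
# distance than the first one, which is what makes the plot ugly.
print(Z[-1, 2] > Z[0, 2])
```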

When I try to cluster into a predefined number of classes (17, based on the number of first-level language families we have in the dataset) using sklearn.cluster.AgglomerativeClustering, I get 12 singletons: Latin, Greek, Portuguese, Spanish, German, Romanian, Estonian, Finnish, Arabic, French, Czech, and Ancient Greek. All other languages are (unevenly) distributed among the remaining five classes.
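Counting the singletons is just this (again with fake data; 30 languages and 490 dims are placeholders for the real matrix, and the shifted rows mimic the outlier languages):

```python
from collections import Counter
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical stand-in: 30 "languages", 490 features,
# with a handful of rows shifted to act as outliers.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 490))
X[:5] += 5.0

# Same setup as described above: 17 classes, ward linkage.
labels = AgglomerativeClustering(n_clusters=17, linkage="ward").fit_predict(X)

# Classes containing exactly one language.
sizes = Counter(labels)
singletons = [c for c, n in sizes.items() if n == 1]
print(len(singletons))
```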

I guess data.txt contains pre-normalized values, right? But in any case, some of the distribution parameters (like kurtosis) have pretty ridiculous ranges and we might have to get rid of them.
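One cheap way to find and drop the ridiculous-range parameters, if we go that route (the matrix and the threshold of 100 below are arbitrary illustrations, not values from our data):

```python
import numpy as np

# Hypothetical feature matrix: the last column plays the role of a
# kurtosis-like parameter with an absurd range.
X = np.array([[0.1, 0.3,    2.0],
              [0.2, 0.1,  950.0],
              [0.3, 0.4,   -8.0]])

# Per-feature range across languages.
ranges = X.max(axis=0) - X.min(axis=0)

# Keep only features whose range is below some cutoff
# (100 here is an arbitrary choice, to be tuned on real data).
keep = ranges < 100
X_trimmed = X[:, keep]
print(X_trimmed.shape)
```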

I also saw that you got rid of the pos-rel pairs, which brought the dimensions down to 490. Were those the empty values you were talking about yesterday?

flareau commented 5 years ago

Yes, data.txt is pre-normalized. It’s exactly what’s in the .pickle files. The script vectorize.py loads them and normalizes them before vectorizing. This is where we can play with data we’d want to throw away.
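For anyone reading along, the load-then-normalize step looks roughly like this; I'm not pasting the actual vectorize.py, so the file layout and the dict-of-floats assumption are mine, and the min-max scheme is just one plausible choice:

```python
import pickle
import numpy as np

def load_vectors(paths):
    """Load per-language stat dicts from .pickle files and stack them
    into a matrix, with features in a fixed (sorted-key) order."""
    rows = []
    for p in paths:
        with open(p, "rb") as f:
            stats = pickle.load(f)  # assumed: {feature_name: float}
        rows.append([stats[k] for k in sorted(stats)])
    return np.array(rows)

def normalize(X):
    """Min-max normalize each column to [0, 1] before clustering."""
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # constant columns: avoid division by zero
    return (X - lo) / span
```

This is also the natural place to drop features before vectorizing, since everything passes through here once.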

I did get rid of the pos-rel pairs at some point because they were messy. I can put them back.