desialex / corpora_stats


German Über Alles! #13

Open desialex opened 5 years ago

desialex commented 5 years ago

Did you see how the vectors cluster? If you haven't, don't hold your breath, it's not pretty. Here's what scipy plots with method='ward' (none of the other linkage methods is much better):

[dendrogram]

German is a strong outlier, as well as the largest corpus.
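For reference, roughly what I ran to get that plot (the toy vectors below are made up; the real input is the per-language feature matrix from data.txt, and one language is shifted to stand in for the German outlier):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical stand-in for the real corpus vectors.
rng = np.random.default_rng(0)
langs = ["German", "French", "Spanish", "Finnish", "Estonian"]
X = rng.normal(size=(len(langs), 10))
X[0] += 5.0  # push "German" away from everything else

# Same linkage method as in the plot above; pass Z to
# scipy.cluster.hierarchy.dendrogram(Z, labels=langs) to draw it.
Z = linkage(X, method="ward")

# With an outlier, the final merge happens at a much larger
# distance than the first one, which is what makes the plot ugly.
print(Z[-1, 2] > Z[0, 2])
```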

When I try to cluster into a predefined number of classes (17, based on the number of first-level language families we have in the dataset) using sklearn.cluster.AgglomerativeClustering, I get 12 singletons: Latin, Greek, Portuguese, Spanish, German, Romanian, Estonian, Finnish, Arabic, French, Czech, and Ancient Greek. All other languages are (unevenly) distributed among the remaining five classes.
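Counting the singletons is just this (again with fake data; 30 languages and 490 dims are placeholders for the real matrix, and the shifted rows mimic the outlier languages):

```python
from collections import Counter
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical stand-in: 30 "languages", 490 features,
# with a handful of rows shifted to act as outliers.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 490))
X[:5] += 5.0

# Same setup as described above: 17 classes, ward linkage.
labels = AgglomerativeClustering(n_clusters=17, linkage="ward").fit_predict(X)

# Classes containing exactly one language.
sizes = Counter(labels)
singletons = [c for c, n in sizes.items() if n == 1]
print(len(singletons))
```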

I guess data.txt contains pre-normalized values, right? But in any case, some of the distribution parameters (like kurtosis) have pretty ridiculous ranges and we might have to get rid of them.
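One cheap way to find and drop the ridiculous-range parameters, if we go that route (the matrix and the threshold of 100 below are arbitrary illustrations, not values from our data):

```python
import numpy as np

# Hypothetical feature matrix: the last column plays the role of a
# kurtosis-like parameter with an absurd range.
X = np.array([[0.1, 0.3,    2.0],
              [0.2, 0.1,  950.0],
              [0.3, 0.4,   -8.0]])

# Per-feature range across languages.
ranges = X.max(axis=0) - X.min(axis=0)

# Keep only features whose range is below some cutoff
# (100 here is an arbitrary choice, to be tuned on real data).
keep = ranges < 100
X_trimmed = X[:, keep]
print(X_trimmed.shape)
```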

I also saw that you got rid of the pos-rel pairs, which brought the dimensions down to 490. Were those the empty values you were talking about yesterday?

flareau commented 5 years ago

Yes, data.txt is pre-normalized. It’s exactly what’s in the .pickle files. The script vectorize.py loads them and normalizes them before vectorizing. This is where we can play with data we’d want to throw away.
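For anyone reading along, the load-then-normalize step looks roughly like this; I'm not pasting the actual vectorize.py, so the file layout and the dict-of-floats assumption are mine, and the min-max scheme is just one plausible choice:

```python
import pickle
import numpy as np

def load_vectors(paths):
    """Load per-language stat dicts from .pickle files and stack them
    into a matrix, with features in a fixed (sorted-key) order."""
    rows = []
    for p in paths:
        with open(p, "rb") as f:
            stats = pickle.load(f)  # assumed: {feature_name: float}
        rows.append([stats[k] for k in sorted(stats)])
    return np.array(rows)

def normalize(X):
    """Min-max normalize each column to [0, 1] before clustering."""
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # constant columns: avoid division by zero
    return (X - lo) / span
```

This is also the natural place to drop features before vectorizing, since everything passes through here once.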

I did get rid of the pos-rel pairs at some point because they were messy. I can put them back.