davidmasp / data-visualization

This is a repository to host data visualizations

language signatures #2

Open davidmasp opened 1 year ago

davidmasp commented 1 year ago

The idea is to build a classification algorithm for languages based on the frequency of occurrence of a given character set. This would work similarly to mutational signatures.

The plan is to download all of Wikipedia in different languages and build a simple classifier.
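A rough sketch of what such a classifier could look like. This is only an illustration under assumptions: the "signature" here is the normalized frequency of each alphabetic character, and classification picks the language whose signature has the highest cosine similarity to the query text. The function names (`char_signature`, `cosine_similarity`, `classify`) and the toy sentences are hypothetical; real signatures would be built from the Wikipedia text.

```python
from collections import Counter
import math


def char_signature(text):
    """Return the relative frequency of each alphabetic character in text."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse frequency vectors (dicts)."""
    dot = sum(v * b.get(ch, 0.0) for ch, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, signatures):
    """Return the language whose signature is closest to the text's signature."""
    query = char_signature(text)
    return max(signatures, key=lambda lang: cosine_similarity(query, signatures[lang]))


# Toy training samples (assumption); real signatures would come from Wikipedia.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and then sleeps",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y después duerme",
}
signatures = {lang: char_signature(text) for lang, text in samples.items()}
print(classify("the dog sleeps", signatures))
```

Single-character frequencies are the simplest possible choice; character n-grams would probably separate closely related languages better, at the cost of larger signatures.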

Here there seems to be a list of compressed XML Wikipedia dumps that include an index that can be used to jump to specific pages, etc. Here is how to use the compressed data.
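A minimal sketch of how the index could be used, assuming the dumps are the bz2 "multistream" kind, where each index line looks like `offset:page_id:title` and `offset` is the byte position of an independent bz2 stream inside the dump. That layout is an assumption, not something confirmed in this thread, and `parse_index_line` / `read_stream` are hypothetical helper names. The demo builds a tiny fake dump in memory instead of touching a real file.

```python
import bz2
import io


def parse_index_line(line):
    """Split an index line of the assumed form 'offset:page_id:title'."""
    # Titles may contain colons, so split at most twice.
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def read_stream(fileobj, offset, chunk_size=64 * 1024):
    """Decompress the single bz2 stream that starts at the given byte offset."""
    fileobj.seek(offset)
    decompressor = bz2.BZ2Decompressor()
    parts = []
    # The decompressor stops at the end of the first stream it sees.
    while not decompressor.eof:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        parts.append(decompressor.decompress(chunk))
    return b"".join(parts).decode("utf-8")


# Fake two-stream dump (assumption: real dumps concatenate streams like this).
first = bz2.compress("<page>A</page>".encode("utf-8"))
second = bz2.compress("<page>B</page>".encode("utf-8"))
dump = io.BytesIO(first + second)

offset, page_id, title = parse_index_line(f"{len(first)}:2:Page B")
print(title, read_stream(dump, offset))
```

This way only the streams containing the wanted pages need to be decompressed, not the whole dump.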

davidmasp commented 1 year ago

It would be cool to combine this with the translation tool from Facebook.

See here