The idea is to build a classification algorithm for languages based on the frequency of occurrence of each character in a given character set. This would work similarly to mutational signatures: each language would have its own characteristic frequency profile.
The plan is to download Wikipedia in different languages and build a simple classifier from that text.
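A minimal sketch of what such a classifier could look like, assuming each language is represented by the relative frequency of its alphabetic characters and a text is assigned to the language whose profile is closest by cosine similarity. The function names, the similarity measure, and the toy training sentences are all assumptions made for illustration; real profiles would be built from the Wikipedia text.

```python
from collections import Counter
import math


def char_profile(text):
    """Return the relative frequency of each alphabetic character (lowercased)."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(v * q.get(ch, 0.0) for ch, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def classify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    query = char_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(query, profiles[lang]))


# Toy training text (assumption: stands in for per-language Wikipedia text).
samples = {
    "en": "the quick brown fox jumps over the lazy dog and runs through the whole town",
    "de": "der schnelle braune fuchs springt über den faulen hund und läuft durch die ganze stadt",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y corre por toda la ciudad",
}
profiles = {lang: char_profile(text) for lang, text in samples.items()}

print(classify("wo ist die nächste straße", profiles))
```

With so little training text the result on new sentences is unreliable; the sketch only shows the shape of the approach. Single-character frequencies could later be replaced by character n-grams without changing the overall structure.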
There appears to be a list of compressed XML Wikipedia dumps that include an index, which can be used to jump to specific pages. The aim is to work with the compressed data directly.
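A rough sketch of how the index could be used without decompressing the whole file. It assumes the dumps in question are the "multistream" bz2 dumps, where the data file is a concatenation of independent bz2 streams and each index line has the form `offset:page_id:title`, with `offset` being the byte position of the stream that contains the page. That layout, the function names, and the file names in the usage note are assumptions, not something confirmed above.

```python
import bz2


def parse_index_line(line):
    """Split one index line into (byte offset, page id, title).

    Titles may themselves contain colons, so split at most twice.
    """
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def read_stream(dump_path, offset, chunk_size=64 * 1024):
    """Decompress the single bz2 stream that starts at `offset` in the dump."""
    decompressor = bz2.BZ2Decompressor()
    parts = []
    with open(dump_path, "rb") as fh:
        fh.seek(offset)
        # Stop as soon as this stream ends; later streams are left untouched.
        while not decompressor.eof:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            parts.append(decompressor.decompress(chunk))
    return b"".join(parts).decode("utf-8")
```

Usage would look roughly like `read_stream("dump.xml.bz2", offset)` with an `offset` taken from `parse_index_line`, where `dump.xml.bz2` is a placeholder file name. The returned text is an XML fragment holding the pages of that stream, so the wanted page still has to be located within it by title or id.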