Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Mindmapping micromeanings #2409

Open blackslate opened 4 years ago

blackslate commented 4 years ago

Let's benefit from the Tatoeba corpus and community to create a mindmap of how different languages express meaning.

Overview of the issue:

Different words in different languages can have multiple translations. Certain translations may be appropriate in certain circumstances, but wrong in others.

Example: good ≠ bien

"How are you?" "Good." « Comment vas-tu ? » « Bien. »

"I'm going to be late." "OK..." « Je vais être en retard. » « Bien... »

While "good" is a perfect translation for "bien" in some circumstances, it is inappropriate in others. "Good" implies encouragement, while "bien" can imply mere acceptance.

The same word can have different meanings which do not precisely overlap.

"durable goods" « des biens durables »

"household goods" « articles ménagers »

"common welfare" « le bien public »

In a similar vein, different languages have different names for different colours. Some languages group a range of colours under a single name, others have more precise names for subsets of the same range of colours.

Words that appear to be perfectly synonymous like "taxi" and "cab" are used differently in practice. When you stand on the side of the street to call a cab, you shout "Taxi!"

Metaphor

The totality of all words in all languages defines a multidimensional meaning space of all the concepts that the human brain has (so far) been able to distinguish. Each language is a fractal map of this meaning space. Each word in a given language is defined by its opposition to the other words in that language. Each word defines a meaning bubble around itself, enclosing a number of concepts and excluding others. When speaking in a given language, we choose the word that is closest to the meaning we want to express, and it brings with it all the other meanings found within its bubble.

Hypothesis

It should be possible:

  1. To create a node for each concept that is expressible in at least one language
  2. To connect each word in any given language to all the concept-nodes that it can be used to express.

The result would be a multidimensional map (mindmap) of the conceptual basis for each language, and of the strengths and weaknesses of different languages to express particular concepts.
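The two steps above can be sketched as a small in-memory structure. This is only an illustration under stated assumptions: the concept names (`positive-state`, `mere-acceptance`, `welfare`) and the function names are invented for this sketch, and the links are taken from the good/bien examples earlier in this issue.

```python
from collections import defaultdict

# Hypothetical concept identifiers, invented for this sketch.
# Each ((language, word), concept) entry says the word can express the concept.
LINKS = [
    (("eng", "good"), "positive-state"),
    (("fra", "bien"), "positive-state"),
    (("eng", "OK"), "mere-acceptance"),
    (("fra", "bien"), "mere-acceptance"),
    (("eng", "welfare"), "welfare"),
    (("fra", "bien"), "welfare"),
]


def build_maps(links):
    """Index the links in both directions: word -> concepts, concept -> words."""
    word_to_concepts = defaultdict(set)
    concept_to_words = defaultdict(set)
    for word, concept in links:
        word_to_concepts[word].add(concept)
        concept_to_words[concept].add(word)
    return word_to_concepts, concept_to_words


def translations(word, target_lang, word_to_concepts, concept_to_words):
    """Group candidate translations of `word` by the concept they share."""
    result = {}
    for concept in sorted(word_to_concepts.get(word, ())):
        result[concept] = sorted(
            w for (lang, w) in concept_to_words[concept] if lang == target_lang
        )
    return result


word_to_concepts, concept_to_words = build_maps(LINKS)
by_concept = translations(("fra", "bien"), "eng", word_to_concepts, concept_to_words)
for concept, words in by_concept.items():
    print(concept, words)
```

With these toy links, « bien » maps to a different English word for each concept it can express, which is the "good ≠ bien" point made above.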

Benefits

Practical considerations

This kind of multidimensional map could be stored in a graph database like Neo4j. This allows you to create relationships between nodes, where each relationship has a direction and can have custom properties.
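To make "a direction and custom properties" concrete, here is a minimal sketch of a property graph in plain Python. It is not Neo4j's API; the class names, the `EXPRESSES` relationship type, the concept keys and the `context` property are all assumptions made up for illustration, using the taxi/cab example from above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    label: str  # "Word" or "Concept" (hypothetical labels)
    key: str


@dataclass
class Relationship:
    start: Node  # relationships are directed: start -> end
    end: Node
    type: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.relationships = []

    def relate(self, start, rel_type, end, **properties):
        """Create a directed relationship carrying arbitrary properties."""
        rel = Relationship(start, end, rel_type, properties)
        self.relationships.append(rel)
        return rel

    def outgoing(self, node, rel_type=None):
        """Return relationships that start at `node`, optionally filtered by type."""
        return [
            r for r in self.relationships
            if r.start == node and (rel_type is None or r.type == rel_type)
        ]


graph = PropertyGraph()
taxi = Node("Word", "eng:taxi")
cab = Node("Word", "eng:cab")
hired_car = Node("Concept", "hired-car")
hailing = Node("Concept", "call-to-hail-a-car")

graph.relate(taxi, "EXPRESSES", hired_car)
graph.relate(cab, "EXPRESSES", hired_car)
# Only "taxi" is linked to the shouted call; the property is illustrative.
graph.relate(taxi, "EXPRESSES", hailing, context="shouted from the kerb")

for rel in graph.outgoing(taxi, "EXPRESSES"):
    print(rel.start.key, "->", rel.end.key, rel.properties)
```

In a real graph database the same shape would be stored as nodes and typed relationships; the sketch only shows why direction and per-relationship properties are useful for this kind of map.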

This work would require the collaboration of a great number of bilingual people. The Tatoeba community seems an ideal place to find people who could benefit from participating in such a project.

The most common words have the greatest complexity of meaning.

Functional words (prepositions, conjunctions, modal verbs, ...) might be very difficult.

Dialects of a given language may map certain words differently.

Each contributor could start by:

As soon as there is a sufficient number of entries, other users can start to review them, with possible actions being:

LBeaudoux commented 4 years ago

I like the idea of adding a semantic layer to Tatoeba. Besides, we don't have to build a semantic graph from scratch since a multilingual one named BabelNet already exists.

Tagging sentences with synsets would enable the classification of the Tatoeba search results by meaning. For a given searched word, it would also help us to identify the meanings that are poorly covered.
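As a rough sketch of that classification, assume each search result for « bien » already carries a sense tag. The sense labels below are placeholders invented for this example, not real BabelNet synset IDs, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy search results for "bien": (sentence, hypothetical sense label).
TAGGED = [
    ("« Comment vas-tu ? » « Bien. »", "well"),
    ("« Je vais être en retard. » « Bien... »", "acceptance"),
    ("le bien public", "welfare"),
]

# Senses we assume the word has; "property" has no example sentence yet.
KNOWN_SENSES = ["well", "acceptance", "welfare", "property"]


def group_by_sense(tagged):
    """Classify search results by the sense they were tagged with."""
    groups = defaultdict(list)
    for sentence, sense in tagged:
        groups[sense].append(sentence)
    return dict(groups)


def poorly_covered(groups, known_senses, min_examples=1):
    """List the senses that have fewer than `min_examples` sentences."""
    return [s for s in known_senses if len(groups.get(s, [])) < min_examples]


groups = group_by_sense(TAGGED)
missing = poorly_covered(groups, KNOWN_SENSES)
print(missing)
```

Here the only sense without an example is `property`, so that is the one contributors would be asked to cover.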

But tagging tasks can be quite boring. The semantic tagging workflow would have to be seamless and enjoyable for contributors. For example, relying on a tool such as Babelfy would make it possible to suggest synsets to contributors who could then validate or correct them.

jiru commented 4 years ago

There is also this semantic graph for French from the CRISCO research laboratory. It is used as a basis for their synonyms dictionary.

@blackslate Your idea is amazing. But I think it’s probably way out of the scope of the Tatoeba project, whose aim is to build a sentence dictionary. We are already so busy just with that.

That said, if the mindmap were to be a separate project, maintained by another team, I’m quite sure Tatoeba could collaborate in one way or another. The first use case that comes to my mind is simple corpus analysis. If you see 10 sentences in language A, each containing the word a, and each translated into sentences containing the word b in language B, there is a very high probability that a link between nodes A(a) and B(b) should be created. And the reverse too: if there is no such link in Tatoeba, we could suggest adding sentences containing A(a) or B(b), with their translations, to Tatoeba. Anyway, there could be some synergy between a sentence-level project like Tatoeba and a word-level project like what you’re describing.
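The corpus analysis described above could look roughly like this. It is a naive sketch, not an existing Tatoeba feature: the sentence pairs are made up for the example, the tokenizer is a plain regex, and the thresholds `min_pairs` and `min_ratio` are arbitrary assumptions.

```python
import re
from collections import Counter

# Made-up sentence pairs (language A, language B), for illustration only.
PAIRS = [
    ("The taxi is late.", "Le taxi est en retard."),
    ("I am late.", "Je suis en retard."),
    ("She is late again.", "Elle est encore en retard."),
]


def tokenize(sentence):
    """Lowercase the sentence and return its set of distinct words."""
    return set(re.findall(r"\w+", sentence.lower()))


def suggest_links(pairs, min_pairs=3, min_ratio=1.0):
    """Suggest (a, b) links when b appears in the translations of enough
    of the sentences that contain a."""
    source_counts = Counter()  # number of source sentences containing a
    pair_counts = Counter()    # number of pairs where a and b co-occur
    for source, target in pairs:
        target_words = tokenize(target)
        for a in tokenize(source):
            source_counts[a] += 1
            for b in target_words:
                pair_counts[(a, b)] += 1
    return sorted(
        (a, b)
        for (a, b), n in pair_counts.items()
        if n >= min_pairs and n / source_counts[a] >= min_ratio
    )


suggestions = suggest_links(PAIRS)
print(suggestions)
```

On this toy data "late" is paired with both "en" and "retard", since they always appear together. Raw co-occurrence therefore only produces candidates; a human reviewer or a proper word-alignment method would still have to confirm each link.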

Your idea sounds too good to be new. Aren’t there any similar initiatives out there already?

Your idea looks like a multilingual dictionary put into a graph database. The folks from Wikidata recently imported Wiktionary into their graph database, so I think they did more or less what you said. But I’m not so familiar with Wikidata; I can only give you a link to the word "mother" in English. I don’t think anything can beat Wikimedia nowadays when it comes to building commons, so using Wikidata as a base might be the best option (you can even query their database in SPARQL).