Mindmapping micromeanings

blackslate commented 4 years ago

Let's benefit from the Tatoeba corpus and community to create a mindmap of how different languages express meaning.

Overview of the issue:

Different words in different languages can have multiple translations. Certain translations may be appropriate in certain circumstances, but wrong in others.

Example: good ≠ bien

"How are you?" "Good." « Comment vas-tu ? » «Bien.»

"I'm going to be late." "OK..." « Je vais être en retard. » « Bien... »

While "good" is a perfect translation for "bien" in some circumstances, it is inappropriate in others. "Good" implies encouragement, while "bien" can imply mere acceptance.

The same word can have different meanings which do not precisely overlap.

"durable goods" « des biens durables »

"household goods" « articles ménagers »

"common welfare" « le bien public »

In a similar vein, different languages have different names for different colours. Some languages group a range of colours under a single name, others have more precise names for subsets of the same range of colours.

Words that appear to be perfectly synonymous like "taxi" and "cab" are used differently in practice. When you stand on the side of the street to call a cab, you shout "Taxi!"

Metaphor

The totality of all words in all languages defines a multidimensional meaning space of all the concepts that the human brain has (so far) been able to distinguish. Each language is a fractal map of this meaning space. Each word in a given language is defined by its opposition to the other words in that language. Each word defines a meaning bubble around itself, enclosing a number of concepts and excluding others. When speaking in a given language, we choose the word that is closest to the meaning we want to express, and it brings with it all the other meanings found within its bubble.

Hypothesis

It should be possible:

To create a node for each concept that is expressible in at least one language
To connect each word in any given language to all the concept-nodes that it can be used to express.

The result would be a multidimensional map (mindmap) of the conceptual basis for each language, and of the strengths and weaknesses of different languages to express particular concepts.
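
As a rough illustration of the two steps above, here is a minimal in-memory sketch. The class name `MeaningMap`, the method names, and the concept ids such as `state:well` are all made up for this example; they are not part of Tatoeba or of any existing tool.

```python
from collections import defaultdict


class MeaningMap:
    """Bipartite graph linking (language, word) pairs to concept-node ids."""

    def __init__(self):
        self.concepts_of = defaultdict(set)  # (lang, word) -> concept ids
        self.words_of = defaultdict(set)     # concept id -> (lang, word) pairs

    def link(self, lang, word, concept_id):
        """Record that `word` in `lang` can express the concept."""
        self.concepts_of[(lang, word)].add(concept_id)
        self.words_of[concept_id].add((lang, word))

    def translations(self, lang, word, target_lang):
        """Map each target-language word to the concepts it shares with `word`."""
        shared = defaultdict(set)
        for concept_id in self.concepts_of.get((lang, word), set()):
            for other_lang, other_word in self.words_of[concept_id]:
                if other_lang == target_lang:
                    shared[other_word].add(concept_id)
        return dict(shared)


# Hypothetical concept ids based on the good/bien example.
m = MeaningMap()
m.link("eng", "good", "state:well")
m.link("fra", "bien", "state:well")
m.link("fra", "bien", "reply:acceptance")
m.link("eng", "OK", "reply:acceptance")

print(m.translations("fra", "bien", "eng"))
```

Looking up "bien" returns both "good" and "OK", each tagged with the concept it shares, which is the kind of distinction a flat word-to-word dictionary loses.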

Benefits

Language learners could explore the subtle differences of meaning of words that they are learning
Machine translators could explore the mindmap to find the word whose "bubble" most closely corresponds to a given word in the source language
Writers could choose the nodes that colour a particular thought that they want to express, and then consider the words that have the shortest distance from those nodes
The importance of minority languages which are particularly rich in expressions in a given area (e.g. behaviour of animals in the Kalahari desert) could be objectively measured
Each new word that appears in a language will have its place; clusters of such new words would indicate the emergence of trending ideas
Linguists could use the mindmap to explore the merits of linguistic determinism
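
One possible way to compare "bubbles", sketched here under assumptions: treat each bubble as a set of concept ids and score the overlap with the Jaccard index. Jaccard is just one candidate measure chosen for the illustration, and the concept ids and word-to-concept assignments below are invented, not taken from real data.

```python
def jaccard(a, b):
    """Overlap of two concept sets: |a ∩ b| / |a ∪ b| (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def closest_word(source_bubble, candidates):
    """Return the candidate word whose bubble overlaps most with the source."""
    return max(candidates, key=lambda w: jaccard(source_bubble, candidates[w]))


# Invented bubbles, for illustration only.
bien = {"state:well", "reply:acceptance", "manner:properly"}
english = {
    "good": {"state:well", "quality:high"},
    "well": {"state:well", "manner:properly"},
    "OK": {"reply:acceptance"},
}

best = closest_word(bien, english)
```

A set-overlap score ignores how often each sense is actually used, so a real system would probably want weighted links rather than plain sets.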

Practical considerations

This kind of multidimensional map could be stored in a graph database like Neo4j. This allows you to create relationships between nodes, where each relationship has a direction and can have custom properties.
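
To make "a direction and custom properties" concrete, here is a small sketch that only builds a parameterised Cypher query as a string; it does not connect to a database. The labels `Word` and `Concept`, the relationship type `EXPRESSES`, and the `register` property are assumptions made for this example, not an agreed schema.

```python
def link_word_to_concept(lang, text, concept_id, **props):
    """Return a (query, parameters) pair; nothing is sent to a database here."""
    query = (
        "MERGE (w:Word {lang: $lang, text: $text})\n"
        "MERGE (c:Concept {id: $concept_id})\n"
        "MERGE (w)-[r:EXPRESSES]->(c)\n"  # directed: word -> concept
        "SET r += $props\n"               # custom properties on the relationship
        "RETURN r"
    )
    params = {"lang": lang, "text": text, "concept_id": concept_id, "props": props}
    return query, params


query, params = link_word_to_concept(
    "fra", "bien", "reply:acceptance", register="neutral"
)
```

The resulting pair could then be handed to a Neo4j client, for instance a session's `run` method in the official Python driver.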

This work would require the collaboration of a great number of bilingual people. The Tatoeba community seems an ideal place to find people who could benefit from participating in such a project.

The most common words have the greatest complexity of meaning.

Functional words (prepositions, conjunctions, modal verbs, ...) might be very difficult.

Dialects of a given language may map certain words differently.

Each contributor could start by:

Choosing one of the commonest words in their native language
Creating a node with a unique id for each of this word's most common meanings
Providing example sentences for each particular meaning
Creating relationships between these new nodes and any other words (in the same or other languages) that can be used to express this exact meaning
Creating relationships between each new node and other nodes that have a related (similar, opposite) meaning or usage.
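
The contributor steps above could look roughly like this in code. This is only a sketch: `ConceptNode`, `relate`, the field names, and the English glosses are hypothetical, and the example sentences are the ones from the overview.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ConceptNode:
    gloss: str                                      # short description of the meaning
    node_id: str = field(default_factory=lambda: uuid.uuid4().hex)  # unique id
    examples: list = field(default_factory=list)    # (lang, sentence) pairs
    words: set = field(default_factory=set)         # (lang, word) pairs
    related: dict = field(default_factory=dict)     # other node_id -> relation kind


def relate(a, b, kind):
    """Record a symmetric relation (e.g. 'similar', 'opposite') between two nodes."""
    a.related[b.node_id] = kind
    b.related[a.node_id] = kind


# One node per common meaning of "bien", each with an example sentence.
well = ConceptNode("in good health or spirits")
well.examples.append(("fra", "« Comment vas-tu ? » « Bien. »"))
well.words.update({("fra", "bien"), ("eng", "good")})

accept = ConceptNode("mere acceptance of what was said")
accept.examples.append(("fra", "« Je vais être en retard. » « Bien... »"))
accept.words.update({("fra", "bien"), ("eng", "OK")})

relate(well, accept, "similar")
```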

As soon as there is a sufficient number of entries, other users can start to review them, with possible actions being:

Adding new links to other languages. As in the "good/bien" example above, this might lead to an understanding that there are more subtleties of meaning that need to be defined.
Splitting an existing node into two or more meanings, and dividing the examples accordingly.
Adding new examples
Adding new relationships with other nodes
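
Of the review actions above, splitting a node is the least obvious, so here is a sketch of one way it might work. The dictionary layout and the `split_node` function are assumptions for illustration; the rule that every example must end up in exactly one new node is a design choice of this sketch, not something the proposal specifies.

```python
import uuid


def split_node(node, groups):
    """Split `node` into one new node per gloss.

    `groups` maps each new gloss to a list of indices into node["examples"].
    Every example must be assigned to exactly one new node.
    """
    assigned = sorted(i for idxs in groups.values() for i in idxs)
    if assigned != list(range(len(node["examples"]))):
        raise ValueError("every example must go to exactly one new node")
    return [
        {
            "id": uuid.uuid4().hex,
            "gloss": gloss,
            "examples": [node["examples"][i] for i in idxs],
            # Word links are copied to every new node; reviewers prune them later.
            "words": set(node["words"]),
        }
        for gloss, idxs in groups.items()
    ]


bien = {
    "id": uuid.uuid4().hex,
    "gloss": "good",
    "examples": [
        "« Comment vas-tu ? » « Bien. »",
        "« Je vais être en retard. » « Bien... »",
    ],
    "words": {("fra", "bien"), ("eng", "good")},
}

parts = split_node(bien, {"feeling well": [0], "mere acceptance": [1]})
```

After the split, a reviewer would remove ("eng", "good") from the "mere acceptance" node and link "OK" instead.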

LBeaudoux commented 4 years ago

I like the idea of adding a semantic layer to Tatoeba. Besides, we don't have to build a semantic graph from scratch since a multilingual one named BabelNet already exists.

Tagging sentences with synsets would enable the classification of the Tatoeba search results by meaning. For a given searched word, it would also help us to identify the meanings that are poorly covered.

But tagging tasks can be quite boring. The semantic tagging workflow would have to be seamless and enjoyable for contributors. For example, relying on a tool such as Babelfy would make it possible to suggest synsets to contributors who could then validate or correct them.

jiru commented 4 years ago

There is also this semantic graph for French from the CRISCO research laboratory. It is used as a basis for their synonyms dictionary.

@blackslate Your idea is amazing. But I think it’s probably way out of the scope of the Tatoeba project, whose aim is to build a sentence dictionary. We are already so busy just with that.

That said, if the mindmap was to be a separate project, maintained by another team, I’m quite sure Tatoeba could collaborate in one way or another. The first use case that comes to my mind is simple corpus analysis. If you see 10 sentences in language A, each containing the word a, and each translated into sentences containing the word b in language B, there is a very high probability that a link between nodes A(a) and B(b) should be created. And the reverse too: if there is no such link in Tatoeba, we could suggest translating any existing A(a) or B(b) into Tatoeba. Anyway, there could be some synergy between a sentence-level project like Tatoeba and a word-level project like what you’re describing.
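
A rough sketch of that corpus analysis, assuming sentence pairs are available as plain (source, translation) tuples. The function names and the naive regex tokenizer are placeholders; real Tatoeba data would need proper per-language tokenization. The default threshold of 10 comes from the example above.

```python
import re


def tokens(sentence):
    """Very naive tokenizer: lowercase runs of word characters."""
    return set(re.findall(r"\w+", sentence.lower()))


def cooccurrence(pairs, a, b):
    """Count pairs whose source contains word `a` and whose translation contains `b`."""
    return sum(1 for src, tgt in pairs if a in tokens(src) and b in tokens(tgt))


def suggest_link(pairs, a, b, min_count=10):
    """Suggest an A(a)-B(b) link once enough translated pairs support it."""
    return cooccurrence(pairs, a, b) >= min_count


pairs = [
    ("« Comment vas-tu ? » « Bien. »", '"How are you?" "Good."'),
    ("« Je vais être en retard. » « Bien... »", '"I\'m going to be late." "OK..."'),
]

count_good = cooccurrence(pairs, "bien", "good")
count_ok = cooccurrence(pairs, "bien", "ok")
```

Raw counts like these would favour very frequent words, so some normalisation by word frequency would probably be needed before suggesting links automatically.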

Your idea sounds too good to be new. Aren’t there any similar initiatives out there already?

Your idea looks like a multilingual dictionary put into a graph database. The folks from Wikidata recently imported the Wiktionary into their graph database, so I think they did more or less what you said. But I’m not so familiar with Wikidata, I can only give you a link to the word mother in English. I don’t think anything can beat Wikimedia nowadays when it comes to building commons, so using Wikidata as a base might be the best option (you can even query their database in SPARQL).
