Closed: hanaldo closed this issue 8 years ago.
Temporarily for how long? And why temporarily? Why not option '1' now?
Why are you leaning toward '2'?
We can expect that Chinese speakers will have their own separate data. My feeling is that if the data will ultimately have different forms for different directions (X->Y, Y->X) then we should begin training the program to deal with this structure.
Let me know your thoughts.
For the second approach, we can use it until we mine a real Chinese->English dictionary/source. And because we haven't solved issue https://github.com/Marc-Bogonovich/Openwords/issues/55 yet, this will be just a temporary solution.
For the first approach, we are only able to make IDENTICAL/REDUNDANT data now. For example, suppose we currently have only one connection, 1(w1)->2(w2). To follow the first approach, we would insert a new record, 2(w1)->1(w2), which is IDENTICAL to 1(w1)->2(w2) and accomplishes nothing. This procedure would also double the entire "word_connections" table (i.e. "word_connections * 2") with redundant data that has no use now or in the future (since, as we decided previously, we will mainly update data by inserting new records, not by modifying existing ones).
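To make the redundancy concrete, here is a minimal sketch of what the option-1 mirrored insert does. It uses Python with an in-memory SQLite table purely as an illustration; the real `word_connections` table has more columns, and only the `word1_id`/`word2_id` names come from this discussion.

```python
# Hypothetical minimal sketch of the word_connections table and the
# option-1 "mirrored insert". Only word1_id/word2_id come from the
# discussion; everything else here is an assumption for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE word_connections (
        id INTEGER PRIMARY KEY,
        word1_id INTEGER,   -- e.g. the English word
        word2_id INTEGER    -- e.g. the Chinese word
    )
""")
# The existing mined record: 1(w1)->2(w2)
conn.execute("INSERT INTO word_connections (word1_id, word2_id) VALUES (1, 2)")

# Option 1: insert the mirrored record 2(w1)->1(w2) for every existing row,
# doubling the table with rows that carry no new information.
conn.execute("""
    INSERT INTO word_connections (word1_id, word2_id)
    SELECT word2_id, word1_id FROM word_connections
""")

rows = conn.execute(
    "SELECT word1_id, word2_id FROM word_connections ORDER BY id"
).fetchall()
print(rows)  # [(1, 2), (2, 1)] -- the second row mirrors the first
```

Every mirrored row carries exactly the information of the row it was copied from; only the id and the column order differ.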
"Chinese speakers will have their own separate data" describes just a subset of the entire word_connections data, so the second approach does not conflict with this requirement. Neither the first nor the second approach modifies the current data structure. The only additional task in the second approach is to write a separate query (only one line of code) while keeping the current data and data structure intact, so there is no harm at all in taking the second approach.
In addition, at the last meeting your answer to this question {3) If somehow we know the meaning connections "fly=飞" and "fly=fliegen", then does it mean "飞=fliegen"?} was "YES". In other words, if fly=飞 and fly=fliegen, then 飞=fliegen is implied, and so is fliegen=飞, hence (飞=fliegen)==(fliegen=飞). This is a supporting argument for the second approach.
Shenshen, thank you for the detailed notes. I'm still leaning toward '1'
Let me explain my reasoning. There is a fairly good biological analogy here that I will describe. The harm in the second approach is very subtle, because it exists not in the present, but in the future.
1) Just because the data are identical now doesn't mean they will remain identical. The data must be prepared to be identified by users/teachers as potential problem records (as separate records).
Tetrapods (you, me, my niece, salamanders, frogs, and elephants) have arms and legs. The arms and legs did not spontaneously begin to diverge. Initially, the controlling DNA was the same: a change in the DNA led to the same change in both arm and leg (which would be called fore-limb and hind-limb). But then the DNA was duplicated (made redundant). Initially, the two limb types were still similar (identical); they simply now had the capacity to acquire individual structure. The issue here is very similar: the initial existence of forelimbs and hind-limbs was due to a repetition event, followed by a subfunctionalization event (separating the information/control over the repeated parts), followed by differentiation only if and when necessary.
To say the records should remain bounded together is like saying (~500 million years ago) that the genetic control of forelimbs and hind-limbs should remain the same because the Tetrapods we see around us have identical forelimbs and hind-limbs.
We will initially create forelimbs and hind-limbs, and they will be simple copies. That is due to our current mining process. Much of the differentiation of the db might instead be based not on mining but on slow evolution. We cannot make non-identical data a requirement for their separate housing; the separate housing/control is what facilitates (allows) their differentiation.
2) The data aren't actually the same now (or will need to be different very soon). I'm not sure we can say the data will be the same right now. Two attributes of the word connections table might be simple flips. And there are currently clarification and definition records associated with the L1 and L2 words that are connected to the words table, not the word connections table. But there will imminently be clarifications/notes attached to the connection itself. These notes are currently (conceptually) in the connection tags table, because their nature will depend on empirical observations of the connections as contributors like Anahit work through the data. These records will be in the L1, about the L2/L1 connection - so very direction-specific.
I think an important issue is the following: We don't know exactly how LA->LB vs. LB->LA will be asymmetrical. We can't anticipate it well, as it is sort of an empirical issue. Anahit is about half-way through the Farsi words. That process will give us some insight.
3) Is it really one line of code? Maybe it is just one line of code, but wouldn't it be of equivalent simplicity (as a task) to just make duplicate records?
4) There is a general principle involved as well related to commitments about reality. An asymmetrical structure "is saying less" about language's structure. While a symmetrical structure "is saying more" about language's structure. Thus, the asymmetrical structure is more likely to be correct, by not being incorrect.
I understand that our mining is 'implying' that symmetry exists, but this is a 'practicality' to quickly solve the problem about getting data into the database, not a commitment we are making about the structure of languages.
*Digression: To this day, parts of the arms and legs remain identical in almost all Tetrapods (almost all have 5-digits/5-digits, though salamanders are 4-digits/5-digits illustrating the ability to be different). Other lineages have very different arms and legs (e.g. Primates).
Well, Marc, after reviewing your explanation I'd say I am leaning even more toward option 2. Here are my reasons: 1) I cannot argue with the biological analogy, and I also cannot argue that every object in this universe is unique (as far as I know), but it is a totally different case when you deal with logic and computer software. We humans make abstractions and assumptions to solve logic problems; if you say everything is unique, then “1+1” will never equal “2”, because in a lot of cases “1+1=1”. But we have to assume that “1=1” in order to deduce “1+1=2” and solve logical and mathematical problems.
It is the same as how you and I are both human: we are identical in terms of biological species, but we are different people. We cannot neglect the concept of “human” just because we are different people. Also, the concept of “human” does not conflict with the concept of “we are different people”, because “different people” is a subset of “human”. Likewise, “uni-directional” is a subset of “bi-directional”, so there is no harm in assuming there is a class called “bi-directional”, just as there is a class called “human”.
2) The data for “LA->LB vs. LB->LA” could be asymmetrical now or in the future, but option 1 is just inserting the same record twice. In other words, I am not saying that the data we are going to insert is definitely redundant; I am saying that the programming effort we are going to make is definitely redundant, because we can clearly see that even if we later mine a real Chinese->English dictionary and insert asymmetric word connections into the “word_connections” table, we WILL insert NEW data records instead of modifying the existing ones. So there is no need to insert the dummy asymmetric word connections now.
3) Yes, it will be just one line of extra code. Right now we are querying (English, Chinese) to get word connections; please see the current query code written by Mayukh (https://github.com/Marc-Bogonovich/Openwords/blob/master/openwords_server/WordsDB/class/mysql/ext/WordConnectionsMySqlExtDAO.class.php#L12). To get word connections for Chinese learners for now, I just need to write a separate query that swaps the positions of English and Chinese (and likewise the positions of word1_id and word2_id), i.e. (Chinese, English). That is the entire programming effort we need for option 2.
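The "one extra query" idea can be sketched as follows. This is a Python/SQLite illustration over a simplified two-column table; the real query in WordConnectionsMySqlExtDAO.class.php works against the full MySQL schema, so the table and values here are assumptions.

```python
# Sketch of the option-2 "swap the columns in the query" idea, using an
# assumed two-column simplification of word_connections.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_connections (word1_id INTEGER, word2_id INTEGER)")
# One mined record: English word 1 -> Chinese word 2
conn.execute("INSERT INTO word_connections VALUES (1, 2)")

# Current query for English learners: find the Chinese word for English word 1.
forward = conn.execute(
    "SELECT word2_id FROM word_connections WHERE word1_id = ?", (1,)
).fetchall()

# Option 2: the single extra query for Chinese learners simply swaps the
# column roles, reusing the same data with no new records inserted.
backward = conn.execute(
    "SELECT word1_id FROM word_connections WHERE word2_id = ?", (2,)
).fetchall()

print(forward, backward)  # [(2,)] [(1,)]
```

Both directions are served by the same single stored row; only the query differs.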
By contrast, for option 1 we need to write a client program and a server program to request and write back the connection records. Also, we currently have 469593 records in the “word_connections” table, so I would rather not insert another 469593 records of dummy data while knowing that in the future we will insert another 469593 (or more) records of real data when we mine Chinese->English, German->English, Japanese->English, Spanish->English… dictionaries.
4) “Assuming the word connections are bi-directional” is not a denial of reality; we are only making abstractions that reflect reality from different perspectives. Whether reality is asymmetric or symmetric, the two views do not conflict with each other: two asymmetric relations can be combined into one symmetric relation, and one symmetric relation can be dissected into two asymmetric relations. But the redundant and unnecessary effort of making asymmetric data is a lot greater than the effort of assuming we already have a pair of asymmetric data.
In the meantime, please give us an example of a pair of asymmetric word_connection records; it is always better to discuss a concrete example rather than making too many analogies.
In addition, we have not yet decided on the normalization of the "word_connections" table. This is another reason why inserting dummy asymmetric word_connections data now would be redundant effort.
There's a third option that comes to mind: Add a 'directionality' column to the db. This can have three different values, corresponding roughly to "bidirection", "left-to-right", or "right-to-left". For now, it could be assumed that everything is bidirectional, but in the future, individual entries could be tweaked.
I think it solves both issues, i.e. no data needs to be duplicated, and no code will need to be written that will need to be changed in the future (though the code could for now deal only with bidirectional data, and be extended later, as opposed to rewritten).
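A minimal sketch of this third option, again as a Python/SQLite illustration. The column name `directionality` comes from the comment above, but the three value names ('bi', 'ltr', 'rtl') and the default are assumptions:

```python
# Sketch of the proposed 'directionality' column with three values.
# Value names 'bi'/'ltr'/'rtl' and the DEFAULT are hypothetical choices.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE word_connections (
        word1_id INTEGER,
        word2_id INTEGER,
        directionality TEXT DEFAULT 'bi'  -- 'bi', 'ltr', or 'rtl'
    )
""")
# Existing mined data is assumed bidirectional, via the column default.
conn.execute("INSERT INTO word_connections (word1_id, word2_id) VALUES (1, 2)")

# Rows usable in the word2 -> word1 direction: bidirectional rows plus
# any rows later marked explicitly right-to-left.
usable = conn.execute("""
    SELECT word1_id FROM word_connections
    WHERE word2_id = ? AND directionality IN ('bi', 'rtl')
""", (2,)).fetchall()
print(usable)  # [(1,)]
```

Nothing is duplicated; if a particular connection later turns out to be direction-specific, only its `directionality` value changes, and the queries already handle it.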
@jonorthwash I couldn't agree more.
JNW's solution may bypass this concern for now. But I'm looking down the road, and am worried.
I'll post thoughts now. And then look in even more detail later tonight.
Two issues, which work together, are affecting my thinking. I think you are more concerned about redundancy, and that may cause the difference in our conclusions. 1) I don't worry about redundancy at all (in the same way that Wiktionary does not worry about redundancy). 2) The Chinese-to-English records would not be duplicate records, and they should not be thought of as redundant either. Instead, they would be different records with identical data from the same source. The difference is conceptual, not literal - though there are literal differences as well: the records would have different IDs and different focal words. And the fact that the records serve very different individuals (with zero overlap) means that over time they will diverge considerably in the supporting tables.
The concern is that learning Chinese from English and English from Chinese are profoundly different activities (zero individuals are doing both). And I'm concerned about even starting on the path of thinking about them symmetrically.
Let me ask the following question about the alternatives:
So in one solution we make new records for the Chinese learner. In the other solution, we change the program so that a Chinese learner uses records that were used for English to Chinese.
This is a correct interpretation of the differences, correct?
@Marc-Bogonovich 1) I didn't say the real data will be redundant. I meant that the dummy data we would insert into the word_connections table NOW (option 1) is redundant, and will be redundant forever, because when we have real Chinese->English data we will insert new records into the word_connections table instead of MODIFYING the existing ones. So I only want to express that option 1 means inserting dummy data, and this leads to redundant programming effort (which I really don't want to waste).
2) No, that is not the difference. The difference is that in option 1 we are MAKING dummy Chinese->English data based on the current English->Chinese data we have, while in option 2 I ASSUME we already have Chinese->English data based on the current English->Chinese data we have. We are not changing any data structures or the concept of asymmetric connections at all; that's why I said in the very beginning that it is a "temporary" solution, only to bring the development forward. Otherwise our current app is only for English learners (people who speak English natively), not for Chinese learners. Does this make sense?
By the way, I don't see what is wrong with "starting on the path of thinking about them symmetrically" - we can't even think about it? I mean, I could learn English from both an English->Chinese dictionary and a Chinese->English dictionary, and I could learn Chinese from those two dictionaries as well. What is the problem?
Well. Think about your dictionary.
You said you could learn from a Chinese to English dictionary and then also an English to Chinese dictionary.
But my concern is that that is not an accurate description of these kinds of dictionaries.
For example, I don't have a Chinese-to-English dictionary and an English-to-Chinese dictionary in one book. Instead, I have an English Chinese/English dictionary, where I can look up meaning from either direction but the supporting information is all in English. The latter is a more accurate description of the thing I have, and I think that difference is crucial.
(There also exists in the world Chinese Chinese/English dictionaries)
I think we agree on all facts, and could come to a decision but it would need to be in person.
I can be persuaded to your position, or not. But I think we need to talk in person.
There's another issue tangled up with this one: a speaker of Language A learning Language B may want to learn by guessing words they ~know in Language B by being given words in Language A, or they may want to go from words in Language B and guess their meaning in Language A. As I understand it, work to this point has assumed that there's no difference in storage of the data needed for these, but perhaps this assumption needs some extra thought too.
I agree that this conversation would be best continued in person. I'd be happy to join the discussion if you'd like me to. I can try to think of some examples to problematise the various approaches in the meantime.
@Marc-Bogonovich When I learn English, sometimes for fun I look up a Chinese word in a dictionary to get its supporting information in English; then I can look up that English supporting information in another dictionary to get supporting information in Chinese. This way I can tell which words/phrases are more common (used for explaining basic concepts) in English. This is one option, but I don't mean that this option should override all other options.
However, those (the things we discussed above) are still a digression from the gist of this issue. For this issue (please review my issue content), I am ONLY asking to assume we already have Chinese->English (and other) data instead of inserting new but dummy Chinese->English (and other) data, so that I can start to program the features for learners who natively speak Chinese, Japanese, German, Korean, Spanish......
@hanaldo @jonorthwash I trust your judgement Shenshen. I've got serious worries about this, but if we discuss it through, and you still think it a reasonable way to move for the time being, then that can be ok.
Do you want to treat the data in the manner you have suggested for right now? I'd like to discuss the issue at the Saturday 1:00pm meeting.
Reading through this in more detail, one encouraging thing I see is that your reasoning, Shenshen, implies you think we will ultimately be successful in mining Chinese->English data sources.
When Archie and I considered the problem, we saw limitation as the guiding principle. Namely, Archie would point out the limitations of the Chinese->English Wiktionary. Thus, I was thinking of the English->Chinese switch not as a temporary trick, but as the sole and first means at our disposal to build a Chinese->English db.
Reflecting on the issue, your optimism is probably correct. Thus, I'm starting to be persuaded. But as both @hanaldo and @jonorthwash mentioned above, there are actually several issues wrapped into this problem.
Hi Marc, I do believe we will be successful in mining another Chinese->English data source, and another Spanish->English data source, etc.; there will be a lot of methods. However, mining a Chinese->English data source and inputting those new word connections into our database will be a totally separate issue from this one, because there are a lot of factors we need to consider, such as the origin of the word source, the quantity of words in the source, the type of the source, and the merging/non-merging of the new connections with existing connections. For example, a typical Chinese->English data source may not fully reveal the knowledge of the Chinese language; we may even have to mine a Chinese->Chinese dictionary and translate those definitions into English (and Spanish, Japanese, German, French, etc.). Also, what about the words in the medical, chemistry, and even computer science fields? So I don't think a typical Chinese->English data source can cover everything (and this applies to Chinese->Japanese, Chinese->Spanish, Chinese->German... as well), but our database of words should not stop growing. So there are a lot of things to discuss before we really choose and mine a Chinese->English data source.
So this issue is not about how to mine a real Chinese->English data source; it's about bringing the multi-language learning experience alive in our current app, and thus introducing more concrete problems/questions in design, data structure, and mining activities. By the end of this week I'll finish the Chinese localization feature, and it will be a good starting point for implementing the English (or other languages) learning experience for Chinese users. But we are not mining any data source yet except the English Wiktionary, so I just want to keep our current data and data structure intact while I can still implement/configure the learning modules for Chinese users. Thus "assuming we already have the Chinese->English data based on what we already have" seems to be the safest approach.
By the way, please make a separate issue about "how to build a Chinese->English db" or "how to find and mine a Chinese->English data source", we will have a lot of work to do for that issue later on.
Also, there is a huge difference between English culture and Chinese culture in terms of language and words, which is fairly important to know before we choose and mine a Chinese data source. I can tell you more about this on Saturday, but please consider it a separate issue.
As we have discussed many times regarding issue https://github.com/Marc-Bogonovich/Openwords/issues/55, we really need to take a step forward. Also, when the Chinese localization (https://github.com/Marc-Bogonovich/Openwords/issues/62) is finished, we need Chinese->English connections for Chinese people to learn English. Now we have two approaches: 1) We just swap word1_id and word2_id in the word_connections table and insert those new semi-duplicated records.
2) We temporarily assume the relationship in word_connections is bi-directional, so we just need a separate query for word2_id->word1_id.
I personally recommend the second approach.