Do research about Linked Data resources like Wordnet and Lexvo

set120120 commented 2 months ago

Description

We need to create our path to develop our project. To be able to this, we need to know how to use linked data resources. The first mission of the project is learning Wordnet, Lexvo or similar thing like them.

Tasks

[x] Do research about Wordnet, Lexvo or something equivalent - @odenizddd
[x] Do research about Wordnet, Lexvo or something equivalent - @HutkuC
[x] Do research about Wordnet, Lexvo or something equivalent - @arincdemir
[x] Do research about Wordnet, Lexvo or something equivalent - @ramazanoacar
[x] Do research about Wordnet, Lexvo or something equivalent - @fahreddinozcan
[x] Do research about Wordnet, Lexvo or something equivalent - @semihYILMAZ37
[x] Do research about Wordnet, Lexvo or something equivalent - @set120120
[x] Do research about Wordnet, Lexvo or something equivalent - @ebrarkiziloglu

Estimated Time

2 days

Deadline

01.10.2024 12.00

Reviewer

No need to a reviewer

odenizddd commented 2 months ago

Also when we are making research, I think it is beneficial to keep in mind what we are trying to achieve.

We need to leverage a Linked Data resource in order to create quiz question. So we should be thinking about stuff like

How do we generate a question?
How do we generate distractors (wrong options) for the questions?
How do we determine the level of words so that we can show words that are close to the user's level?

Also other questions that you think might be relevant.

I wish everyone a great research session!

set120120 commented 1 month ago

While researching Linkeddata resources, I found a multilingual resource called BabelNet. We can also get our words from here. Here, words are separated according to their concepts, and this can be useful. Also, Wordnet is the most suitable Linkeddata resource for me right now. Especially, the fact that it gives the meanings of words as synonyms, antonyms, and close ones, separates them according to their concepts, and operates with a tree structure logic, makes it one step ahead of other Linkeddata in my eyes. Now, let's come to Deniz's 3 questions.

Answer to the first question: I think we should generally separate words according to their categories in order to create questions. After dividing the words among categories such as football, mathematics, science, art, transportation, etc., we should create quizzes by randomly selecting the appropriate words from the category's datasets.

Answer to the second question: I think there should definitely be words with similar spellings but completely different meanings among the wrong answers (like quiet and quite). There should also be words with opposite meanings (if any), words with similar meanings (how to determine this can be thought of later), and grammatical traps.

Answer to the third question: I can think of 2 ways for this. The first is the percentage of words used by referencing online sources, and the second is the difficulty according to the CEFR level that we will obtain from sources such as Cambridge, Oxford. Both are quite acceptable and logical solutions.

Also, I will explain my thoughts to you in a more detail at the next meeting.

odenizddd commented 1 month ago

Thank you so much for sharing your findings with us @set120120 .

I completely agree with you that BabelNet is a great resource for finding translations and we should definitely consider using it. The only problem with BabelNet is its API call rate which is limited to 1000 per day, however we can work around this by caching calls.

We should ask our instructor if we can use BabelNet, in the next lab. I think it has the potential to be the number one translation tool for our project.

fahreddinozcan commented 1 month ago

While doing the research I've also came across ConceptNet, which approaches the words as concepts, which would also provide related verbs etc. That sounded like a nice focus, since some of these libraries only focus on nouns. After comparing it to other available tools, it also occurred to me that wordnet is also the best option here, since it provides various approaches to words.

For the questions,

How do we generate a question?

For the quizzes generated by the user, we should just make the user select from one:

user inputs a topic/theme (fruit etc.)
- In this case, we iterate through neighbors of that word, also including hypernyms, hyponyms etc. so we create a mix from different relations.
all random
- This case also covers quizzes generated by us.

While selecting a word, we could consider these:

Difficulty of the word We could use NLTK to asses a word, since NLTK already contains some statistics about the word. Our function could be sth like this

difficulty_score = (
    (1 / (frequency + 1)) * 1000 +  # Rarer words are harder
    length * 0.5 +                  # Longer words are harder
    syllable_count * 2 +            # More syllables are harder
    sense_count * 1                 # More senses are harder
)

Words the user has already learned While doing the above , we should just keep some statistics (T/F, answering time etc.) to asses the user's progress, and if we should increase/decrease their level.
Spaced Repetition This is basically when the words should be reviewed. To achieve this, we could have the concepts below
- initial schedule After we show the user a word, we directly schedule it to be asked again. Something like in 3 questions, in 1 day etc...
- performance based review if the user's score related to that word is still not increasing, we should increase the frequency
- review sessions there could be a review quiz after 3 consecutive quizzes, including the previous quizzes info.

ebrarkiziloglu commented 1 month ago

Hello everyone! Thank you for sharing your inspiring findings @odenizddd @set120120 and @fahreddinozcan!

Here are some remarks from me:

As for the Linked Data sources, WordNet is a promising source, due to its tree-like structure. It also has a great documentation we could benefit from. I also checked BabelNet as it was mentioned in the above comments. And its translation functionality complements WordNet, and two of them together might be a good base for the data for our project.

As for answering the questions above:

How do we generate a question?

I want to answer separately, for the user-generated and randomly-generated quizzes. For the user-generated quizzes, I believe the question should be whether (1) we will allow users to add specific words to add to their quizzes, or (2) only allow them to choose one of the pre-defined concepts for the quiz, and determine the words from the chosen concept. If we go with choice (1), we need a system to reject/accept user-provided words, if they are from different levels of difficulty. I believe this can be solved by a wide range of solutions (we can elaborate more in the next meeting), but it might also be an overkill easily. Choice (2), which @fahreddinozcan also recommended above, looks safer and more concrete in terms of user-friendliness, but the users have less freedom in this case. Let us discuss this more.

For the randomly generated quizzes, we can take the number of questions for each type of questions from the user. And for any type of question, I agree with @set120120’s idea above. Generating words according to pre-defined categories sounds great.

How do we generate distractors (wrong options) for the questions?

Putting words with similar spellings but different meanings (quiet vs. quite) to the multiple choices sounds good to me, but it might be hard to implement, we can discuss. I’d go with words with similar meanings to put to the wrong options. For this, we can benefit from a relational tree of words. WordNet also has a function for similar words as the following:

wn.synsets("happy")[0].similar_tos()

The output is like the following:

[Synset('blessed.s.06'), Synset('blissful.s.01'), Synset('bright.s.09'), Synset('golden.s.02'), Synset('laughing.s.01')]

How do we determine the level of words so that we can show words that are close to the user's level?

To determine the difficulty level of the words, I agree on utilizing NLTK to develop a simple function, as @fahreddinozcan suggested above.

Let us ask to TA about using BabelNet and WordNet and discuss other remarks among ourselves to decide on requirements and proceed with our project!

ramazanoacar commented 1 month ago

I made my research on wordnet and babelnet while working on the proof of concept of quiz generation with the script, my process can be seen here.

arincdemir commented 1 month ago

In addition to the resources you mentioned, there is an API called datamuse.

import requests
response = requests.get('https://api.datamuse.com/words', params={'ml': 'run'})
similar_words = [word['word'] for word in response.json()]
print(similar_words)

It is working on a an algorithm that doesn't rely on machine learning.

Speaking of machine learning, there are models available that turn words into vectors. Of them is called Word2Vec These models can query the meaning distance between two words by calculating the cosine difference between them.

We can also use these models in theory; however, the synonyms naturally land the closest to each other. So, in order to find a word that is similar, but not the same as our query word would require additional steps to make sure they are not synonyms.

arincdemir commented 1 month ago

I am closing the issue since everyone seems to be done.

bounswe / bounswe2024group5