Open obo opened 10 years ago
We're checking for new e-mails every minute, so it should appear almost instantly (you can go to /admin/translations and sort by 'created' to check if the translation was added). However, we're picking the tweets for scoring completely at random, using this query (generated by CakePHP):
SELECT `Translation`.`post_id`, `Translation`.`lang_id`
FROM `tct`.`translations` AS `Translation`
WHERE 1 = 1
GROUP BY `Translation`.`post_id`, `Translation`.`lang_id`
HAVING COUNT(DISTINCT `Translation`.`text`)>=2
ORDER BY rand() ASC
LIMIT 1
How can we make it recency-weighted?
The formulas at https://github.com/cifkao/tct/blob/master/papers/doc/ranking were designed a while ago. They are slightly recency-weighted, in that the most recent items (a few minutes old) get a boost which drops quickly.
This comment of mine did not make it from the e-mail to GitHub:
A very simple (and not mathematically sound) idea:
ORDER BY
rand()*1/(1/2*(age_of_translation + age_of_original))
DESC
My particular design of the recency weight is rather stupid: 1 over the average age of the translation itself and the original.
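To make the idea above concrete, here is a minimal Python sketch of the same selection rule: every item draws a random key between 0 and its recency weight, and the item with the highest key wins. The function names, the tuple layout, and the one-second floor on the age are assumptions made for illustration; the real query runs in MySQL.

```python
import random

def recency_key(age_translation_s, age_original_s, rng=random):
    """Random key in [0, 1/avg_age): newer items can draw higher keys."""
    avg_age = 0.5 * (age_translation_s + age_original_s)
    # Assumption: floor the age at one second to avoid division by zero.
    return rng.random() / max(avg_age, 1.0)

def pick(items, rng=random):
    """items: list of (item_id, age_translation_s, age_original_s).

    Returns the id of the item with the highest random key, which is what
    ORDER BY <key> DESC LIMIT 1 would select.
    """
    return max(items, key=lambda it: recency_key(it[1], it[2], rng))[0]
```

Note how steep this weight is: for two items whose average ages differ by a factor of 100, the older one wins only about 0.5% of the time. Conversely, if all items are old and of similar age, the ordering is close to uniformly random.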
I am marking this as a bug now since it prevents me from showing off how good we are at speedy translations. :-) If anyone can change the formula right now to anything recency-based, it would be very useful.
So I did this:
ORDER BY
rand()*1/(
TO_SECONDS(NOW())
-0.5*(TO_SECONDS(TranslationRequest.created)+MAX(TO_SECONDS(Translation.created)))
) DESC
Didn't notice any change, but maybe that's because there are no really recent translations...
I've submitted quite a few translations since then and I don't see them in the scoring yet. Obviously, we need to change the formula after we check how it actually behaves on real data.
Ondrej Bojar (mailto:obo@cuni.cz / bojar@ufal.mff.cuni.cz) http://www.cuni.cz/~obo
I'll provide some.
We should also have a max-lag limit. We should not publish translations older than some threshold, as the news can be almost misleading due to the delay. It is, however, not easy to define; it depends on the nature of the news. Perhaps something like: never publish a tweet older than 48 hours if its author has published more than 4 tweets in the meantime.
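The max-lag rule proposed above could be sketched as follows. The 48-hour and 4-tweet thresholds come from the proposal; the function name and the way tweet times are passed in are hypothetical.

```python
from datetime import datetime, timedelta

MAX_LAG = timedelta(hours=48)  # threshold from the proposal
MAX_NEWER_TWEETS = 4           # threshold from the proposal

def is_too_stale(tweet_time, author_tweet_times, now):
    """True if the tweet is older than MAX_LAG and its author has posted
    more than MAX_NEWER_TWEETS tweets since then."""
    if now - tweet_time <= MAX_LAG:
        return False
    newer = sum(1 for t in author_tweet_times if t > tweet_time)
    return newer > MAX_NEWER_TWEETS
```

An old tweet from a quiet account still passes this check, which matches the intent that the limit should depend on how quickly the author's news moves on.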
Just to make sure we are on the same page: perfecting this formula was my plan for the conference project. As described in the document, I think it would help to put a bit of weight on a tweet every time it is judged, so that we ensure an equal distribution. (Putting weight on a tweet means showing it less often.)
On Sep 6, 2014 6:30 PM, "Ondrej Bojar" notifications@github.com wrote:
A very simple (and not mathematically sound) idea:
ORDER BY rand()*1/(1/2*(age_of_translation + age_of_original)) DESC
i.e. a random number between 0 and a varying value for every translation. This varying number should somehow express the 'recency weight': the more recent, the higher the weight. This makes the most recent item most likely, as its rand range also covers higher numbers that no other item can ever beat.
My particular design of the recency weight is rather stupid: 1 over the average age of the translation itself and the original.
I've submitted quite a few translations since then and I don't see them in the scoring yet.
Are they in the database?
Yes, I saw them on my page as the translator.
Could you please change the formula to something extremely simple so that the most recent translations are scored first? Since we are scoring pairs, we should simply construct all (unordered) pairs and sort them by their average age for now. If we made it this simple, the single pair of the two most recent translations would keep collecting scores and the others would not get anything. So let's also add e.g. 6 hours to the average age for every scoring the pair already has. That will make sure that pairs with no scoring get scored first (sorted by recency), and pairs with one scoring get more scores only when there is no unscored pair younger than 6 hours. ...Well, as Eda says, perfecting this formula is the goal of the MTM project. :-)
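A small Python sketch of this pair ordering, assuming ages are given in seconds and the number of existing scorings per pair is known. The data layout and names are hypothetical, and grouping by post and language is left out for brevity.

```python
from itertools import combinations

PENALTY_S = 6 * 3600  # 6 hours per existing scoring, as suggested above

def rank_pairs(ages, pair_scorings):
    """ages: dict mapping translation id -> age in seconds.
    pair_scorings: dict mapping frozenset({id_a, id_b}) -> scorings so far.

    Returns all unordered pairs, the most urgent pair first.
    """
    def effective_age(pair):
        a, b = pair
        avg_age = 0.5 * (ages[a] + ages[b])
        return avg_age + PENALTY_S * pair_scorings.get(frozenset(pair), 0)

    return sorted(combinations(sorted(ages), 2), key=effective_age)
```

With the penalty in place, a freshly scored pair drops behind every unscored pair whose average age is less than 6 hours greater.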
I changed it so that we simply choose one translation using this clause:
ORDER BY TO_SECONDS(`Translation`.`created`) - `Translation`.`scoring_count`*60*30 DESC
So 30 minutes are added to the age of a translation for every non-null scoring.
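For readers who want to experiment with this ordering outside the database, here is a Python sketch of the same sort key. The row layout is a hypothetical stand-in for the `translations` table, with `created_s` playing the role of `TO_SECONDS(created)`.

```python
PENALTY_S = 30 * 60  # 30 minutes per non-null scoring

def next_translation(rows):
    """rows: list of dicts with 'id', 'created_s' and 'scoring_count'.

    Returns the id of the row that
    ORDER BY created_s - scoring_count * 1800 DESC LIMIT 1 would select.
    """
    best = max(rows, key=lambda r: r["created_s"] - r["scoring_count"] * PENALTY_S)
    return best["id"]
```

One property of this key: a translation that is much newer than the runner-up stays on top until enough 30-minute penalties have accumulated, so without a per-user filter the same judge may be shown it several times in a row.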
Thanks! The scoring now seems much more up to date and interesting. Now we need good translations and the shutter.
Scoring is now recent; however, in about 10 translations I have just judged, I got the same translation 3 times.
Yes, we need to somehow filter translations based on cookies (some very short term cookies should be ok).
What about storing scored translations in the browser, so whenever I score something, the ID will get saved into the list stored in my cookies. This list should expire in say 6 or 12 hours.
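The short-lived list of scored IDs could look roughly like this. The sketch models the cookie content as a plain mapping from translation ID to the time of scoring and expires entries individually, which is a slight variation on expiring the whole list at once; the 6-hour value and all names are assumptions.

```python
import time

TTL_S = 6 * 3600  # assumed expiry; the suggestion above is 6 or 12 hours

def remember(seen, translation_id, now=None):
    """Return a new mapping with expired entries dropped and the
    just-scored translation_id added."""
    now = time.time() if now is None else now
    fresh = {tid: ts for tid, ts in seen.items() if now - ts < TTL_S}
    fresh[translation_id] = now
    return fresh

def candidates(all_ids, seen, now=None):
    """IDs that may be shown: never scored, or scored longer ago than TTL_S."""
    now = time.time() if now is None else now
    return [tid for tid in all_ids if tid not in seen or now - seen[tid] >= TTL_S]
```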
We are saving a hash based on the user's IP and browser for each judging, so it should be easy to filter that way. I do not think that we should ever allow anyone to score one translation twice.
We should allow scoring of the same item by the same person more than once, but only occasionally, say at random with the chance of 10%. Having such data points in the dataset is useful for intra-annotator agreement checks.
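One possible reading of the 10% rule, sketched in Python: on each request, skip the "already scored" filter with probability 0.10. The function name and the per-request interpretation are assumptions; the rule could equally be applied per item.

```python
import random

REPEAT_PROBABILITY = 0.10  # the 10% chance suggested above

def eligible(translation_ids, already_scored, rng=random):
    """Return the IDs a judge may be shown on this request.

    With probability REPEAT_PROBABILITY the filter is skipped, so an
    already-scored item can come up again and yield an intra-annotator
    data point.
    """
    if rng.random() < REPEAT_PROBABILITY:
        return list(translation_ids)
    return [t for t in translation_ids if t not in already_scored]
```

Note that this gives a 10% chance of an unfiltered pool, not a guarantee that 10% of the shown items are repeats; how many repeats actually occur depends on the ordering applied afterwards.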
We can either use session variables or cookies – I don't know which is better. We can also use the IP/browser hash, but that might be less accurate and we will have to start logging skipped scorings (now we're deleting them).
What about using data storage in HTML5?
http://www.w3schools.com/html/html5_webstorage.asp
Shadi Saleh, Ph.D. Student, Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague, 16017 Prague 6, Czech Republic, Mob +420773515578
I do not think that we need to store any data on the user's side; we already have the user identification, translation ID, and time of scoring in our database.
I just implemented the IP/browser hash approach. Now a user won't be asked to judge any translation more than once.
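As an illustration only, filtering by an IP/browser hash might look like the sketch below. The actual hashing scheme and table layout used in the project are not shown in this thread, so `judge_hash`, the SHA-256 choice, and the row format are all hypothetical.

```python
import hashlib

def judge_hash(ip, user_agent):
    """Hypothetical fingerprint built from the IP and the browser string."""
    return hashlib.sha256(f"{ip}|{user_agent}".encode("utf-8")).hexdigest()

def unjudged(translation_ids, scorings, ip, user_agent):
    """scorings: iterable of (judge_hash, translation_id) rows.

    Returns the IDs this judge has not scored yet.
    """
    me = judge_hash(ip, user_agent)
    done = {tid for h, tid in scorings if h == me}
    return [tid for tid in translation_ids if tid not in done]
```

As noted earlier in the thread, such a hash may be less accurate than cookies or sessions: two people sharing an IP address and browser version would be treated as one judge.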
I just translated a few things (say 20 minutes ago) and I wanted to score them and see them in the manual shutter. When do they appear in the scoring?
I believe we should really give a high weight to the recency of the tweet and/or the translation.