cifkao / tct

Twitter Crowd Translation – infrastructure for human and machine tweet translation
GNU General Public License v2.0

Scoring order (was: when do the items for scoring get updated?) #21

Open obo opened 10 years ago

obo commented 10 years ago

I just translated a few things (say 20 minutes ago) and I wanted to score them and see them in the manual shutter. When do they appear in the scoring?

I believe we should give a really high weight to the recency of the tweet and/or the translation.

cifkao commented 10 years ago

We're checking for new e-mails every minute, so it should appear almost instantly (you can go to /admin/translations and sort by 'created' to check if the translation was added). However, we're picking the tweets for scoring completely at random, using this query (generated by CakePHP):

SELECT `Translation`.`post_id`, `Translation`.`lang_id`
  FROM `tct`.`translations` AS `Translation`
  WHERE 1 = 1
  GROUP BY `Translation`.`post_id`, `Translation`.`lang_id`
    HAVING COUNT(DISTINCT `Translation`.`text`)>=2
  ORDER BY rand() ASC
  LIMIT 1

How can we make it recency-weighted?
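
For anyone who wants to try the selection logic outside the app, here is a minimal sketch that reproduces the shape of the query in Python with an in-memory SQLite database. It is a stand-in under stated assumptions: SQLite's `random()` replaces MySQL's `rand()`, the table holds only the three columns the query touches, and all rows are made up.

```python
import sqlite3

def pick_random_group(conn):
    """Return one (post_id, lang_id) group that has at least two distinct texts."""
    return conn.execute(
        """
        SELECT post_id, lang_id
          FROM translations
         GROUP BY post_id, lang_id
        HAVING COUNT(DISTINCT text) >= 2
         ORDER BY random()
         LIMIT 1
        """
    ).fetchone()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE translations (post_id INTEGER, lang_id INTEGER, text TEXT)")
conn.executemany(
    "INSERT INTO translations VALUES (?, ?, ?)",
    [
        (1, 1, "hello"), (1, 1, "hi"),          # two distinct texts: eligible
        (2, 1, "same"), (2, 1, "same"),         # identical texts only: not eligible
        (3, 1, "only one"),                     # single translation: not eligible
        (4, 2, "a"), (4, 2, "b"), (4, 2, "c"),  # three distinct texts: eligible
    ],
)

# Repeated draws only ever return eligible groups, with no regard to age.
picks = {pick_random_group(conn) for _ in range(200)}
```

Every eligible group is equally likely on each draw, which is why a translation added minutes ago has no better chance than one added weeks ago.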

edasubert commented 10 years ago

A while ago these formulas were designed (https://github.com/cifkao/tct/blob/master/papers/doc/ranking). They are slightly recency-weighted, in that the most recent items (a few minutes old) get a boost which drops off quickly.

obo commented 10 years ago

This comment of mine did not make it from the e-mail to github:

A very simple (and not mathematically sound) idea:

ORDER BY
  rand()*1/(1/2*(age_of_translation + age_of_original))
  DESC

i.e. a random number between 0 and a varying value for every translation

this varying number should somehow express the 'recency weight': the more recent, the higher the weight

this makes the most recent item most likely, as its rand range also covers higher numbers that no other item can ever beat

My particular design of the recency weight is rather stupid: 1 over the average age of the translation itself and the original.
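
A minimal simulation of this idea, assuming the weight is exactly 1 over the average age in seconds and that the item with the largest `rand() * weight` is the one shown. The three ages and the function names are hypothetical.

```python
import random

def pick(ages, rng):
    """Return the index maximising rand() * (1 / age), like ORDER BY ... DESC LIMIT 1."""
    return max(range(len(ages)), key=lambda i: rng.random() / ages[i])

def win_counts(ages, trials=10_000, seed=0):
    """Count how often each item is picked over many independent draws."""
    rng = random.Random(seed)
    counts = [0] * len(ages)
    for _ in range(trials):
        counts[pick(ages, rng)] += 1
    return counts

# Hypothetical average ages in seconds: 1 minute, 1 hour, 1 day.
counts = win_counts([60, 3600, 86400])
```

With ages this far apart the one-minute-old item wins almost every draw, so the weight behaves closer to "newest first" than to a gentle bias.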

obo commented 10 years ago

I am marking this as a bug now since it prevents me from showing off how good we are at speedy translations. :-) If anyone can change the formula right now to anything recency-based, it would be very useful.

cifkao commented 10 years ago

So I did this:

ORDER BY
  rand()*1/(
    TO_SECONDS(NOW())
    -0.5*(TO_SECONDS(TranslationRequest.created)+MAX(TO_SECONDS(Translation.created)))
  ) DESC

Didn't notice any change, but maybe that's because there are no really recent translations...
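
One possible explanation, offered only as a sketch: with a weight of 1/age, two items that are both days old have almost the same weight. For two items with independent uniform draws, the chance that the younger one is ranked first works out to `1 - age_young / (2 * age_old)`. The helper name and the example ages below are assumptions for illustration.

```python
def p_younger_wins(age_young, age_old):
    """P(U1 / age_young > U2 / age_old) for independent uniform U1, U2 on [0, 1]."""
    if not 0 < age_young <= age_old:
        raise ValueError("expected 0 < age_young <= age_old")
    return 1 - age_young / (2 * age_old)

DAY = 86400  # seconds

old_vs_older = p_younger_wins(10 * DAY, 11 * DAY)  # both items are over a week old
fresh_vs_old = p_younger_wins(60, DAY)             # one minute old vs one day old
```

A 10-day-old item beats an 11-day-old one only about 55% of the time, which is hard to tell apart from the purely random order, while a one-minute-old item beats a day-old one nearly always.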

obo commented 10 years ago

I've submitted quite a few translations since then and I don't see them in the scoring yet. Obviously, we need to change the formula after we check how it actually behaves on real data.

obo commented 10 years ago

I'll provide some.

We should also have a max-lag limit: we should not publish translations older than some threshold, as the news can be almost misleading due to the delay. It is not easy to define, however, since it depends on the nature of the news. Perhaps something like: never publish a tweet older than 48 hours if its author has published more than 4 tweets in the meantime.
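
The proposed rule could be sketched as below. The 48-hour and 4-tweet thresholds come from the comment; the function name, the argument layout and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta

MAX_LAG = timedelta(hours=48)  # "never publish a tweet older than 48 hours"
MAX_NEWER_TWEETS = 4           # "more than 4 tweets in the meantime"

def too_stale_to_publish(tweet_time, author_tweet_times, now):
    """True if the tweet exceeds MAX_LAG and its author has since posted
    more than MAX_NEWER_TWEETS newer tweets."""
    if now - tweet_time <= MAX_LAG:
        return False
    newer = sum(1 for t in author_tweet_times if t > tweet_time)
    return newer > MAX_NEWER_TWEETS

now = datetime(2014, 9, 8, 12, 0)
old_tweet = datetime(2014, 9, 5, 12, 0)                              # 72 hours old
busy_author = [old_tweet + timedelta(hours=h) for h in range(1, 6)]  # 5 newer tweets
quiet_author = [old_tweet + timedelta(hours=1)]                      # 1 newer tweet
```

Here `too_stale_to_publish(old_tweet, busy_author, now)` is true, while the same tweet from the quiet author would still be published.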

edasubert commented 10 years ago

Just to make sure we are on the same page: perfecting this formula was my plan for the conference project. And, as described in the document, I think it would help to put a bit of weight on a tweet every time it is judged, so that we ensure an equal distribution. ("Put weight" means "show less often".)

cifkao commented 10 years ago

> I've submitted quite a few translations since then and I don't see them in the scoring yet.

Are they in the database?

obo commented 10 years ago

Yes, I saw them on my page as the translator.

obo commented 10 years ago

Could you please change the formula to something extremely simple so that the most recent translations are scored first? Since we are scoring pairs, we should simply construct all (unordered) pairs and sort them by their average age for now. If it were made this simple, the single pair of the two most recent translations would keep accumulating scores and the others would not get anything. So let's also add e.g. 6 hours to the average age for every scoring the pair already has. That will make sure that pairs with no scoring get scored first (sorted by recency) and pairs with one scoring get more scores only when there is no unscored pair younger than 6 hours. ...Well, as Eda says, perfecting this formula is the goal of the MTM project. :-)
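
A minimal sketch of this pair ordering, assuming ages are given in seconds and that the number of existing scorings per pair is available as a lookup. The data layout, the names and the example values are hypothetical.

```python
from itertools import combinations

PENALTY = 6 * 3600  # seconds added to a pair's average age per existing scoring

def order_pairs(ages, scorings):
    """ages: {translation_id: age in seconds}
    scorings: {frozenset({id_a, id_b}): number of scorings so far}
    Returns all unordered pairs, with the pair to score next at the front."""
    def effective_age(pair):
        a, b = pair
        return (ages[a] + ages[b]) / 2 + PENALTY * scorings.get(frozenset(pair), 0)
    return sorted(combinations(sorted(ages), 2), key=effective_age)

ages = {"t1": 600, "t2": 1200, "t3": 7200}  # hypothetical translations
fresh = order_pairs(ages, {})
after_one = order_pairs(ages, {frozenset(("t1", "t2")): 1})
```

Before any scoring the youngest pair `("t1", "t2")` comes first; once it has one scoring, the 6-hour penalty pushes it behind both unscored pairs.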

cifkao commented 10 years ago

I changed it so that we simply choose one translation using this clause:

ORDER BY TO_SECONDS(`Translation`.`created`) - `Translation`.`scoring_count`*60*30 DESC

So 30 minutes are added to the age of a translation for every non-null scoring.
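
The effect of that clause can be sketched in Python as follows, assuming `created` is already expressed in seconds (as `TO_SECONDS` would return) and that the first row of the ordering is the one shown. The rows and names are made up.

```python
PENALTY = 30 * 60  # seconds subtracted from the sort key per non-null scoring

def next_to_score(translations):
    """translations: list of (id, created_seconds, scoring_count).
    Returns the id that the DESC ordering would rank first."""
    return max(translations, key=lambda t: t[1] - t[2] * PENALTY)[0]

rows = [
    ("new_unscored", 10_000, 0),
    ("newest_scored_once", 11_000, 1),  # 1000 s newer, but penalised by 1800 s
    ("old_unscored", 5_000, 0),
]
chosen = next_to_score(rows)
```

Note that a translation more than 30 minutes newer than every other candidate would still rank first after one scoring, so it can be selected again until its penalty catches up.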

obo commented 10 years ago

Thanks! The scoring now seems much more up to date and interesting. Now we need good translations and the shutter.

edasubert commented 10 years ago

Scoring is now recent; however, in about 10 translations I have just judged, I got the same translation 3 times.

obo commented 10 years ago

Yes, we need to somehow filter translations based on cookies (some very short term cookies should be ok).

What about storing scored translations in the browser? Whenever I score something, the ID would get saved into a list stored in my cookies. This list should expire in, say, 6 or 12 hours.

edasubert commented 10 years ago

We are saving a hash based on the user's IP and browser for each judging, so it should be easy to filter that way. I do not think that we should ever allow one translation to be scored twice.

obo commented 10 years ago

We should allow scoring of the same item by the same person more than once, but only occasionally, say at random with the chance of 10%. Having such data points in the dataset is useful for intra-annotator agreement checks.
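
A minimal sketch of such an occasional repeat, assuming the set of items a person has already scored is known when the next item is chosen. The 10% figure is from the comment; the function and variable names are hypothetical.

```python
import random

RESCORE_PROBABILITY = 0.10  # "say at random with the chance of 10%"

def may_show(translation_id, already_scored, rng):
    """Always allow unseen items; allow an already-scored item
    only with probability RESCORE_PROBABILITY."""
    if translation_id not in already_scored:
        return True
    return rng.random() < RESCORE_PROBABILITY

rng = random.Random(0)
seen = {42}
# How often an already-scored item would be let through in 10,000 attempts.
repeats = sum(may_show(42, seen, rng) for _ in range(10_000))
```

Roughly one in ten attempts lets a repeat through, which yields the duplicate judgements needed for intra-annotator agreement without dominating the queue.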

cifkao commented 10 years ago

We can either use session variables or cookies – I don't know which is better. We can also use the IP/browser hash, but that might be less accurate and we will have to start logging skipped scorings (now we're deleting them).

salehshadi commented 10 years ago

What about using data storage in HTML5?

http://www.w3schools.com/html/html5_webstorage.asp

edasubert commented 10 years ago

I do not think that we need to store any data on the user's side; we already have the user identification, translation ID and time of scoring in our database.

cifkao commented 10 years ago

I just implemented the IP/browser hash approach. Now a user won't be asked to judge any translation more than once.
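
For readers following along, the filtering idea can be sketched like this. The thread does not say how the real hash is built or how the filter is written, so the SHA-256 fingerprint, the function names and the data layout below are all assumptions.

```python
import hashlib

def judge_hash(ip, user_agent):
    """Hypothetical fingerprint: SHA-256 over the IP address and User-Agent string."""
    return hashlib.sha256(f"{ip}\n{user_agent}".encode("utf-8")).hexdigest()

def unjudged(candidates, scorings, user):
    """candidates: translation ids in ranking order.
    scorings: set of (judge_hash, translation_id) pairs already recorded.
    Returns the candidates this user has not judged yet, order preserved."""
    return [t for t in candidates if (user, t) not in scorings]

alice = judge_hash("192.0.2.1", "Mozilla/5.0")
bob = judge_hash("192.0.2.2", "Mozilla/5.0")
scorings = {(alice, 1), (alice, 3), (bob, 1)}

alice_queue = unjudged([1, 2, 3], scorings, alice)
bob_queue = unjudged([1, 2, 3], scorings, bob)
```

Two people behind the same IP with the same browser would share a fingerprint in this sketch, which is the accuracy trade-off raised earlier in the thread.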