cfedermann / Appraise

Appraise evaluation system for manual evaluation of machine translation output
http://www.appraise.cf/
BSD 3-Clause "New" or "Revised" License

Compare unique items not system outputs #45

Open mjpost opened 9 years ago

mjpost commented 9 years ago

Many times, the system outputs for a sentence are identical. Rather than constructing each task from a random subset of systems, each task should be constructed from the set of distinct outputs for that sentence. The pairwise rankings could then be re-associated with the systems to generate a larger set of pairwise rankings.
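A minimal sketch of the grouping step, assuming the outputs for one sentence are available as a plain dict from system name to translation. The function name `group_by_output` and the sample data are made up for illustration and are not part of Appraise:

```python
from collections import defaultdict


def group_by_output(outputs):
    """Map each distinct translation to the sorted list of systems that produced it.

    `outputs` is a hypothetical dict: system name -> translation string.
    """
    groups = defaultdict(list)
    for system, text in outputs.items():
        # Strip surrounding whitespace so trivially different strings collapse.
        groups[text.strip()].append(system)
    return {text: sorted(systems) for text, systems in groups.items()}


# Illustrative data only: two of three systems produce the same output.
outputs = {
    "sysA": "the cat sat",
    "sysB": "the cat sat",
    "sysC": "a cat sat",
}
distinct = group_by_output(outputs)
```

With this toy input the annotator would see two items instead of three, and a single judgment on "the cat sat" would apply to both `sysA` and `sysB`.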

This would be a bit more respectful of people's time (it's annoying to see identical outputs), and would also let us potentially gather data more quickly. On the WMT14 data, for example, there are identical system outputs on over half the sentences.

CC: @cfedermann

cfedermann commented 9 years ago

This will be fixed for WMT15

mjpost commented 9 years ago

I found a workaround for this that creates entries where the system name is a comma-delimited list of systems. You then just have to split those out and compile the (often much larger) set of rankings. If you want what I've done, let me know. That might be a better way than trying to do it internally.
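Roughly, the split-and-compile step could look like the sketch below. It is not the actual script: the input shape (a list of `(system_names, rank)` tuples, lower rank meaning better) and the name `expand_rankings` are assumptions for illustration. Systems that share one entry are not compared with each other here; adding ties for them would be a separate choice.

```python
from itertools import combinations, product


def expand_rankings(ranked_items):
    """Expand rankings over merged entries into per-system pairwise rankings.

    `ranked_items` is a hypothetical list of (system_names, rank) tuples,
    where system_names is a comma-delimited string and a lower rank is better.
    Returns (system_1, system_2, relation) tuples with relation in "<", ">", "=".
    """
    pairs = []
    for (names_a, rank_a), (names_b, rank_b) in combinations(ranked_items, 2):
        if rank_a < rank_b:
            relation = "<"
        elif rank_a > rank_b:
            relation = ">"
        else:
            relation = "="
        # Every system behind entry A is compared with every system behind entry B.
        for sys_a, sys_b in product(names_a.split(","), names_b.split(",")):
            pairs.append((sys_a, sys_b, relation))
    return pairs


# Illustrative data only: one merged entry and one single-system entry.
ranked = [("sysA,sysB", 1), ("sysC", 2)]
pairwise = expand_rankings(ranked)
```

Here one judgment between two displayed items yields two system-level pairs, which is where the larger set of rankings comes from.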

cfedermann commented 9 years ago

@mjpost elegant solution; I didn't plan to do this deduping internally. Data will be rendered as-is.

Can you point me to your code for this?

cfedermann commented 9 years ago

Commit 00131810b5 addresses this during batch generation...

cfedermann commented 9 years ago

@mjpost I'm preparing sample files next; it would be nice if you could have a quick look when you get a chance...

mjpost commented 9 years ago

Yes, please send them. I likely won't have time till later in the day but will prioritize it.

cfedermann commented 9 years ago

Aloha @mjpost, mini batches are inside the repo (new wmt15data folder). They look good to me but I'm a little worn out by now ;)

Any feedback you might have is very welcome.

Cheers and best, Christian

-----Original Message-----
From: "Matt Post" notifications@github.com
Sent: 5/7/2015 5:15 AM
To: "cfedermann/Appraise" Appraise@noreply.github.com
Cc: "Christian Federmann" cfedermann@gmail.com
Subject: Re: [Appraise] Compare unique items not system outputs (#45)


cfedermann commented 9 years ago

@mjpost Batch 1 files have been added (wmt15data/full-batches folder).

I have checked that exporting data for "multi systems" generates the right CSV format, possibly spanning more than a single line. I add one or more PLACEHOLDER systems to make sure we end up with five systems per row. The corresponding rank is -1, so this does not have an effect on scoring. Will verify that soon...
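As a rough sketch of the padding idea (not the actual export code): assume each ranking is a list of `(system, rank)` tuples after the multi-system entries have been split out. The names `to_rows` and `scorable` are hypothetical; only the PLACEHOLDER name, the five systems per row, and the -1 rank come from the description above.

```python
PLACEHOLDER = "PLACEHOLDER"
SYSTEMS_PER_ROW = 5


def to_rows(entries):
    """Split (system, rank) tuples into rows of five, padding the last row.

    Padding entries use the PLACEHOLDER system name with rank -1.
    """
    rows = []
    for start in range(0, len(entries), SYSTEMS_PER_ROW):
        row = list(entries[start:start + SYSTEMS_PER_ROW])
        row += [(PLACEHOLDER, -1)] * (SYSTEMS_PER_ROW - len(row))
        rows.append(row)
    return rows


def scorable(row):
    """Drop padding entries so that they cannot affect scoring."""
    return [(system, rank) for system, rank in row if rank != -1]


# Illustrative data only: seven systems span two rows of five.
entries = [("sys%d" % i, i) for i in range(1, 8)]
rows = to_rows(entries)
```

In this example the second row holds two real systems and three placeholders, and `scorable` returns only the real ones.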

Let me know if you spot any issues with the data.

mjpost commented 9 years ago

Thanks. Can I get an invite token?