Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Indicate the contributors of audio #547

Closed tommy-3 closed 7 years ago

tommy-3 commented 9 years ago

Currently, there's no way to know who recorded which sentence. By indicating the contributors of audio, we'll be able to help learners interested in a specific accent and, above all, encourage more people to contribute audio.

This involves basically three things.

  1. Sentence -> username: On sentence pages, indicate the contributors of audio. The easiest way would be to change the alt text of the speaker icon to e.g. "Play audio recorded by TRANG".
  2. Username -> sentences: Make a page that lists the sentences recorded by a certain member and add a link to it on that member's profile.
  3. Download: Change the structure of the export file sentences_with_audio. Each line will look like "12345[tab]TRANG".

We could consider solving at the same time another issue #183 "allow multiple audio files per sentence". In that case:

  1. Sentence -> username:
     1.1. For a sentence with audio, show the number of recordings to the right of the speaker icon.
     1.2. When you click the speaker icon, one of the recordings (chosen randomly) is played.
     1.3. The alt text of the icon will be e.g. "Play audio #2 recorded by TRANG".
     1.4. When you click the number next to the speaker icon, a list of recordings appears below the sentence:

     **#1 2011-01-01 01:01:01 recorded by sysko [play icon]**
     **#2 2022-02-02 02:02:02 recorded by TRANG [play icon]**

  2. Username -> sentences:
     2.1. Make a page e.g. "Audio recorded by TRANG" and add a link to it on the profile of the member.
     2.2. The page will basically look like 1.4 above, but only the audio recorded by the user (TRANG in this case) is displayed:

     **C'était seulement il y a un an. [flag] [speaker icon] 2**
     **#2 2022-02-02 02:02:02 recorded by TRANG [play icon]**

     2.3. When you click the number next to the speaker icon, the other recordings will be displayed under those by TRANG.

  3. Download: The export file sentences_with_audio will look like "12345[tab]2[tab]TRANG", which means the audio #2 of the sentence #12345 was recorded by TRANG.
  4. Filenames: The filename of the audio #1 of the sentence #12345 will remain 12345.mp3. The filename of the audio #2 will be 12345_2.mp3.
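
To make the proposed export line and filename rules concrete, here is a minimal Python sketch. The names `AudioRecord`, `parse_export_line` and `audio_filename` are hypothetical and only illustrate the three-field format and the `_2` suffix described in the list above; nothing here reflects actual Tatoeba code.

```python
from typing import NamedTuple


class AudioRecord(NamedTuple):
    """One line of the proposed export (hypothetical type for illustration)."""
    sentence_id: int
    audio_number: int
    username: str


def parse_export_line(line: str) -> AudioRecord:
    """Parse a line such as '12345<TAB>2<TAB>TRANG' into its three fields."""
    sentence_id, audio_number, username = line.rstrip("\n").split("\t")
    return AudioRecord(int(sentence_id), int(audio_number), username)


def audio_filename(record: AudioRecord) -> str:
    """Audio #1 keeps the existing name; later recordings get a numeric suffix."""
    if record.audio_number == 1:
        return f"{record.sentence_id}.mp3"
    return f"{record.sentence_id}_{record.audio_number}.mp3"
```

With this rule, existing files such as 12345.mp3 would not need to be renamed when a second recording is added.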

I tried suggesting a concrete plan, but this is of course only one of many possibilities. I hope this will be the basis for further discussions.

alanfgh commented 9 years ago

I tried suggesting a concrete plan, but this is of course only one of many possibilities.

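You proposed a way to reflect the feature in the GUI (and the export file structure). This is very important, but actually formulating a plan would include figuring out such aspects as:

  • how we're going to read the tags from the MP3 files (a nontrivial task due to the number of files and the possibility that some of them were recorded with different versions of tags)
  • where we're going to store that information
  • how much effort is involved in changing the database and code to enable this functionality
  • who will do the work
  • how it will affect other work being done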

Naturally, this is true for any task we do, but I think it's worth mentioning the need for project management here because this is not a small project.

allan-simon commented 9 years ago

@alanfgh, reading the metadata etc. is a problem already solved by the Shtooka project. I don't know if they can easily be contacted now, but I guess the best way would be to "outsource" that to them; otherwise we will just reinvent the wheel.

We would certainly host the audio on their website, associate a 'tatoeba_id' metadata field with each recording, and get their information through a simple API call:

shtooka.test/tatoeba/sentence/{id}/audios

which will return the URL to the audio(s) and the meta for them

Of course, this works only if we can contact them and propose a close collaboration.

tommy-3 commented 9 years ago

@alanfgh Maybe I should have written this on the Wall. I have no idea about how it should work internally. May I assign you (or someone else) to this issue?

jiru commented 9 years ago

@tommy-3 I think Github is the right place for technical discussions. We don’t seriously use the Github task assignation thing so don’t bother.

@alanfgh I'll try to answer each of your points.

how we're going to read the tags from the MP3 files (a nontrivial task due to the number of files and the possibility that some of them were recorded with different versions of tags)

We will need the metadata to be stored in the database if we want to be able to do something with it at a reasonable speed (including reading it). Which means any tag would just duplicate information, which complicates the problem. Finally, reading metadata from the audio files is, as you said, quite a tedious task. Because of all this, I am strongly against using tags in the first place. Only storing metadata in the database is way simpler and enough to solve this ticket.

where we're going to store that information

How about the database?

how much effort is involved in changing the database and code to enable this functionality

I don’t think such a thing involves much effort. I’d go with removing the has_audio field on the sentences table and adding an audio table with foreign keys to users and sentences.
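
As a rough illustration of that schema idea, here is a self-contained sketch using Python's built-in `sqlite3` module. SQLite, the column names, and the helper functions are all assumptions made for the example; the real Tatoeba database and its table layout may differ. The point is only that a separate audio table can answer both "who recorded this sentence?" and "what did this user record?", and that the old has_audio flag becomes derivable.

```python
import sqlite3

# Hypothetical, simplified schema: only the columns needed for the example.
SCHEMA = """
CREATE TABLE users (
    id       INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE
);
CREATE TABLE sentences (
    id   INTEGER PRIMARY KEY,
    lang TEXT NOT NULL,
    text TEXT NOT NULL
);
CREATE TABLE audio (
    id          INTEGER PRIMARY KEY,
    sentence_id INTEGER NOT NULL REFERENCES sentences(id),
    user_id     INTEGER REFERENCES users(id)  -- NULL while the contributor is unknown
);
"""


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def sentences_recorded_by(conn: sqlite3.Connection, username: str):
    """List (sentence id, text) pairs for which the given user recorded audio."""
    rows = conn.execute(
        """
        SELECT s.id, s.text
        FROM audio AS a
        JOIN sentences AS s ON s.id = a.sentence_id
        JOIN users AS u ON u.id = a.user_id
        WHERE u.username = ?
        ORDER BY s.id
        """,
        (username,),
    )
    return rows.fetchall()


def has_audio(conn: sqlite3.Connection, sentence_id: int) -> bool:
    """Replacement for the has_audio column: derived from the audio table."""
    row = conn.execute(
        "SELECT EXISTS(SELECT 1 FROM audio WHERE sentence_id = ?)",
        (sentence_id,),
    ).fetchone()
    return bool(row[0])
```

Because audio rows are independent of the sentence row, the same layout would also allow several recordings per sentence later on.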

who will do the work

I may help.

how it will affect other work being done

To my knowledge, the only other ongoing work at the moment is #77 (editable transcripts) and I don’t think it would interfere.

jiru commented 9 years ago

@allan-simon Can you elaborate on that 'outsourcing' plan? Apart from saving disk space and bandwidth, what are the advantages of hosting the audio files at Shtooka? How are we supposed to make use of the metadata on the Tatoeba website (for instance displaying the records belonging to a given user) if it’s stored on Shtooka?

allan-simon commented 9 years ago

You can't base your metadata on the assumption that it matches one-to-one to a Tatoeba user.

The idea was that right now we're reinventing the Shtooka online project, i.e. storing sounds and metadata about audio recordings. Since there's already a project that does that, redoing it would be duplicated effort. It would also give our sound data a wider reach (for example, Shtooka already creates packages of recordings that can then be used by their SWAC tools).

If we do that, we can then ask them to let their Shtooka recorder upload directly to their servers (which would make a lot of sense for them). As a side effect, it would make the step of uploading audio ourselves and updating the database manually easier.

alanfgh commented 9 years ago

I want to make sure that we clarify the scope of what we're taking on and which particular "pain points" we're addressing. I also want to make sure that we keep in mind throughout this discussion that we need to deal with existing audio, not just audio to be added in the future.

Pain points: (1) There is no obvious way for users to determine the contributor of an audio file, from which they could deduce the speaker's accent (from the speaker's profile, for instance). (Maybe there's an unobvious way, such as reading the tags from the downloaded audio files using a separate utility?)

(2) There is no obvious way for users to determine the speaker's accent from other explicit metadata associated with the audio file.

(3) Developers cannot currently distribute our audio under a CC-BY license because when we obtained it from the contributors, we didn't ask them for permission to redistribute it.

(4) If we want to address (3), we need to:
    (a) identify the contributors of existing audio
    (b) contact them for permission to redistribute their audio
    (c) obtain permission from contributors of new audio to have it redistributed

(5) If we want to address (4a), we need to be able to read contributors' information from existing MP3 files. As far as I know from several hours' worth of research and experimentation, this is a nontrivial task because:
    (a) the format of MP3 files has changed over time (though I don't know much about this, or whether this is a factor with our files)
    (b) there doesn't seem to be an off-the-shelf tool that does this; the command-line tool that I experimented with, id3v2, produces output that needs to be parsed, maybe not the biggest task, but it requires some amount of experimentation to see what variation there is in the output

(6) Audio is nontrivial for individuals to contribute because:
    (a) there's nothing in the interface that lets them do it directly via Tatoeba
    (b) they need a decent setup (microphone, etc.)
    (c) they need to produce audio of decent quality, which requires judgment
    (d) under our current workflow, they need to transmit it to CK

(7) Audio is nontrivial for CK to add because he needs to:
    (a) listen to it and evaluate it
    (b) process it (though I don't know all the details of what he does)
    (c) zip it up
    (d) upload it to an intermediate directory on the server
    but automating these steps could be dangerous unless substantial programming work went into making it foolproof.

(8) Audio is nontrivial for me to add because I need to:
    (a) ssh over to the server
    (b) move and unzip the files
    (c) formulate a script command based on the language and name of the directory to which the files have been unzipped
    (d) execute the command
    but automating these steps could be dangerous unless substantial programming work went into making it foolproof, which might have to go beyond the errors for which my script already checks: missing files, missing directories, missing database entries, incorrect file permissions.

(9) If a sentence contains audio, any change to its text (such as a correction, if the original text is not good) or deletion of the sentence will cause a mismatch between audio files and the database. I've written a script to find and repair these mismatches, but it involves some manual work. Trang has suggested making sentences with audio non-editable and non-deletable except by admins (#524), but it would also be possible to manage audio more actively (that is, files as well as database entries) via the interface.

(10) The fact that we currently allow at most one audio file per sentence means that:
    (a) sentences whose written form corresponds to multiple possible spoken forms (common in Hebrew, for instance) can only have one of those forms represented by audio
    (b) users have no ability to compare and contrast speakers with different genders, accents, etc.

A proper analysis would involve consideration of the importance of these various items. I'm not going to go into that in this comment, but I do want to say that we should seriously consider which of these issues we want to address, and which ones to address together vs. separately, and in which order, and how they might impact other ongoing work in terms of our resources (even if, as @jiru has said, there isn't much impact in terms of touching the same area of code that other tasks might).

In terms of who will do the work: I myself am willing to take on smaller well-defined tasks, but not the bulk of them, and not overall responsibility. @jiru has said that he may help. Since @trang has said that audio is not high priority for her at this point, we might infer that she wouldn't be able to invest much time in this. Therefore, it doesn't seem like we have the resources to make it happen at this time, though that doesn't rule out discussion, planning, and infrastructure work that could make it happen in the future.

Please keep in mind that simply keeping up with the discussion takes a non-trivial amount of time, and we need to be sure that key people such as @trang (and CK) are not getting left behind. It's a holiday season (at least in the West), which might mean that they are not able to keep up with this at the moment.

I do want to say that if the goal is to increase the number of people contributing audio, then we need to be sure that we're not simply increasing the pressure at the bottlenecks. This also holds true for changes in file format and where the files are stored, as well as changes to the code and database.

jiru commented 9 years ago

@alanfgh Thank you for that thoughtful comment.

About (5), some records just don’t have any tags, see for example #333173. From some random records I checked, some are tagged but nothing is consistent. It’s just a courtesy of contributors and/or admins. While tags may help tracing back original contributors, we can’t base our research on that only. I’m familiar with ID3 tags so I may help extracting relevant information from existing files. @alanfgh and @trang, do you think it’s feasible to trace back the contributors of existing audio by finding the original mail they sent to team@tatoeba.org or something like this? If we consider the problem by the number of record batches we got (instead of the number of records), does it boil down to a reasonable number of requests to investigate?
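
As a first triage step for the tag question, a small script could report which files carry any ID3 tag at all and in which version. The sketch below relies only on two well-known layout facts: an ID3v2 tag starts the file with the bytes "ID3" followed by a major and a revision version byte, and an ID3v1 tag occupies the last 128 bytes and starts with "TAG". The function names are hypothetical, and the sketch does not read the contributor fields themselves.

```python
import io
from typing import BinaryIO, Optional


def id3v2_version(stream: BinaryIO) -> Optional[str]:
    """Return e.g. '2.3.0' if the stream starts with an ID3v2 header, else None."""
    stream.seek(0)
    header = stream.read(10)  # an ID3v2 header is always 10 bytes long
    if len(header) < 10 or header[:3] != b"ID3":
        return None
    major, revision = header[3], header[4]
    return f"2.{major}.{revision}"


def has_id3v1(stream: BinaryIO) -> bool:
    """Return True if the last 128 bytes of the stream start with 'TAG'."""
    stream.seek(0, io.SEEK_END)
    if stream.tell() < 128:
        return False
    stream.seek(-128, io.SEEK_END)
    return stream.read(3) == b"TAG"


def describe_tags(stream: BinaryIO) -> dict:
    """Summarise which tag flavours are present in one file."""
    return {"id3v2": id3v2_version(stream), "id3v1": has_id3v1(stream)}
```

Running `describe_tags(open(path, "rb"))` over each batch would give a quick count of untagged files and of the tag versions in use before any effort goes into parsing individual fields.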

About the order of these tasks, I think we can derive it from their dependencies, as follows (the numbers don’t indicate strict order, it’s just a simple way to refer to each point):

  ① identify existing audio contributors
  ② indicate the contributor of existing and future audio (requires ①)
  ③ indicate the speaker’s accent of records (requires ②)
  ④ contact existing audio contributors to ask permission to redistribute under CC-BY (requires ①)
  ⑤ be able to redistribute audio under CC-BY (requires ④)
  ⑥ contributed audio is non-trivial to install for administrators and only CK and Alan may do it (requires changes in terms of database, file etc. by ② to be completed)
  ⑦ audio is non-trivial to contribute for users (solving ⑥ should help a bit)
  ⑧ avoid mismatches between audio and text
  ⑨ allow multiple audio per sentence (requires changes in terms of database, file etc. by ② to be completed)

In my opinion, while solving ② may increase the number of audio contributions a little bit, only solving ⑦ will bring a substantial increase. So I don’t expect an increase of pressure on admins in charge of installing audio records.

So I think we should start with ① and ②.
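
The dependency reasoning above can be checked mechanically. The following sketch encodes the "requires" relations from the task list and uses Python's standard `graphlib` module (Python 3.9+) to produce one valid work order. Treating "solving ⑥ should help a bit" as a hard dependency of ⑦ is an assumption made for the example, and the helper names are hypothetical.

```python
from graphlib import TopologicalSorter

# Task number -> set of tasks it requires, taken from the list above.
# The 7 -> 6 edge is an assumption: the text only says that 6 "should help a bit".
REQUIRES = {
    1: set(),
    2: {1},
    3: {2},
    4: {1},
    5: {4},
    6: {2},
    7: {6},
    8: set(),
    9: {2},
}


def work_order(requires):
    """Return one ordering in which every task comes after its prerequisites."""
    return list(TopologicalSorter(requires).static_order())


def ready_now(requires, done=frozenset()):
    """Return the tasks that are not done yet and whose prerequisites are all done."""
    finished = set(done)
    return sorted(
        task for task, deps in requires.items()
        if task not in finished and deps <= finished
    )
```

Under these assumptions, only ① and ⑧ have no prerequisites, and ② becomes available as soon as ① is done, which is consistent with starting on ① and ②.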

allan-simon commented 9 years ago

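About (3):

For all audio acquired by means of the Shtooka recorder, the recorder explicitly asks you for a license when you start recording. This includes at least:

  • my audio (most of the French ones)
  • Chinese and Shanghainese (Fu Congcong)
  • English (CK and our other friend whose name is on the tip of my tongue); CK has already put the license somewhere

I think it's the same for Spanish, but I'm less sure.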

alanfgh commented 9 years ago

That's worth keeping in mind. To elaborate, there is a field "License" on the GUI that gives the user a choice between Creative Commons BY and Creative Commons BY-CA. (Note that the second is not mentioned on the Creative Commons website. Perhaps Creative Commons BY-SA is meant?) It would be nice if the GUI linked to the Creative Commons website. I don't remember how the GUI looked when I freshly installed it, but I believe there's a default value (rather than "Choose one"), so I believe the user is not forced to choose one of the options and could remain unaware of its existence. That doesn't bother me, but it might bother some people.

Do we know of any people who contributed audio that was not recorded via Shtooka?

On Fri, Dec 26, 2014 at 8:46 AM, Allan Simon notifications@github.com wrote:

about 3)

all audio acquired through the mean of shtooka recorder explicitly ask you for a license when you start recording which includes at least

  • my audios (most of french ones)
  • Chinese and SHanghainese (Fu Congcong)
  • English (CK and our other friend of which I have the name on the tongue) , CK has already put the license somewhere

I think it's the same for spanish but less sure.


trang commented 9 years ago

@trang, do you think it’s feasible to trace back the contributors of existing audio by finding the original mail they sent to team@tatoeba.org or something like this?

Probably time-consuming but possible. I can export all the emails that I've got that were sent to team@tatoeba.org and contain "audio" in the subject. There seem to be about 100 emails. Then someone can go through these emails and try to contact people who likely have contributed audio.

tommy-3 commented 9 years ago

Then someone can go through these emails and try to contact people who likely have contributed audio.

I can do this if I may.

trang commented 9 years ago

@tommy-3, I've just sent you the emails.

jiru commented 9 years ago

@trang can you make the mails available to us too?

@tommy-3 can you use like a shared google document or similar so that we may help out contacting people and gathering information?

trang commented 9 years ago

@jiru, done.

tommy-3 commented 9 years ago

I identified most of the voices, but there are still some things I'm not sure about. @allan-simon, could you reply to the email I sent to team@tatoeba.org?

jiru commented 9 years ago

Any news regarding this?

tommy-3 commented 9 years ago

I uploaded what I've found out so far:

http://hi1811.agilityhoster.com/audio1.csv (sentences recorded by CK)
http://hi1811.agilityhoster.com/audio2.csv (contributors of other audio)

Some voices were identified based on the ID3 tags or the text files that CK makes before uploading files. I'm almost 100% sure about these. Some others were identified based on wall posts and information provided by Trang and sysko. Sometimes I had to guess a bit. See https://docs.google.com/document/d/1lTBIpguhhGxgz6926pqpwZiIgQeCmlVaCiAXDlMpAyw/edit?usp=sharing for details.

There still remain four contributors who haven't been identified. I'm waiting for @allan-simon to reply. I'll copy the four questions here:

(fra) (1) 199 French recordings were added on 2012-01-14. Most of them are sentences owned by sacredceltic. Who recorded them? Trang says it might be Jean-Rémy Duboc. http://tatoeba.org/user/profile/jrduboc Sample: http://audio.tatoeba.org/sentences/fra/1337232.mp3

(nld) (2) 1026 Dutch recordings were added on 2010-12-05. They are all sentences owned by Dorenda. Who recorded them? Sample: http://audio.tatoeba.org/sentences/nld/377450.mp3

(por) (3) 94 Portuguese recordings were added on 2011-01-19. They are all sentences owned by brauliobezerra. Were they recorded by brauliobezerra? Sample: http://audio.tatoeba.org/sentences/por/579858.mp3

(These are the oldest Portuguese audio files. I know from sysko's Wall message #7801 on 2011-09-14 that he's from Brazil. I believe all the other Portuguese recordings are by alexmarcelo, added between 2011-09-21 and 2011-10-23.)

(spa) (4) 103 Spanish recordings were added on 2012-01-16. They are all sentences owned by tatoerique. Were they recorded by tatoerique? Sample: http://audio.tatoeba.org/sentences/spa/11030.mp3

There are three more small issues.

(5) Pongprapunt, the contributor of Thai audio, doesn't seem to have a Tatoeba account. I think I'll ask him to make one.

(6) I think I'll ask Inego's wife to make a Tatoeba account.

(7) How should we deal with Barack Obama's voice?
http://audio.tatoeba.org/sentences/eng/896662.mp3
http://audio.tatoeba.org/sentences/eng/897923.mp3
http://audio.tatoeba.org/sentences/eng/897924.mp3
http://audio.tatoeba.org/sentences/eng/897925.mp3
http://audio.tatoeba.org/sentences/eng/897909.mp3

Trang said (in an email on January 23) that we'll let each user choose a license. I think we can provide two options we recommend: Creative Commons Attribution (CC BY) 4.0 International License and Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 International License. (The Creative Commons recommends that we use a version 4.0 international license. https://wiki.creativecommons.org/Frequently_Asked_Questions#Why_should_I_use_the_latest_version_of_the_Creative_Commons_licenses.3F https://wiki.creativecommons.org/Frequently_Asked_Questions#Should_I_choose_an_international_license_or_a_ported_license.3F) Users can also choose any other license of any version if they want. They can also choose not to license the files at all, which means only the visitors of Tatoeba.org can listen to them.

And then the license of each file should be shown everywhere near the speaker icon, and we'll also need a list of audio contributors and the licenses on the download page.

I think we should make the new terms of use before contacting the members. This doesn't seem to belong to the current issue, so I wrote about it here. https://groups.google.com/forum/#!topic/tatoebaproject/CF1YcspnWes

ckjpn commented 9 years ago

This is listed as "effort:high." However, if you could figure out a way to display my private "audio" lists on the pages the same way you have collaborative lists showing, then members could easily see this information. This might not be the best way to do it, but it might be a fast temporary fix.

What I mean is all the lists of mine that begin with "audio - ".

The search doesn't narrow it down, since it's not case-sensitive, and ignores the hyphen. This link jumps you to the 2nd page, so you'll see which lists I'm talking about. https://tatoeba.org/eng/sentences_lists/of_user/CK/audio%20-/page:2

RyckRichards commented 9 years ago

Requested by CK on Tatoeba Day 1.

ckjpn commented 8 years ago

As mentioned on Jul 25, 2015, this would be a relatively easy thing to do, if you could somehow display my lists of audio contributors. Currently, all private lists are hidden to everyone except the owner of the lists. Since I'm the owner of these lists, I can easily see who contributed each audio file. If you could include these list names on each page so they are viewable by everyone, then this issue could be closed.

trang commented 8 years ago

this would be a relatively easy thing to do, if you could somehow display my lists of audio contributors.

When #677 is released to production, you will be able to set your audio lists as publicly viewable.

trang commented 8 years ago

When #677 is released to production, you will be able to set your audio lists as publicly viewable.

I'm rectifying this statement. Setting your lists to "public" will not be a solution. While implementing the private lists feature, I decided to remove the display of lists on the sentences page except for lists that belong to the user.

As a temporary solution for users to see who contributed each audio file, you could consider adding tags ("audio by {user}").

ckjpn commented 8 years ago

As a temporary solution for users to see who contributed each audio file, you could consider adding tags ("audio by {user}").

Is this something you want me to do?

I can write a script that will do that tagging via the web interface.

Perhaps you could take my audio lists and do that tagging faster in the database. I could then just add tags using the same script I use to update the lists every time new audio files are added.

Or, we could just wait until you come up with another solution for letting people know who recorded the audio on a sentence's page. That's fine with me, too. At this point, people like me who may need the information for other projects can easily get the sentence numbers in the exported data from my lists. The list titles clearly give the licensing information and attribution URL for the ones that are available for use off tatoeba.org. (Perhaps, if you add tags, the tag names should include this, too. However, that might make the tag names too long.)

You can use my lists in the process of developing a new system when you have time to develop one.

trang commented 8 years ago

Is this something you want me to do?

For me it doesn't change anything. I think the lists are enough.

I can't speak for other users but so far nobody requested this.

ckjpn commented 8 years ago

I'll plan to leave things as they are. I think for now lists are enough. It's enough for me anyway.

Tommy requested to know who recorded sentences. I assume he means that when he's on a sentence's page, he wants to know who recorded it.

tommy-3 commented 8 years ago

I think it should be easier for users to know who contributed the audio in order to motivate people to add more audio.

Contributors of sentences are motivated by the reactions we get from other users. We get questions about our sentences, we learn about our mistakes and we get our sentences translated. Users might visit our personal websites or even send us fan letters. That's all because other users can easily know who wrote each sentence.

Contributors of audio would rarely get any of these because there's little chance that users find out who recorded each sentence. Perhaps a lot of users assume that it's the voice of the owner of the sentence. Even if they know that's not the case, when they have something to say about audio, it's most likely that they write comments on sentence pages. In most cases, the owners respond to them and the contributors of audio don't even know their audio is being discussed. They're virtually placed outside the community of Tatoeba.

When I opened this issue, I was trying to find someone who would record our Japanese sentences. I had to tell them that the website was not at all nice to contributors of audio, that we didn't credit them on sentence pages. Naturally, most of them didn't want to take part in such a project.

ckjpn commented 8 years ago

I agree with a lot of what Tommy says here.

It would be nice to have some way to do the following.

  1. Show who recorded the audio on each sentence's page. Minimally, this could just be a mouse-over like the tags are being done now, but maybe you'd want to make it more obvious.
  2. Show the number of audio files contributed by each member in their profile, the same way the number of sentences is listed.
  3. Be able to list all sentences with audio by a member from the profile in the same way we can list all sentences by a member.
  4. Have the name of the contributor of each audio file included in the exported data. It could be the second "field" of sentences_with_audio.tar.

I think all of these ideas fit in with the overall design of the Tatoeba Project.

jiru commented 7 years ago

I’m working on this.

ckjpn commented 7 years ago

I wonder if it would be possible to also create something like the following to go along with this.

http://aitech.ac.jp/~iteslj/a4esl/temporary/tatoeba/lists/audio.html

[Screenshot, 2016-12-15 00:11:39]

jiru commented 7 years ago

I made some good progress on this.

@ckjpn I will be using your audio lists to fill the database with audio authors. I copied the data (list ids, authors and licence) to a script that will be executed when this feature is installed on the production server. So, please keep me updated with any modification you make to your lists, i.e. any new list, and any licence or author update.

@tommy-3’s CSV files disappeared, but I wonder how much of your findings are already integrated into CK’s lists?

@ckjpn I performed a few sanity checks on your lists.

But these inconsistencies do not prevent this feature from being implemented. Recordings for which we don’t know the author will simply display like now. We can add the actual authors later.

Note that, as a first step, I will stick to displaying the audio contributors only (on the tooltip of the play button and profile pages), while laying the foundations for other fancy stuff in the future like multiple audios, speaker’s origin, the page ckjpn just suggested etc.

@ckjpn I have a few other questions.

ckjpn commented 7 years ago

Perhaps eventually, something like this could be done to identify the person who recorded the audio: http://www.manythings.org/audiosentences/rus/171.html

ckjpn commented 7 years ago

@tommy-3’s CSV files disappeared, but I wonder how much of your findings are already integrated into CK’s lists?

I used his data to help me do my lists, so I don't think anything got lost.

ckjpn commented 7 years ago

I copied the data (list ids, authors and licence)

Please include the country/dialect data, too. For some languages, it may not matter so much, but for others it does.

trang commented 7 years ago

Many of your list titles mention no license for offsite use. I’m looking for a short identifier for this, just like we have CC BY 4.0 for Creative Commons Attribution 4.0 International. Any idea?

I would just set the license to null in this case. null would mean:

  1. The contributor of the audio has not given any information about who else can reuse the audio, in which case we consider that the audio is only for Tatoeba.
  2. The contributor has specifically said they want the audio to be used only on Tatoeba. Which is the same result as above.

If we really need to make a distinction between case 1 and case 2, then:

  1. no info on license = null
  2. audio only for tatoeba = tatoeba

But I can't think of a situation where we would need to make this distinction.
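
To illustrate how the two schemes above would behave in code, here is a minimal sketch. The constant `TATOEBA_ONLY` and both function names are hypothetical; the only rule taken from the comment is that a null license, and the optional "tatoeba" marker, both mean the audio is for use on Tatoeba only.

```python
from typing import Optional

# Optional marker from the second scheme above; under the first scheme it is simply never stored.
TATOEBA_ONLY = "tatoeba"


def can_reuse_offsite(license_id: Optional[str]) -> bool:
    """Audio may be reused elsewhere only if an explicit license was granted."""
    return license_id is not None and license_id != TATOEBA_ONLY


def license_label(license_id: Optional[str]) -> str:
    """Text that could be shown next to a recording (wording is illustrative)."""
    if license_id is None:
        return "No license information (for use on Tatoeba only)"
    if license_id == TATOEBA_ONLY:
        return "For use on Tatoeba only"
    return license_id
```

Both cases give the same answer to "may this be reused off-site?", which matches the point that the distinction rarely matters in practice.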