Marc-Bogonovich / Openwords

This is our first Openwords repository on GitHub

Wiktionary mining refinement #22

Open Marc-Bogonovich opened 10 years ago

Marc-Bogonovich commented 10 years ago

Archie Zhao has already successfully mined the Wiktionary. I believe you are already following up on the goals below. Please keep me updated on progress through this issue.

1) Refine the mining solution to a) gather as much additional information essential for 1.0 LMs

hanaldo commented 10 years ago

It seems there is a way of getting our first piece of Word Audio cake.

First, the audio files on Wiktionary are really bad, so we should just discard them.

Second, the most reasonable way to get word audio is still through TTS. The only additional thing we need to do is save those TTS streams as files on our server so they can be served from cache later. I compared a lot of TTS engines, and Google still seems to have the best one (I mean it sounds good).

So, anything you type in at https://translate.google.com that can be pronounced, we can download as an audio file. Of course Google's TTS does not cover all languages, and there are quality defects in some languages such as Italian (in my own judgement).

Here are the languages I have tested on Google Translate for TTS. It should support more; I just haven't had time to test them yet: English, Chinese (Simplified), French, Italian, Spanish, Japanese, Portuguese, Korean.

Normally, downloading 1,000 word audio files and saving them to our server takes about 2 minutes. I think this is an acceptable solution for now. Let's talk more about it at the next meeting.
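
A minimal sketch of the "save the TTS stream as a file for later cache" idea described above. This is an illustration under stated assumptions, not the mining program itself: `fetch_tts` is a hypothetical callable standing in for whatever actually requests the audio, and the `.mp3` extension and the per-language directory layout are assumptions as well.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_path(cache_dir: Path, lang: str, word: str) -> Path:
    # Hash the word so that any script (Chinese, Korean, ...) maps to a
    # file name that is safe on every file system.
    digest = hashlib.sha1(word.encode("utf-8")).hexdigest()
    return cache_dir / lang / f"{digest}.mp3"  # ".mp3" is an assumption


def get_audio(cache_dir: Path, lang: str, word: str,
              fetch_tts: Callable[[str, str], bytes]) -> Path:
    """Return the cached audio file for a word, fetching it once if missing."""
    path = cache_path(cache_dir, lang, word)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        # fetch_tts is hypothetical: it takes (lang, word) and returns audio bytes.
        path.write_bytes(fetch_tts(lang, word))
    return path
```

With this shape, the TTS service is contacted at most once per word; every later request is served from the file on our own server.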

Marc-Bogonovich commented 10 years ago

Shenshen, after discussing this issue in person, this appears to be a very good way to proceed. Thank you for developing this strategy. We had originally abandoned this path, but after the conversation with you, Guan, and Archie, I think we can proceed with it.

This TTS -> audio file strategy will cover major languages while we work on the workflow process for creating Audio files ourselves.

Here is an overview of the decisions we made yesterday (Saturday 2014-09-06).

  1. TTS -> Audio files should be created for languages covered by a good TTS engine, such as the one available at Google Translate's public site.
  2. Regarding LMs. When only a TTS-derived audio is available for a word, it should be downloaded. When both a human voice and a TTS-derived audio are available, the human voice should be downloaded. When neither is available, the word has no audio.
  3. The front-end behaviors must adapt to the new situation. These include the Hearing module and any module that has an audio icon. Here are the behaviors: if there is neither a human audio nor a TTS-derived audio for any of the words, then one should not enter the Hearing LM. Additionally, words within other LMs that lack audio files can still be added to plates, but their audio icon should be the lighter gray variety.
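
The audio rules in the list above could be sketched roughly as follows. All names here (`WordAudio`, `pick_audio`, `icon_style`, `can_enter_hearing`) are hypothetical, and the Hearing rule is read as "at least one word must have some audio", which is one interpretation of the decision rather than a confirmed spec.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class WordAudio:
    human: Optional[str] = None  # path to a human recording, if any
    tts: Optional[str] = None    # path to a TTS-derived file, if any


def pick_audio(audio: WordAudio) -> Optional[str]:
    # Human voice wins over TTS; None means the word has no audio at all.
    return audio.human or audio.tts


def icon_style(audio: WordAudio) -> str:
    # Words without audio can still sit on a plate, with a lighter gray icon.
    return "normal" if pick_audio(audio) else "light_gray"


def can_enter_hearing(words: Iterable[WordAudio]) -> bool:
    # Assumption: the Hearing LM is blocked only when no word has any audio.
    return any(pick_audio(w) is not None for w in words)
```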

This covers everything as far as I can tell.

One last note: I'm not sure I understood what you meant by "Word Audio Cake".

hanaldo commented 10 years ago

Hi Marc, the "Cake" just means some free data that everyone wants a piece of. I think it's just jargon among cyber-hackers ^_^

By the way, I don't remember us abandoning this path previously; I thought we abandoned the Android TTS, not the Google TTS. Anyway, an update on the mining progress: I can't request the Google TTS service too frequently, otherwise they block my IP. So I need to slow down my mining program, and at the current rate downloading 18,000 word audio files takes about 25 hours.
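
For reference, 18,000 downloads in about 25 hours works out to one request every 5 seconds. A minimal sketch of that kind of slowdown is below; `throttled_map` and its parameters are hypothetical names for illustration, and the actual mining program may pace its requests differently.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def seconds_per_request(total_requests: int, total_hours: float) -> float:
    # Average spacing between requests for a given overall budget.
    return total_hours * 3600 / total_requests


def throttled_map(items: Iterable[T], action: Callable[[T], R],
                  min_interval: float,
                  sleep: Callable[[float], None] = time.sleep,
                  clock: Callable[[], float] = time.monotonic) -> List[R]:
    """Apply action to each item, starting calls at least min_interval seconds apart."""
    results: List[R] = []
    last_start = None
    for item in items:
        if last_start is not None:
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(action(item))
    return results


print(seconds_per_request(18000, 25))  # 5.0
```

The `sleep` and `clock` parameters exist only so the pacing can be tested without actually waiting.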

Marc-Bogonovich commented 10 years ago

Hanaldo, yes, you are correct. This summer we did abandon the Android TTS and not the Google TTS. Earlier (summer 2013), though, we had looked at Google as a source of audio but decided against it. The reasons were ownership questions and the fact that we were thinking about calling the website directly rather than downloading ahead of time as you have suggested (which would have led to UX problems). Your plan to download ahead of time would circumvent those UX problems.

Nonetheless, the fact that Google blocks your IP may be an indication of their legal opinion/position on how public the TTS results on their public site are. That is concerning, but I'm encouraged by your strategy overall. Who knows, whatever Google's position on their data is, Google may not be correct: if this situation were analogous to a book, it would be public information.

The rate you are downloading at is a good rate.

hanaldo commented 10 years ago

One suggestion about inputting data into our running/serving database: records in the "languages" and "words" tables should not be updated or deleted once they are in there, as any modification may lead to a large chain of follow-on modifications.
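
One way such a rule could be enforced at the database level is with triggers that reject updates and deletes. The sketch below uses SQLite purely as a stand-in (the engine behind our serving database may differ), and the column names are assumptions made for illustration.

```python
import sqlite3

# Hypothetical minimal schema: only the table names come from the discussion.
SCHEMA = """
CREATE TABLE languages (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE words (
    id INTEGER PRIMARY KEY,
    language_id INTEGER NOT NULL REFERENCES languages(id),
    word TEXT NOT NULL
);
CREATE TRIGGER languages_no_update BEFORE UPDATE ON languages
BEGIN SELECT RAISE(ABORT, 'languages is append-only'); END;
CREATE TRIGGER languages_no_delete BEFORE DELETE ON languages
BEGIN SELECT RAISE(ABORT, 'languages is append-only'); END;
CREATE TRIGGER words_no_update BEFORE UPDATE ON words
BEGIN SELECT RAISE(ABORT, 'words is append-only'); END;
CREATE TRIGGER words_no_delete BEFORE DELETE ON words
BEGIN SELECT RAISE(ABORT, 'words is append-only'); END;
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create a database whose languages/words tables accept inserts only."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Inserts work as usual, while any `UPDATE` or `DELETE` on either table raises a `sqlite3.DatabaseError`, so ids referenced elsewhere never change underneath us.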

hanaldo commented 10 years ago

Completed Audios:

Chinese

Spanish

Japanese

German

Portuguese

Italian

Korean

Google TTS Not Supported:

Farsi

Marc-Bogonovich commented 10 years ago

Awesome. I will provide the language priorities. The first priority is languages for which we have groups that are "potential customers"; the second priority is languages with groups that are "potential alpha testers".

Potential customers: We have connections with the Chinese Flagship program and Swahili Flagship program. When we get the Swahili words in the db, we can add that.

Potential alpha testers (not including those languages you've mined above): Farsi, Ukrainian, Portuguese, Italian, Korean, Deutsch (German), French