jshamble closed this issue 2 years ago
Hi @jshamble,
Thanks for pointing out this bug. The data file dedicated to audios has indeed been modified recently. I plan to make the necessary changes in tatoebatools as soon as possible.
Note that this issue is related to https://github.com/Tatoeba/tatoeba2/pull/2957
@jshamble The audio ID attribute should be available in the new v0.2.1 release of tatoebatools. Can you confirm?
@LBeaudoux Yes, I just tested it and it's working great. Many thanks for the quick work on this one!
@LBeaudoux
Looking up the audio ID by sentence ID turns out to be unusably slow: the code below has been running for a few hours. This happens when iterating over the parallel corpus and using sentence_id as the key. I assume the cause is O(n^2) behavior, since the list comprehension rescans every audio entry for each sentence pair, which amounts to a nested for loop. Here is some sample Python code using German, a language with a lot of audio samples (assuming tatoebatools is imported):
lang = "deu"
audioSentences = tatoeba.sentences_with_audio(lang)
for sentence, translation in ParallelCorpus("eng", lang):
    audioIdArray = [sentenceWithAudio.audio_id for sentenceWithAudio in audioSentences if sentenceWithAudio.sentence_id == translation.sentence_id]
I can think of two possible solutions, but I would like to hear your ideas first, and possibly modify tatoebatools to support this operation more easily.
Is there any way to replace the list comprehension above with O(1) access by sentence_id, similar to a hashmap?
Maybe you could try:
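from collections import defaultdict
from tatoebatools import tatoeba, ParallelCorpus
lang = "deu"
# Build the index once: sentence_id -> list of audio_ids
lang_audios = defaultdict(list)
for audio in tatoeba.sentences_with_audio(lang):
    lang_audios[audio.sentence_id].append(audio.audio_id)
# Each lookup is now a dict access instead of a scan over all audios
for sentence, translation in ParallelCorpus("eng", lang):
    # .get() returns [] for sentences without audio and adds no new keys
    audioIdArray = lang_audios.get(translation.sentence_id, [])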
Yeah, that works. This would probably be useful to add to the README (e.g. a section on using ParallelCorpus with other tables efficiently).
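A README example along these lines could wrap the lookup in a small helper. The sketch below is only an illustration: `index_by_sentence_id` is a hypothetical helper, not part of tatoebatools, and the `Audio` namedtuple stands in for the objects yielded by `tatoeba.sentences_with_audio` so that the snippet runs without downloading any data.

```python
from collections import defaultdict, namedtuple

# Stand-in for the rows yielded by tatoeba.sentences_with_audio(lang);
# only the two attributes used in this thread are modeled.
Audio = namedtuple("Audio", ["sentence_id", "audio_id"])


def index_by_sentence_id(rows, value_attr):
    """Group one attribute of each row under the row's sentence_id.

    Hypothetical helper: the index is built in a single pass, so later
    lookups are dict accesses instead of scans over every row.
    """
    index = defaultdict(list)
    for row in rows:
        index[row.sentence_id].append(getattr(row, value_attr))
    # Return a plain dict so that lookups never create new keys.
    return dict(index)


# Made-up sample data: sentence 1 has two recordings, sentence 2 has one.
audios = [Audio(1, 101), Audio(2, 201), Audio(1, 102)]
audio_ids = index_by_sentence_id(audios, "audio_id")

print(audio_ids.get(1, []))  # [101, 102]
print(audio_ids.get(3, []))  # []
```

The same helper would work for any table whose rows expose a `sentence_id` attribute, assuming the whole index fits in memory.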
Hi @LBeaudoux,
Taking a look at https://tatoeba.org/en/downloads, the "sentences_with_audio" file has an audio_id field, which is essential for downloading the audio files associated with each sentence:
Sentence id [tab] Audio id [tab] Username [tab] License [tab] Attribution URL
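For illustration, a row in that layout could be read with the standard `csv` module as sketched below. The field names, the `parse_audio_rows` helper, and the sample rows are all made up for this example, and treating both IDs as integers is an assumption rather than something the downloads page states.

```python
import csv
import io

# Column order as listed in the row layout quoted above.
FIELDS = ["sentence_id", "audio_id", "username", "license", "attribution_url"]


def parse_audio_rows(lines):
    """Yield one dict per tab-separated row (hypothetical helper)."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        record = dict(zip(FIELDS, row))
        # Assumption: both IDs are integers.
        record["sentence_id"] = int(record["sentence_id"])
        record["audio_id"] = int(record["audio_id"])
        yield record


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "10\t5001\talice\tCC BY 4.0\thttps://example.com/alice\n"
    "11\t5002\tbob\t\t\n"
)
rows = list(parse_audio_rows(sample))
print(rows[0]["audio_id"])  # 5001
```

Empty trailing fields are kept as empty strings in this sketch.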
However, checking the SentenceWithAudio object in tatoebatools (https://github.com/LBeaudoux/tatoebatools/blob/master/tatoebatools/sentences_with_audio.py), the audio_id attribute doesn't seem to be present.
Would you mind adding the audio_id attribute to the file above (add a parameter to the constructor and an accessor like the ones for the other fields), and testing that it is fetched correctly and returns the right audio_id?
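To make the request concrete, here is a rough sketch of the shape being asked for. It is not the actual tatoebatools class: the name `SentenceWithAudioSketch`, the constructor parameters, and the use of properties are all assumptions for illustration, modeled on the column list above.

```python
class SentenceWithAudioSketch:
    """Hypothetical stand-in, NOT the real tatoebatools class."""

    def __init__(self, sentence_id, audio_id, username, license_, attribution_url):
        # Raw values are stored as given (strings when read from a TSV row).
        self._sentence_id = sentence_id
        self._audio_id = audio_id
        self._username = username
        self._license = license_
        self._attribution_url = attribution_url

    @property
    def sentence_id(self):
        """The id of the sentence, as an integer."""
        return int(self._sentence_id)

    @property
    def audio_id(self):
        """The id of the audio recording, as an integer (the requested field)."""
        return int(self._audio_id)

    @property
    def username(self):
        """The name of the contributor who recorded the audio."""
        return self._username


# Made-up row in the column order listed above.
row = "10\t5001\talice\tCC BY 4.0\thttps://example.com/alice".split("\t")
audio = SentenceWithAudioSketch(*row)
print(audio.audio_id)  # 5001
```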