LBeaudoux / tatoebatools

A library for fetching and reading Tatoeba's weekly exports
MIT License
20 stars, 4 forks

No audio_id attribute for SentenceWithAudioObject #9

Closed by jshamble 2 years ago

jshamble commented 2 years ago

Hi @LBeaudoux,

Taking a look at https://tatoeba.org/en/downloads, under "sentences_with_audio", there is an audio_id attribute, which is essential for downloading the audio files associated with each sentence.

Sentence id [tab] Audio id [tab] Username [tab] License [tab] Attribution URL
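
As a rough sketch of that layout, the snippet below parses tab-separated rows into named fields. The `AudioRow` type, the `parse_audio_rows` helper, and the sample values are made up for illustration; they are not part of tatoebatools, and the real export may differ in details such as how empty fields are written.

```python
import csv
import io
from typing import NamedTuple


class AudioRow(NamedTuple):
    """Hypothetical record mirroring the five columns listed above."""
    sentence_id: int
    audio_id: int
    username: str
    license: str
    attribution_url: str


def parse_audio_rows(text):
    """Parse tab-separated text into AudioRow records, skipping malformed rows."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        if len(fields) != 5:
            continue  # not the expected five-column layout
        sid, aid, user, lic, url = fields
        rows.append(AudioRow(int(sid), int(aid), user, lic, url))
    return rows


# Made-up sample data, not taken from the real export.
sample = "1\t100\talice\tCC BY 4.0\thttps://example.org/alice\n2\t200\tbob\t\t\n"
rows = parse_audio_rows(sample)
```

With rows in this shape, the audio id sits next to the sentence id, which is what a download script would need.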

However, the SentenceWithAudio object in tatoebatools (https://github.com/LBeaudoux/tatoebatools/blob/master/tatoebatools/sentences_with_audio.py) doesn't seem to have an audio_id attribute.

Would you mind adding the audio_id attribute to the file above (a parameter in the constructor and an accessor like the ones for the other fields), and testing that it returns the correct audio_id?

LBeaudoux commented 2 years ago

Hi @jshamble,

Thanks for pointing out this bug. The data file dedicated to audios has indeed been modified recently. I plan to make the necessary changes in tatoebatools as soon as possible.

LBeaudoux commented 2 years ago

Note that this issue is related to https://github.com/Tatoeba/tatoeba2/pull/2957

LBeaudoux commented 2 years ago

@jshamble The audio ID attribute should be available in the new v0.2.1 of tatoebatools. Can you confirm?

jshamble commented 2 years ago

@LBeaudoux Yes, I just tested it and it's working great. Many thanks for the quick work on this one!

jshamble commented 2 years ago

@LBeaudoux

Looking up the audio id by sentence id turns out to be unusably slow: the code below has been running for a few hours. This happens when iterating over the parallel corpus and using the sentence_id as a key. I assume it is O(n^2) because the list comprehension scans every audio entry inside the outer loop. Here's some sample Python code using German, a language with a lot of audio samples:

from tatoebatools import tatoeba, ParallelCorpus

lang = "deu"
audioSentences = tatoeba.sentences_with_audio(lang)
for sentence, translation in ParallelCorpus("eng", lang):
    # scans all audio entries once per sentence pair
    audioIdArray = [sentenceWithAudio.audio_id for sentenceWithAudio in audioSentences if sentenceWithAudio.sentence_id == translation.sentence_id]

I can think of two possible solutions, but would like to hear your ideas, and possibly see tatoebatools modified to support this operation more easily.


LBeaudoux commented 2 years ago

"Is there any way to get O(1) access similar to a hashmap?"

Maybe you could try:

from collections import defaultdict
from tatoebatools import tatoeba, ParallelCorpus

lang = "deu"

lang_audios = defaultdict(list)
for audio in tatoeba.sentences_with_audio(lang):
    lang_audios[audio.sentence_id].append(audio.audio_id)

for sentence, translation in ParallelCorpus("eng", lang):
    audioIdArray = lang_audios.get(translation.sentence_id, [])

jshamble commented 2 years ago

Yeah, that works. This would probably be useful to add to the README (i.e. how to use ParallelCorpus with other tables efficiently).

LBeaudoux commented 2 years ago

I added an example with audios to the README.
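
For readers who want to try the indexing pattern from this thread without downloading any data, here is a minimal sketch. The `Audio` tuple, the `build_audio_index` helper, and the sample ids are stand-ins invented for illustration; only the `sentence_id` and `audio_id` attribute names come from the discussion above.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the objects yielded by tatoeba.sentences_with_audio;
# only the two attributes used in the thread are modelled.
Audio = namedtuple("Audio", ["sentence_id", "audio_id"])


def build_audio_index(audios):
    """Group audio ids by sentence id in a single pass over the audio entries."""
    index = defaultdict(list)
    for audio in audios:
        index[audio.sentence_id].append(audio.audio_id)
    return index


# Made-up sample data: sentence 1 has two recordings, sentence 3 has none.
audios = [Audio(1, 10), Audio(2, 20), Audio(1, 11)]
index = build_audio_index(audios)

# Each lookup is an average O(1) dictionary access instead of a full scan.
# Using .get() avoids inserting empty lists for sentences without audio.
translation_ids = [1, 2, 3]
audio_ids_by_translation = {sid: index.get(sid, []) for sid in translation_ids}
```

Building the index costs one pass over the audio entries, so the total work is roughly proportional to the number of audio entries plus the number of sentence pairs, rather than their product.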