Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

The exported CSV file including the list of sentences with audio and their date isn't updated #3056

Closed Guybrush88 closed 1 year ago

Guybrush88 commented 1 year ago

To Reproduce: I was browsing this page to see the list of exported files: https://downloads.tatoeba.org/exports/ and I noticed that the ones containing the list of sentences with audio were last updated on 10-Oct-2019.

Expected behavior: The files containing the list of sentences with audio and their date should be updated weekly along with the other exported files.


ckjpn commented 1 year ago

This isn't a bug. That file is not part of the weekly export.

I asked TRANG for this information and she generated the file for me and put it here.

That said, I would love to see this updated from time to time, even if not every week.

jiru commented 1 year ago

As CK said, that file isn’t part of the weekly export.

@Guybrush88 Do you want that file to be included in the weekly export?

Guybrush88 commented 1 year ago

> @Guybrush88 Do you want that file to be included in the weekly export?

In my opinion, that could be relevant information for people using Tatoeba's data on external websites. I would personally use such files too, if they were regularly updated and I wanted to properly reuse the audio. That said, other opinions are welcome so we can find a proper and more effective solution.

ckjpn commented 1 year ago

If this info is meant for people using Tatoeba's data on external websites, then including the licensing for each file might be a good idea. That info is already in the sentences_with_audio.tar.bz2 file.

Maybe just adding the dates into that file would accomplish what Guybrush88 desires.

That would be useful information for me, too.

jiru commented 1 year ago

As CK mentioned, sentences_with_audio.csv already has enough data to properly give attribution. The only information in sentences_with_audio_and_date.csv that is not already in sentences_with_audio.csv is the creation time and last modification time. However, this data is not very accurate. All audio that was uploaded before #1378 was merged has its date set to zero, which accounts for about 30% of all the files we have now. Between #1378 and #2880, disabling an audio used to reset the date, and I know that audio was disabled and re-enabled a number of times in order to temporarily allow editing sentences. I think the last modification date of the mp3 file could be a much better indicator, but we do not export it at the moment.

Because of this, I suggest I just remove that file and close this ticket.
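For anyone who kept a copy of the file and wants to check how many of its dates are unusable, here is a minimal Python sketch. The tab-separated layout, the position of the date column, and the exact values that count as a "zero" date are all assumptions made for illustration; they are not taken from the real export.

```python
import csv
import io

# Values treated as "no date recorded". This set is an assumption,
# not something confirmed against the actual export.
ZERO_DATES = {"", "0", "\\N", "0000-00-00 00:00:00"}


def zero_date_share(lines, date_column):
    """Return the fraction of rows whose date field is empty or zero."""
    total = 0
    zero = 0
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) <= date_column:
            continue  # skip rows too short to contain the date column
        total += 1
        if row[date_column].strip() in ZERO_DATES:
            zero += 1
    return zero / total if total else 0.0


# Made-up sample rows: sentence id, username, creation date (hypothetical layout).
sample = io.StringIO(
    "1\talice\t0000-00-00 00:00:00\n"
    "2\tbob\t2019-10-10 08:00:00\n"
    "3\tcarol\t2021-03-01 12:30:00\n"
    "4\tdave\t0000-00-00 00:00:00\n"
)
print(zero_date_share(sample, date_column=2))  # 2 of 4 sample rows are zero
```

To run it on a real file, pass an open file object instead of the sample and adjust `date_column` to the actual layout.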

ckjpn commented 1 year ago

What I had actually asked TRANG for at the time she created this file was directory listings with the dates on the files. Assuming the files haven't had their dates changed when they were moved around, those dates are probably more accurate for what I wanted.

I wanted to know this so that I could more quickly see which of my own audio files I might want to listen to and consider re-recording.

I would love to get such directory listings now if that's something you could do for me. I'm primarily interested in only the English audio files, but I could likely make use of a complete listing of all files.

I think closing this ticket would be OK. I think TRANG just put the file I requested in that directory for me with the intention of not leaving it there.
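A directory listing like the one requested above could be produced with something along these lines. This is only a sketch: the function name `list_audio_by_mtime` and the `eng/1.mp3` layout in the demo are hypothetical, and the result is only as reliable as the files' modification times, which may have changed if the files were copied or moved.

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def list_audio_by_mtime(root, suffix=".mp3"):
    """Return (modification time, relative path) pairs, oldest first."""
    root = Path(root)
    entries = []
    for path in root.rglob("*" + suffix):
        if path.is_file():
            mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
            entries.append((mtime, path.relative_to(root).as_posix()))
    return sorted(entries)


# Demo on a temporary directory with made-up files and timestamps.
with tempfile.TemporaryDirectory() as tmp:
    for name, ts in (("eng/2.mp3", 1_600_000_000), ("eng/1.mp3", 1_500_000_000)):
        path = Path(tmp) / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(b"")
        os.utime(path, (ts, ts))  # set access and modification times
    for mtime, rel in list_audio_by_mtime(tmp):
        print(mtime.isoformat(), rel)
```

Sorting oldest first puts the recordings most likely to need re-recording at the top of the listing.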

jiru commented 1 year ago

I removed these two files: sentences_with_audio_and_date.tar.bz2 and sentences_with_audio_and_date.csv.