Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Possibility to filter sentences according to their licence #1788

Open RyckRichards opened 5 years ago

RyckRichards commented 5 years ago

Now that we handle sentences under CC BY 2.0 and CC0, it'd be good for developers/course creators to be able to filter sentences according to their license. As mentioned by CK in https://tatoeba.org/eng/wall/show_message/31351#message_31351, we'd have to "climb a mountain" to get such a thing (at least for me, as I'm not experienced in programming):

  1. Download all the sentences. http://downloads.tatoeba.org/ex...tences.tar.bz2
  2. Download the sentence numbers with CC0 license. https://downloads.tatoeba.org/e...es_CC0.tar.bz2
  3. Grab all the sentences with these numbers from the sentences.csv file.
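For readers comfortable with a little scripting, step 3 can be sketched in Python. This is only a minimal sketch, assuming both archives have already been extracted, that both files are tab-separated, and that the sentence number is the first field on every line. The function name `filter_by_ids` and the file names in the usage note are hypothetical.

```python
def filter_by_ids(sentences_path, ids_path, out_path):
    """Copy the lines of sentences_path whose sentence number appears in ids_path.

    Assumption: in both files the sentence number is the first
    tab-separated field of each line.
    """
    # Collect the wanted sentence numbers into a set for fast lookup.
    with open(ids_path, encoding="utf-8") as f:
        wanted = {line.split("\t", 1)[0].strip() for line in f if line.strip()}

    kept = 0
    with open(sentences_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Keep the line unchanged if its sentence number is wanted.
            if line.split("\t", 1)[0].strip() in wanted:
                dst.write(line)
                kept += 1
    return kept
```

Calling `filter_by_ids("sentences.csv", "sentences_CC0.csv", "cc0_only.csv")` would write the matching lines to `cc0_only.csv` and return how many were kept. The large file is read line by line, so it never has to fit in memory.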

This would work like the filters we already have for sentences written by native speakers and for sentences with audio attached.

ckjpn commented 5 years ago

I'd say this is a low priority request, since at this time there aren't enough CC0 licensed sentences. Anyone using the data would pretty much have to grab sentences from the whole corpus to make it worthwhile.

trang commented 5 years ago

As replied on the Wall:

The sentences_CC0.csv file has the text of the sentences already. There is no need to do further processing, you can just download the file.

@RyckRichards Do you need anything more than what's in the file?

RyckRichards commented 5 years ago

I'm not that good at working with .csv files. It'd be good (if it doesn't take too much effort, of course) if there were a way to select which sentences we want to download.

trang commented 5 years ago

What criteria exactly would you use to select the sentences?

I suppose you would want all CC0 sentences in a specific language (not in all languages). Is there anything else?
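Selecting the CC0 sentences of one language from the downloaded file could look roughly like this. It is a sketch under assumptions, not a description of the actual export: it assumes each line of sentences_CC0.csv is tab-separated with the sentence number, a language code, and the sentence text as its first three fields. The function name `sentences_in_language` is hypothetical.

```python
def sentences_in_language(path, lang):
    """Yield (sentence_number, text) for every line whose language code is lang.

    Assumption: fields are tab-separated in the order
    sentence number, language code, text.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            # Skip lines that do not have the three assumed fields.
            if len(fields) >= 3 and fields[1] == lang:
                yield fields[0], fields[2]
```

For example, `list(sentences_in_language("sentences_CC0.csv", "por"))` would collect the Portuguese sentences, provided the file uses that code for Portuguese.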

RyckRichards commented 5 years ago

> What criteria exactly would you use to select the sentences?

Most translated, audio attached, new sentences, old sentences
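Two of these criteria, "most translated" and "audio attached", could be combined along the following lines. This is a hedged sketch on in-memory data: it assumes translation links are available as (sentence number, translation number) pairs listed in both directions, and that the numbers of sentences with audio are available as a set. The function name `rank_by_translations` is hypothetical.

```python
from collections import Counter

def rank_by_translations(links, candidate_ids, with_audio=None):
    """Return candidate_ids sorted from most to least translated.

    links: iterable of (sentence_number, translation_number) pairs,
           assumed to list every link in both directions.
    with_audio: optional set of sentence numbers that have audio;
                if given, candidates without audio are dropped.
    """
    # Count how many translations each sentence has.
    counts = Counter(src for src, _ in links)
    ids = [i for i in candidate_ids if with_audio is None or i in with_audio]
    # Sort by descending translation count, then by sentence number.
    return sorted(ids, key=lambda i: (-counts[i], i))
```

With `links = [(1, 2), (1, 3), (2, 1), (3, 1)]`, calling `rank_by_translations(links, [3, 2, 1])` puts sentence 1 first because it has two translations.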

> At this time there are so few sentences in the file that you can easily open the file in a word processor, or in Excel or Google Sheets. You can use OpenOffice if you don't own Excel.

Agreed, but I believe that will change soon.

trang commented 5 years ago

@RyckRichards When you say "possibility to filter", what exactly do you expect? Do you need a file containing all CC0 sentences that match specified criteria, or do you need them as search results so you can pick sentences to make lists?

RyckRichards commented 5 years ago

Yes, that's right.

--

Ricardo Vernaut Junior

trang commented 5 years ago

That's not answering my question... Does that mean you need both?

RyckRichards commented 5 years ago

Oh sorry. Yes, both of them.
