Tatoeba / tatoeba2

Tatoeba is a platform whose purpose is to create a collaborative and open dataset of sentences and their translations.
https://tatoeba.org
GNU Affero General Public License v3.0

Allow retrieving users' language levels using the API #3076

Closed (jiru closed this issue 1 year ago)

jiru commented 1 year ago

Story

A student in computational linguistics contacted us by email.

Currently, I am doing an NLP seminar project, where I need to analyze a paper's dataset, whose data is collected from Tatoeba. One focus of my project is to look at the language levels of the contributors. I checked Tatoeba API, and unfortunately, I didn't find a function that retrieves a user's language level. So, I wonder if there is any way to achieve this through script.

About my workflow: In the data corpus (Tapaco dataset) I use, they provide the sentence ID from Tatoeba for each data entry. I aimed to retrieve the metadata (self-proclaimed language level) about the user who posted this sentence with only the sentence ID.

I have tried sending a GET request to the URL "https://tatoeba.org/en/user/profile/{username}", hoping it would return the metadata about this user in JSON. But unfortunately, it didn't work.

My current workaround is: I retrieve the username from the corpus (OPUS corpus) where the Tapaco dataset originated and then get the self-proclaimed language skill level of this username from the user_languages file downloaded from "https://tatoeba.org/en/downloads".
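For reference, the workaround described above could look roughly like the sketch below. It assumes the user_languages export is tab-separated with the columns language, skill level, username, details, and that an unspecified level is written as `\N`; neither detail is stated in this thread, so check the downloads page before relying on it. The function name `load_language_levels` is made up for illustration.

```python
import csv
from collections import defaultdict


def load_language_levels(lines):
    """Map username -> {language code: skill level or None}.

    Assumed (not confirmed here) row layout, tab-separated:
    language, skill level, username, details.
    """
    levels = defaultdict(dict)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed rows
        lang, level, username = row[0], row[1], row[2]
        # Assumption: a non-numeric level (e.g. "\N") means "unspecified".
        levels[username][lang] = int(level) if level.isdigit() else None
    return dict(levels)


# Possible usage (the file name is an assumption):
# with open("user_languages.csv", encoding="utf-8", newline="") as f:
#     levels = load_language_levels(f)
# print(levels.get("gillux"))
```

Loading the whole file into a dictionary once keeps the per-sentence lookup cheap when a corpus references many usernames.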

Idea

We add a new endpoint to the API to retrieve information about a user, including language levels.

jiru commented 1 year ago

So you are doing it in two steps:

  1. Get username from sentence ID
  2. Get language level from username

It is now possible to accomplish step one using this URL: https://api.tatoeba.org/unstable/sentences/1234. The result is a JSON description of sentence 1234 that includes an owner key whose value is the username.
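A minimal sketch of step one in Python, using only the standard library. The URL pattern and the owner key come from the comment above; whether owner sits at the top level of the response or inside a wrapper object such as "data" is an assumption, so the helper checks both. The function names are made up for illustration.

```python
import json
import urllib.request

API_BASE = "https://api.tatoeba.org/unstable"


def sentence_url(sentence_id):
    """Build the sentence URL described above."""
    return f"{API_BASE}/sentences/{int(sentence_id)}"


def extract_owner(payload):
    """Return the username stored under the owner key, or None.

    Assumption: owner is either a top-level key or nested under a
    "data" wrapper; adjust after inspecting a real response.
    """
    if not isinstance(payload, dict):
        return None
    if "owner" in payload:
        return payload["owner"]
    inner = payload.get("data")
    if isinstance(inner, dict):
        return inner.get("owner")
    return None


def fetch_owner(sentence_id, timeout=10):
    """Fetch one sentence and return its owner's username (network call)."""
    with urllib.request.urlopen(sentence_url(sentence_id), timeout=timeout) as resp:
        return extract_owner(json.load(resp))


# Possible usage:
# print(fetch_owner(1234))
```

Since the API path contains "unstable", the response layout may change; keeping the parsing in one small helper makes it easy to adapt.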

I am working on step two.

jiru commented 1 year ago

@sun2i I started implementing step two; it is available on the dev server: https://api.dev.tatoeba.org/unstable/users/gillux

Note that dev.tatoeba.org only contains an outdated and partial copy of tatoeba.org, so you may only find some of the users.

jiru commented 1 year ago

@sun2i This is now available on https://api.tatoeba.org/.
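Step two could be sketched as follows. It assumes the production path mirrors the dev server one shown earlier, that is `/unstable/users/{username}` on api.tatoeba.org. The field names of the user description are not given in this thread, so the sketch returns the parsed JSON as-is rather than guessing where the language levels live. The function names and the `opener` parameter are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.tatoeba.org/unstable"


def user_url(username):
    """Build the user URL (assumed to mirror the dev server path)."""
    return f"{API_BASE}/users/{urllib.parse.quote(username, safe='')}"


def fetch_user(username, opener=urllib.request.urlopen):
    """Return the parsed JSON description of a user.

    `opener` is injectable so the function can be exercised without
    network access; by default it performs a real HTTP request.
    """
    with opener(user_url(username)) as resp:
        return json.load(resp)


# Possible usage: inspect the response to find the language levels.
# print(json.dumps(fetch_user("gillux"), indent=2))
```

Chaining this with step one gives the sentence ID to language level lookup from the original request: fetch the sentence, read its owner, then fetch that user.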

So the API endpoints are: