Different Json for different Language and type of user

OTTAA-Project / ottaa_project_flutter

Join us to create the first predictive augmentative communication platform for speech-impaired children!

https://ottaa-project.github.io/

GNU General Public License v3.0

10 stars, 4 forks

Different Json for different Language and type of user #48

Closed hectoritr closed 2 years ago

hectoritr commented 2 years ago

Describe the solution you'd like

When a new user logs in, we should provide a trained JSON model for their predictions based on their preferred gender and age.

We are covering several types of genders:

Male
Female
Fluid
Binary
Other

And 3 age types:

Adult
Young
Child

So we would need to create 15 different models, one for each gender and age combination (5 × 3). @gonojuarez will fetch the latest data from the database.
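As a quick check of the count, the combinations can be enumerated. This is only a sketch: the `model_<locale>_<gender>_<age>` naming scheme and the `model_keys` helper are hypothetical and not taken from the OTTAA code base.

```python
from itertools import product

# Labels as listed in this issue.
GENDERS = ["Male", "Female", "Fluid", "Binary", "Other"]
AGE_GROUPS = ["Adult", "Young", "Child"]


def model_keys(locale="es"):
    """Return one (hypothetical) model name per gender/age combination."""
    return [
        f"model_{locale}_{gender.lower()}_{age.lower()}"
        for gender, age in product(GENDERS, AGE_GROUPS)
    ]


if __name__ == "__main__":
    keys = model_keys()
    print(len(keys))  # 5 genders x 3 age groups = 15
```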

@lopezjuanma96 will train and provide the different models. We can use Miguel's algorithm, but we can also use the metadata currently in the JSON file to improve the model, such as the user's age and the time of day of each sentence, to TAG them.

@asimjawad will do the flutter implementation.

Comment below when your work is done.


Action Plan

[x] Download data from OTTAA's database.
[x] Create models (15).
[x] Review models by field experts.
[x] Download correct model on Login.

lopezjuanma96 commented 2 years ago

[x] Load JSON with phrases, date and user
[x] Search user database for gender and age of each user and assign to phrase to generate dataset
[x] Scrub phrases to assign each pictogram a tag of age and gender, and count frequencies.
[x] Save phrases in different folders based on age/gender combination, to train Miguel's Algorithm. TECHNICALLY THIS IS DONE: WE ARE ONLY PRINTING THE PHRASES AND THE FOLDER THEY WOULD GO TO, BUT WE'LL SAVE THEM PROPERLY WHEN WE GET ALL THE DATA.
[x] Scrub phrases to assign each pictogram a tag of Time of the Day, and count frequencies.
[x] Since some pictos already have tags, assigned by the app or the users, we might generate two types of tags, "calculated" and "assigned", and see if we can then compute something out of that comparison.
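The counting step in the list above could look roughly like the sketch below. It is not the actual scrubbing script: the `user` key on each phrase and the shape of the `users` lookup (user id mapped to `edad` and `sexo` tags) are assumptions, and weighting each pictogram by the phrase's `frecuencia` is one possible choice among several.

```python
from collections import Counter, defaultdict


def count_picto_frequencies(phrases, users):
    """Count pictogram ids per (age tag, gender tag) group.

    phrases: list of phrase dicts; the "user" key is an assumed field.
    users:   assumed lookup {user_id: {"edad": ..., "sexo": ...}}.
    """
    counts = defaultdict(Counter)
    for phrase in phrases:
        profile = users.get(phrase.get("user"))
        if profile is None:
            continue  # no age/gender known for this user, skip the phrase
        group = (profile["edad"], profile["sexo"])
        weight = phrase.get("frecuencia", 1)
        for picto in phrase["complejidad"]["pictos componentes"]:
            counts[group][picto["id"]] += weight
    return counts
```

The result maps each age/gender pair to a `Counter`, so `counts[("ADULTO", "FEMENINO")].most_common(10)` would list the ten most used pictograms for that group.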

lopezjuanma96 commented 2 years ago

ISSUES REGARDING THE PHRASES JSON:

Look at this example of a phrase JSON:

{"frase":"Hola Buen día tengo castillo ","frecuencia":3,"complejidad":{"valor":0,"pictos componentes":[{"id":377,"esSugerencia":false},{"id":379,"horario":["MANANA"],"esSugerencia":false,"hora":["MANANA"]},{"id":49,"esSugerencia":false},{"id":945324633}]},"fecha":[1637327873525,1637327994767,1637329129227],"locale":"es","id":0}

some things we have to decide are:

[x] What do we do with an id like the last one? I believe it has that value because the picto was created or downloaded. Adding it to the database can be done, but it will be complex since there's no simple relation between picto position and word position (in this case it's at the end of the phrase, but if it were in the middle it would be much harder and more error-prone).
[x] If we are going to use the tags set on the pictos aside from the ones "calculated" with the timestamps and user data, we have to solve why one picto has a key "horario" and the other has a key "hora". SOLUTION: @gonojuarez will put the keys and available tags in a comment on this issue, so that the scrubbing algorithm is compatible with current OTTAA state.
[x] Another important thing we already discussed, but I'll write it anyway so that we can keep it somewhere, is that the timestamps in each phrase are measured in MILLISECONDS from epoch.
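Two of the points above can be sketched in a few lines: turning a millisecond timestamp into a "calculated" time-of-day tag, and reading "assigned" tags from either the `hora` or the `horario` key. The hour boundaries and the `utc_offset_hours` parameter are assumptions; this thread does not define when MANANA ends or which time zone applies to each user.

```python
from datetime import datetime, timedelta, timezone


def time_of_day_tag(timestamp_ms, utc_offset_hours=0):
    """Map a timestamp in milliseconds since epoch to a time-of-day tag.

    The hour ranges below are assumed for illustration only.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    hour = datetime.fromtimestamp(timestamp_ms / 1000, tz=tz).hour
    if 6 <= hour < 12:
        return "MANANA"
    if 12 <= hour < 14:
        return "MEDIODIA"
    if 14 <= hour < 20:
        return "TARDE"
    return "NOCHE"


def assigned_time_tags(picto):
    """Merge tags stored under 'hora' and 'horario' into one sorted list."""
    return sorted(set(picto.get("hora", [])) | set(picto.get("horario", [])))


if __name__ == "__main__":
    # First timestamp of the example phrase, read with an assumed UTC-3 offset.
    print(time_of_day_tag(1637327873525, utc_offset_hours=-3))
    print(assigned_time_tags({"id": 379, "horario": ["MANANA"], "hora": ["MANANA"]}))
```

Dividing by 1000 before calling `datetime.fromtimestamp` is the important part: passing the raw millisecond value would land tens of thousands of years in the future or raise an error.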

lopezjuanma96 commented 2 years ago

Span of keys and tags the app is using at the moment:

'key': {'TAGS'}

'hora': {'MANANA', 'MEDIODIA', 'TARDE', 'NOCHE'}
'edad': {'ADULTO', 'JOVEN', 'NINO'}
'sexo': {'FEMENINO', 'MASCULINO', 'BINARIO', 'FLUIDO', 'BINARIO'}
'ubicacion': {'ESTADIO', 'PARQUE',... //we are not using it for now

lopezjuanma96 commented 2 years ago

Possible Sources for Training Miguel's Algorithm:

EDIT: maybe it's best to work with medical questions datasets, which will actually include what our users would say in a medical/hospital context. Then the last dataset from the list before (https://github.com/curai/medical-question-pair-dataset) might be the best to try first, and we should also add others we can find, such as:

These came up as examples after a really quick Google search; more thorough research might give even better results.

lopezjuanma96 commented 2 years ago

Possible Sources for Training Miguel's Algorithm:

Law/Court: we might find court records, especially transcripts, to be useful
- https://pacer.uscourts.gov/
- https://www.in.gov/courts/public-records/
- https://www.courts.ca.gov/42512.htm
- https://www.fedcourt.gov.au/services/access-to-files-and-transcripts/transcript
- https://www.rev.com/blog/how-to-get-court-transcripts

lopezjuanma96 commented 2 years ago

This might be useful: https://convokit.cornell.edu/documentation/datasets.html

lopezjuanma96 commented 2 years ago

Scientific Dataset: https://www.kaggle.com/datasets/Cornell-University/arxiv/code

The full dataset is REALLY large (1.1TB and growing), but we can download the metadata, which has all the titles, abstracts, authors, categories, etc. With it we can select categories for each model and train with the abstracts, or download some of the papers.
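Selecting abstracts by category could be sketched as below. It assumes the metadata is a JSON Lines file (one record per line) where each record has an `abstract` string and a space-separated `categories` string; that layout and the category names used in the example should be checked against the actual download before relying on this.

```python
import json


def abstracts_for_categories(lines, wanted):
    """Yield abstracts whose record lists at least one wanted category.

    lines:  any iterable of JSON strings, e.g. an open file object.
    wanted: iterable of category names (assumed format, e.g. "cs.CL").
    """
    wanted = set(wanted)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        categories = set(record.get("categories", "").split())
        if wanted & categories:
            yield record.get("abstract", "").strip()


# Hypothetical usage, streaming the file so it never sits fully in memory:
# with open("arxiv-metadata.json", encoding="utf-8") as fh:
#     for abstract in abstracts_for_categories(fh, {"q-bio.NC"}):
#         ...
```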

hectoritr commented 2 years ago

Download correct model on Login. When the user starts the app and selects the Gender and DoB, we should store those values to be used by the prediction algorithm.

Screen Shot 2022-08-25 at 12 17 48

The options currently are

Gender

Male
Female
Other

Age (calculate the right TAG based on the DoB)

Child 0-15 yr
Young 15-25 yr
Adult >25 yr

These values should be stored as profile info of the user and used in the prediction.

Then based on the gender you have to download the right JSON dataset.

picto_es_male
picto_es_fem
picto_es_other
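The selection logic described above could be sketched as follows (in Python for brevity, although the app itself is written in Flutter). The dataset names come from this comment; everything else is an assumption: the function names are hypothetical, age 15 is treated as Young, Adult is taken to mean over 25, and unknown genders fall back to `picto_es_other`.

```python
from datetime import date

# Dataset names as listed in this comment.
DATASET_BY_GENDER = {
    "Male": "picto_es_male",
    "Female": "picto_es_fem",
    "Other": "picto_es_other",
}


def age_in_years(dob, today):
    """Whole years between dob and today."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)


def age_tag(dob, today=None):
    """Return Child, Young or Adult; the boundary handling is an assumption."""
    age = age_in_years(dob, today or date.today())
    if age < 15:
        return "Child"
    if age <= 25:
        return "Young"
    return "Adult"


def dataset_for(gender):
    """Pick the dataset for a gender, falling back to the 'other' dataset."""
    return DATASET_BY_GENDER.get(gender, "picto_es_other")
```

Keeping the gender lookup in one table makes it easy to verify that a user who chose Female can never receive `picto_es_male`.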

hectoritr commented 2 years ago

@asimjawad is this done? Downloading the right model according to the user?

asimjawad commented 2 years ago

@hectoritr we did not do this. Add the required API here and I will be on it.

asimjawad commented 2 years ago

@lopezjuanma96 add them here.

hectoritr commented 2 years ago

This was resolved on #101

asimjawad commented 2 years ago

reopening this issue, because work on some parts was not done.

hectoritr commented 2 years ago

@asimjawad here is what I found so far

This was on the default database

Screen Shot 2022-10-05 at 10 40 32

This was on the testing database

Screen Shot 2022-10-05 at 10 41 29

asimjawad commented 2 years ago

@hectoritr can you explain this a bit further? I will look at it in the morning.

hectoritr commented 2 years ago

I tried logging in as a female and changed the default database and not the testing one; I don't know which one you are using in dev. Just that. The main thing is that even though it asked me to choose the gender, it downloaded the male version.

asimjawad commented 2 years ago

@hectoritr we are using default one.

asimjawad commented 2 years ago

And as I told you, the JSONs are only loaded when a user creates a new account and chooses their gender. At that time we will upload and save the JSON according to their gender.