ec-doris / kohesio-backend

APIs serving Kohesio's frontend
https://kohesio.ec.europa.eu

Improve gensim model #74

Closed madewild closed 2 years ago

madewild commented 2 years ago

On https://dev.kohesio.eu/projects?keywords=ai we have a strange expanded keyword "( ai )-". Also, "artificial intelligent" is strange. Can we tweak the model to remove this noise?

drvenabili commented 2 years ago

Try this:

madewild commented 2 years ago

cf. https://github.com/ec-doris/kohesio-search/issues/5
guesthouse is OOV in h2020
words similar to "health" do not include "medical" nor "medicine", see https://similarity.cnect.eu/

madewild commented 2 years ago

@faustusdotbe what is the status of this? still in line for next release (10 days left!) or to postpone to January?

drvenabili commented 2 years ago

Ah sorry -- we actually discussed this with Roberto last week. The model is trained and available at s3://doris-word2vec, namely wiki_300_5_word2vec_SP-kohesio.kv.model and the vectors wiki_300_5_word2vec_SP-kohesio.kv.model.vectors.npy.

Note that this is a KeyedVectors model -- much faster than the Word2Vec -- and using it will require light code editing (m = gensim.models.KeyedVectors.load(model_path), the querying should be the same.)

madewild commented 2 years ago

OK nice. I think @D063520 and @AlyHdr are calling the https://similarity.cnect.eu/ service directly, can you confirm? @faustusdotbe then it would require modifying https://github.com/ec-doris/doris-similarity to allow querying this model (with a special parameter) while ensuring backward compatibility with older (non-KeyedVectors) models...
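One way to keep both kinds of model queryable is to unwrap whatever was loaded into the object that actually answers most_similar(). The sketch below is only an illustration, not the code in doris-similarity: as_keyed_vectors and the two Fake classes are made-up names, and the stand-in classes exist so the snippet runs without gensim or the model files. It relies on a full Word2Vec model keeping its vectors under .wv, as in the w.wv.most_similar calls used elsewhere in this thread.

```python
def as_keyed_vectors(model):
    """Return the object that exposes most_similar() for either model kind.

    A full Word2Vec model keeps its vectors under `.wv`; a KeyedVectors
    object is queried directly, so it is returned unchanged.
    """
    return getattr(model, "wv", model)


# Hypothetical stand-ins for the two kinds of loaded model.
class FakeKeyedVectors:
    def most_similar(self, word, topn=10):
        return [("artificial_intelligence", 0.62)][:topn]


class FakeWord2Vec:
    def __init__(self):
        self.wv = FakeKeyedVectors()


for loaded in (FakeWord2Vec(), FakeKeyedVectors()):
    vectors = as_keyed_vectors(loaded)
    print(type(loaded).__name__, vectors.most_similar("ai", topn=1))
```

With a helper like this, the querying code after loading stays identical for old and new models.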

drvenabili commented 2 years ago

@AlyHdr @D063520

This is now fixed (https://github.com/ec-doris/doris-similarity/commit/f4bb2157d4782cf947b0162ea883ba457445bd2d). You can query the different models in eg Python like so, with $model_name being either h2020 or kohesio. If nothing is specified, it defaults to h2020.

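import requests

API_URL = "..."  # the similarity service endpoint (not spelled out in this thread)
model_name = "kohesio"  # or "h2020", which is also the default when "model" is omitted

response = requests.post(API_URL, {"text": "pizza", "model": model_name})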

Note that the kohesio model is a very large model trained on EN wikipedia and finetuned on Kohesio, so despite being loaded in memory it is unfortunately a bit slower than the h2020 on this machine.

madewild commented 2 years ago

@faustusdotbe I searched "health" with the new model on https://similarity.cnect.eu/# and the results are not so great...

drvenabili commented 2 years ago

Looks like the Kohesio data introduced issues:

>>> import gensim
>>> w = gensim.models.Word2Vec.load("wiki_300_5_word2vec.model")
>>> k = gensim.models.Word2Vec.load("wiki_300_5_word2vec_SP-kohesio.model")
>>> w.wv.most_similar("health")
[('mental_health', 0.6961499452590942), ('healthcare', 0.687968373298645), ('wellbeing', 0.6430804133415222), ('nutrition', 0.6340615153312683), ('services_administration_hrsa', 0.6319378018379211), ('chronic_diseases', 0.6305569410324097), ('administration_osha', 0.6259971261024475), ('welfare', 0.6231546401977539), ('cbhep', 0.6225051283836365), ('communicable_diseases', 0.6121703386306763)]
>>> k.wv.most_similar("health")
[('ireps', 0.6490586996078491), ('guidelines_dictated_by_modern_trends', 0.6384396553039551), ('does_not_entail_medical', 0.6316791772842407), ('fight_against_covid_19', 0.6298130750656128), ('interest_stores', 0.6281540989875793), ('mazovia_”.', 0.6262164115905762), ('&_wellness', 0.6206086277961731), ('mobility_,...),_through', 0.619819164276123), ('factors_causing', 0.6184598207473755), ('equipment_catalogue_prepared', 0.6178537011146545)]

I'll retrain with fewer epochs on Kohesio data and see if it fixes it. Do we have a list of words that we need to be correct? For now I have those below, to which I add health. It would be helpful in case several iterations are needed.

truth = {
    "ai": ["artificial_intelligence"],
    "5g": ["hspda", "umts"],
    "4g": ["hspda", "umts"],
}

madewild commented 2 years ago

@roberto-musmeci could you provide a few good words to help testing the model?

drvenabili commented 2 years ago

It's quite tricky of course as, e.g., specialising on kohesio does help with "ai":

>>> w.wv.most_similar("ai")
[('yobundaze', 0.6389608383178711), ('somnium_files', 0.6207020878791809), ('torimodose', 0.6126745343208313), ('tegotae', 0.6041201949119568), ('nikushimi', 0.5994518995285034), ('shitteita', 0.5938058495521545), ('tsutsumarete', 0.5897074937820435), ('afures', 0.5876941084861755), ('rukeichi', 0.5853627324104309), ('tsukarete', 0.585219144821167)]
>>> k.wv.most_similar("ai")
[('artificial_intelligence', 0.6192606091499329), ('machine_learning', 0.6147264838218689), ('big_data_analytics', 0.6080207228660583), ('passing_relays', 0.5986993312835693), ('machine_learning_techniques', 0.5877189040184021), ('edge_computing', 0.5799916982650757), ('predictive_algorithms', 0.5776605606079102), ('based_on_artificial_intelligence', 0.5766475200653076), ('expanded_reality', 0.5765467286109924), ('artificial_intelligence_ai', 0.5751122832298279)]

(Results in vanilla wiki refer to Japanese rock band Sambomaster, with song Sekai wa sore Ai to Yobunda ze which apparently is the theme song of anime Naruto... )

roberto-musmeci commented 2 years ago

Just to understand guys, would you like to have a list of words that work or that do not work in the current Kohesio model?

madewild commented 2 years ago

some that SHOULD work ;) just some interesting words that people are likely to use...

drvenabili commented 2 years ago

Ideally, a list of words that are very important to Kohesio as a whole and for which synonyms ought to be correct in vector space. I was thinking of your rehabilitation example, for instance. If you can also provide some synonyms you expect for interesting words, that would be perfect:

eg:
ai must/should be in the neighbourhood of artificial_intelligence, machine_learning, etc.
5g must/should be in the neighbourhood of hspda, umts
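Such a list can be turned into a check that is rerun after every retraining. The following is a minimal sketch under stated assumptions: evaluate and fake_most_similar are hypothetical names, and the fake lookup stands in for a real model's most_similar so the snippet runs without gensim. With a real model you would pass k.wv.most_similar instead.

```python
def evaluate(most_similar, truth, topn=10):
    """Return, per query word, the expected neighbours missing from the top-n hits."""
    missing = {}
    for word, expected in truth.items():
        try:
            hits = {token for token, _score in most_similar(word, topn=topn)}
        except KeyError:
            # gensim raises KeyError for out-of-vocabulary words
            hits = set()
        absent = [token for token in expected if token not in hits]
        if absent:
            missing[word] = absent
    return missing


truth = {
    "ai": ["artificial_intelligence", "machine_learning"],
    "5g": ["hspda", "umts"],
}

# Hypothetical stand-in for a model: it only knows "ai".
FAKE_NEIGHBOURS = {
    "ai": [("artificial_intelligence", 0.62), ("machine_learning", 0.61)],
}


def fake_most_similar(word, topn=10):
    return FAKE_NEIGHBOURS[word][:topn]


print(evaluate(fake_most_similar, truth))  # {'5g': ['hspda', 'umts']}
```

An empty result means every expected neighbour was found; anything else lists what still needs fixing.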

roberto-musmeci commented 2 years ago

Ok, perfect. @faustusdotbe do you already have a list of the most frequent English words in the Kohesio corpus? Otherwise I'll create it and start from there

drvenabili commented 2 years ago

The counter is running, I'll update this comment with the file and ping you once it's done :-)

roberto-musmeci commented 2 years ago

I'll write some suggestions here and keep expanding the list over time

Transportation: bridge, tram, tramway, railway, highway, road, micro_mobility, green_mobility, railway station, terminal, bus, bikes, bike, sharing, tunnel

Technology: ai, machine learning, data science, quantum computing, research and development, smart grid, district heating, electricity grid, broadband, wi-fi, ITC, energy efficiency, Digital animation, 3D animation, web design

Industrial sectors and products: retail, telecommunication, wood, water management, waste management, sewage, utilities, hydrogen, gas, gas plant, carbon plant, solar panels, wind turbines, hotellerie, restaurants, cafe, tourism attractions, hotels, wellness centre, mining, quarrying, transportation, communication, ITC, refrigeration

Support to economic actors: entrepreneurship, internationalisation

Culture and art: museum, architecture, design, gallery, mosaics, frescos, restoration, castles, cultural heritages, old town, tourism animation

Labour markets (ESF mostly): internships, traineeship, mobile workers, training, migrant workers, jobs, vacancies, workers, young worker, job seeker, NEET, unemployed, employees, under employed, salary, part-time, full-time, maternity leave, parental leave, sick leave, vocational training, vocational centre, professional course, people with disability, disabled people, job opportunity, entry-level, education opportunity, re-skilling, up-skilling, risk of unemployment, staff, long-term unemployed, inactive, Public Employment Centre, public benefits, unemployment benefit, professional activation

Social policies: inclusion, over-indebtedness, financial difficulties, fraud, abuse, support network

drvenabili commented 2 years ago

@roberto-musmeci
kohesio_counts_min100.csv

Here's a tab-separated CSV; the minimum frequency threshold is 100
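For reference, a file of that shape can be produced roughly as follows. This is only a sketch, not the script that generated kohesio_counts_min100.csv: the whitespace tokenisation, the word/count column names and the two toy sentences are assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def count_words(texts):
    """Count lowercased, whitespace-separated tokens over all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def write_counts(counts, handle, min_count=100):
    """Write word/count rows, tab-separated, keeping words seen at least min_count times."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["word", "count"])
    for word, count in counts.most_common():
        if count >= min_count:
            writer.writerow([word, count])


# Toy input; the threshold is lowered to 2 so that something survives.
texts = ["renewable energy project", "energy efficiency project"]
buffer = io.StringIO()
write_counts(count_words(texts), buffer, min_count=2)
print(buffer.getvalue())
```

Passing an open file instead of the StringIO buffer writes the same rows to disk.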

roberto-musmeci commented 2 years ago

It would be interesting to have a count matrix of words by country. It seems that certain words are used more frequently in one country or programme and tend to have a more specific meaning in that context. The word animation in French is more of a synonym for launching something (e.g. an initiative/project).

Similarly, warming is mostly used in Poland in the context of thermo-modernisation and energy efficiency
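A word-by-country count matrix of this kind could be sketched as below. The (country, text) record layout, the function names and the four toy records are assumptions for illustration only; they are not taken from the Kohesio data.

```python
from collections import Counter, defaultdict


def counts_by_country(records):
    """Build {word: Counter({country: occurrences})} from (country, text) pairs."""
    matrix = defaultdict(Counter)
    for country, text in records:
        for word in text.lower().split():
            matrix[word][country] += 1
    return matrix


def country_share(matrix, word):
    """Share of a word's occurrences coming from each country."""
    row = matrix.get(word, Counter())
    total = sum(row.values())
    return {country: n / total for country, n in row.items()} if total else {}


# Invented toy records, only to show the shape of the result.
records = [
    ("PL", "warming of the building"),
    ("PL", "warming and thermo modernisation"),
    ("FR", "animation of the project"),
    ("IT", "global warming study"),
]
matrix = counts_by_country(records)
print(dict(matrix["warming"]))
print(country_share(matrix, "warming"))
```

A word whose share is dominated by one country is a candidate for having a country-specific meaning.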

drvenabili commented 2 years ago

I can make this, but I wonder whether our assumptions about language will hold given everything is machine-translated.

drvenabili commented 2 years ago

Should be fixed now, please reopen if issues arise.

madewild commented 2 years ago

https://kohesio.eu/projects?keywords=youth seems strange... reopening just in case there is a fix ;)

madewild commented 2 years ago

this seems related to the overwhelming number of small Italian projects... for instance https://kohesio.eu/projects?keywords=%22it%20lasts%206%20months%22. Maybe once we have cleaned out Italy we should retrain the gensim model

roberto-musmeci commented 2 years ago

The case of 'renewable' also needs further investigation later on. Here it seems related to a number of French projects. https://dev.kohesio.eu/projects?keywords=renewable

madewild commented 2 years ago

I was testing the semantic search with my kids and the model is not that bad :)

drvenabili commented 2 years ago

Remind me to bring them stickers for the next time we're at the office 🦕

madewild commented 2 years ago

But "ml" was working before and is not anymore...

drvenabili commented 2 years ago

Perhaps we should play it safer and have a general-purpose model instead. It's hard to specialise without any test set to formally evaluate what we've done. In this case I don't really see why ml is now similar to the Romanian commune of Ohaba for example. If we end up fixing this, we'll break something else.
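Even without a labelled test set, one label-free signal is how much a word's neighbourhood changes between two model versions. The sketch below computes the Jaccard overlap of the two top-n neighbour sets; it is an illustration only. The function names are hypothetical, and the OLD and NEW lookup tables are invented stand-ins for real most_similar calls (only "ohaba" echoes the case mentioned above).

```python
def neighbourhood_overlap(old_most_similar, new_most_similar, word, topn=10):
    """Jaccard overlap (0.0 to 1.0) between the top-n neighbour sets of two models."""
    old = {token for token, _score in old_most_similar(word, topn=topn)}
    new = {token for token, _score in new_most_similar(word, topn=topn)}
    union = old | new
    return len(old & new) / len(union) if union else 1.0


# Invented neighbour lists standing in for two model versions.
OLD = {"ml": [("machine_learning", 0.70), ("deep_learning", 0.60)]}
NEW = {"ml": [("ohaba", 0.60), ("machine_learning", 0.50)]}


def lookup(table):
    """Wrap a lookup table so it can be called like most_similar."""
    return lambda word, topn=10: table[word][:topn]


overlap = neighbourhood_overlap(lookup(OLD), lookup(NEW), "ml")
print(round(overlap, 2))  # 0.33
```

Words with a very low overlap are the ones worth eyeballing after a retrain; the score says that something changed, not whether the change is good.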

AThollard commented 2 years ago

@faustusdotbe for the 'Just transition' it was in the H2020 model. And for me this model was working quite well so far. So I don't know what the state of play is for that; we can discuss tomorrow morning and take the safest approach.

madewild commented 2 years ago

@AThollard yes but H2020 is a model specialized on research and it does not cover all the domains of Kohesio. For instance "guesthouse" does not exist in H2020, nor do concepts related to tourism, travel, etc. That's why we decided with @faustusdotbe and @roberto-musmeci to try to train a model more centered on Kohesio data, but since we don't have that much data (compared to H2020) the robustness is not perfect. It's a trade-off. We can indeed discuss tomorrow, but I would rather not switch models again just before the launch...

drvenabili commented 2 years ago

@AThollard @roberto-musmeci

Have you encountered anything drastically wrong, misleading, or even offensive/insulting in the model? I am building a general-purpose tool to fix as many issues as possible using existing lists of offensive words and Roberto's "kohesio gold truth" list above, but any additional input would be super useful.

AThollard commented 2 years ago

Hi @faustusdotbe, nothing as bold as the 'woman' example last week. However, maybe we could fine-tune the model for the minorities or religions that are quite sensitive, eg: 'roma', 'jews' or 'lgbt' (for example, why do we have 'extreme right' in the corpus for this last one?)

AThollard commented 2 years ago

Same if you search 'homosexuality': you end up with 'salafi'

AThollard commented 2 years ago

For 'sufism' I don't see the link with 'women writers'? We should definitely have something more neutral for all the religions

drvenabili commented 2 years ago

Thanks for these @AThollard !

I should have said that the website would revert to the h2020 model no matter what option was selected, sorry. The h2020 model is still the one used by Kohesio.

I've now pushed a new model (test cleaning) on the machine and it is available through the web interface. The new model is based on the whole Wikipedia and fine-tuned on Kohesio, then cleaned (roughly 8.8k offensive or erroneous words/n-grams are removed). Removing words is easier than modifying them, so for example when searching for woman we do not have the prostitute hit anymore, but we do still have disabled_person -- which is an n-gram I can't remove.
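To make the idea of cleaning by removal concrete, here is a rough sketch that filters a vocabulary against a blocklist. It is not the cleaning tool described above: the plain dict of token to vector, the function name and the placeholder tokens are assumptions, and the rule that an n-gram is dropped when any of its "_"-separated parts is blocked is one possible policy, not necessarily the one used for the model.

```python
def clean_vocabulary(vectors, blocklist):
    """Drop every token that is blocked itself or contains a blocked word.

    `vectors` maps token -> vector; n-grams are assumed to be joined with "_",
    as in the neighbour lists quoted in this thread.
    """
    blocked = {word.lower() for word in blocklist}
    kept = {}
    for token, vector in vectors.items():
        parts = token.lower().split("_")
        if token.lower() in blocked or any(part in blocked for part in parts):
            continue
        kept[token] = vector
    return kept


# Placeholder tokens and tiny vectors, purely for illustration.
vectors = {
    "woman": [0.1, 0.2],
    "badword": [0.3, 0.4],
    "some_badword_phrase": [0.5, 0.6],
    "health": [0.7, 0.8],
}
cleaned = clean_vocabulary(vectors, blocklist=["BadWord"])
print(sorted(cleaned))  # ['health', 'woman']
```

The surviving tokens and vectors would then have to be written back into a model object before they can be served.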

> eg: 'roma', 'jews' or 'lgbt' (for example, why do we have 'extreme right' in the corpus for this last one?)

The algo is built on the notion of similarity, which can be paradigmatic or syntagmatic. I suspect that lgbt and extreme_right appear in the same contexts (probably more syntagmatic than paradigmatic ones), and are therefore considered to be "similar". In the new model this issue with lgbt disappears
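The effect can be reproduced on a toy scale. The sketch below is not word2vec: it builds raw context-count vectors from three invented sentences and compares them with cosine similarity, which is also the measure behind most_similar. The placeholder words alpha, beta and gamma and both function names are made up for illustration.

```python
import math
from collections import Counter


def context_vector(target, sentences, window=2):
    """Count the words appearing within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo, hi = max(0, i - window), i + window + 1
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented sentences: alpha and beta share contexts, gamma does not.
sentences = [
    "the project supports alpha communities",
    "the project supports beta communities",
    "gamma is a railway station",
]
alpha, beta, gamma = (context_vector(w, sentences) for w in ("alpha", "beta", "gamma"))
print(cosine(alpha, beta))   # close to 1: identical contexts
print(cosine(alpha, gamma))  # 0.0: no shared context words
```

Two words that never mean the same thing still end up "similar" as soon as they keep occurring next to the same words.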

Moving forward I will modify the model so that the woman -- disabled issue no longer occurs, remove religions, and then report back