Closed madewild closed 2 years ago
Try this:
cf. https://github.com/ec-doris/kohesio-search/issues/5
guesthouse is OOV in h2020.
Words similar to "health" include neither "medical" nor "medicine", see https://similarity.cnect.eu/
@faustusdotbe what is the status of this? still in line for next release (10 days left!) or to postpone to January?
Ah sorry -- we actually discussed this with Roberto last week. The model is trained and available at s3://doris-word2vec, namely wiki_300_5_word2vec_SP-kohesio.kv.model and the vectors wiki_300_5_word2vec_SP-kohesio.kv.model.vectors.npy.
Note that this is a KeyedVectors model -- much faster than the Word2Vec one -- and using it will require light code editing (m = gensim.models.KeyedVectors.load(model_path)); the querying should be the same.
OK nice I think @D063520 and @AlyHdr are calling directly the https://similarity.cnect.eu/ service, can you confirm? @faustusdotbe then it would require modifying https://github.com/ec-doris/doris-similarity to allow querying this model (with a special parameter) while ensuring backward compatibility with older (non-KeyedVectors) models...
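One way to keep backward compatibility with the older models could look like the sketch below. It assumes that a full Word2Vec model exposes its vectors under .wv while a KeyedVectors object is queried directly; the helper name as_keyed_vectors is hypothetical, and the stand-in objects only mimic loaded gensim models so the dispatch can be shown without the model files.

```python
from types import SimpleNamespace


def as_keyed_vectors(model):
    """Return the object that answers similarity queries.

    Assumption: a full Word2Vec model keeps its vectors under `.wv`,
    whereas a KeyedVectors object is queried directly, so we fall back
    to the object itself when `.wv` is absent.
    """
    return getattr(model, "wv", model)


# Stand-ins for loaded models (hypothetical, only to show the dispatch).
vectors = SimpleNamespace(most_similar=lambda word: [("healthcare", 0.69)])
old_style = SimpleNamespace(wv=vectors)  # mimics gensim.models.Word2Vec.load(...)
new_style = vectors                      # mimics gensim.models.KeyedVectors.load(...)

# Both kinds of model are queried through the same code path.
print(as_keyed_vectors(old_style).most_similar("health"))
print(as_keyed_vectors(new_style).most_similar("health"))
```

With this, the calling code does not need to know which kind of file was loaded.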
@AlyHdr @D063520
This is now fixed (https://github.com/ec-doris/doris-similarity/commit/f4bb2157d4782cf947b0162ea883ba457445bd2d). You can query the different models in e.g. Python like so, with model_name being either h2020 or kohesio. If nothing is specified, it defaults to h2020.
import requests

# API_URL and model_name are placeholders: set them before running.
response = requests.post(API_URL, {
    "text": "pizza",
    "model": model_name,
})
Note that the kohesio model is a very large model trained on EN Wikipedia and fine-tuned on Kohesio, so despite being loaded in memory it is unfortunately a bit slower than the h2020 one on this machine.
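For callers, the default described above could be mirrored on the client side roughly as follows. This is only a sketch: resolve_model_name and build_payload are hypothetical helpers, and rejecting unknown names with an error is an assumption, not necessarily what the service does.

```python
KNOWN_MODELS = ("h2020", "kohesio")  # model names mentioned in this thread
DEFAULT_MODEL = "h2020"              # used when no model is specified


def resolve_model_name(requested=None):
    """Pick the model to query, falling back to the default when none is given."""
    if not requested:
        return DEFAULT_MODEL
    name = requested.strip().lower()
    if name not in KNOWN_MODELS:
        # Assumption: fail early on the client rather than send an unknown name.
        raise ValueError(f"unknown model {requested!r}, expected one of {KNOWN_MODELS}")
    return name


def build_payload(text, model=None):
    """Build the form fields sent with the POST request."""
    return {"text": text, "model": resolve_model_name(model)}


print(build_payload("pizza"))
print(build_payload("pizza", "kohesio"))
```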
@faustusdotbe I searched "health" with the new model on https://similarity.cnect.eu/# and the results are not so great...
Looks like the Kohesio data introduced issues :
>>> w = gensim.models.Word2Vec.load("wiki_300_5_word2vec.model")
>>> k = gensim.models.Word2Vec.load("wiki_300_5_word2vec_SP-kohesio.model")
>>> w.wv.most_similar("health")
[('mental_health', 0.6961499452590942), ('healthcare', 0.687968373298645), ('wellbeing', 0.6430804133415222), ('nutrition', 0.6340615153312683), ('services_administration_hrsa', 0.6319378018379211), ('chronic_diseases', 0.6305569410324097), ('administration_osha', 0.6259971261024475), ('welfare', 0.6231546401977539), ('cbhep', 0.6225051283836365), ('communicable_diseases', 0.6121703386306763)]
>>> k.wv.most_similar("health")
[('ireps', 0.6490586996078491), ('guidelines_dictated_by_modern_trends', 0.6384396553039551), ('does_not_entail_medical', 0.6316791772842407), ('fight_against_covid_19', 0.6298130750656128), ('interest_stores', 0.6281540989875793), ('mazovia_”.', 0.6262164115905762), ('&_wellness', 0.6206086277961731), ('mobility_,...),_through', 0.619819164276123), ('factors_causing', 0.6184598207473755), ('equipment_catalogue_prepared', 0.6178537011146545)]
I'll retrain with fewer epochs on Kohesio data and see if it fixes it. Do we have a list of words that we need to be correct? For now I have those below, to which I add health. It would be helpful in case several iterations are needed.
truth = {
"ai": ["artificial_intelligence"],
"5g": ["hspda", "umts"],
"4g": ["hspda", "umts"],
}
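A dictionary like this could be turned into a small regression check between retraining rounds. The sketch below is one possible way to do it, not an existing tool: coverage is a hypothetical helper, and the fake lookup stands in for a real model so that the example runs without the model files.

```python
def coverage(truth, neighbours, topn=10):
    """Share of expected neighbours found in the top-n results, per query word.

    `neighbours(word, topn)` must return (token, score) pairs, in the same
    shape as gensim's most_similar output; out-of-vocabulary words are
    expected to raise KeyError and are scored 0.0.
    """
    report = {}
    for word, expected in truth.items():
        if not expected:
            continue  # nothing to check for this word
        try:
            found = {token for token, _ in neighbours(word, topn)}
        except KeyError:
            report[word] = 0.0
            continue
        report[word] = sum(e in found for e in expected) / len(expected)
    return report


# Fake lookup standing in for a real model (scores are illustrative only).
fake = {"ai": [("artificial_intelligence", 0.62), ("machine_learning", 0.61)]}


def fake_neighbours(word, topn):
    return fake[word][:topn]


truth = {"ai": ["artificial_intelligence"], "5g": ["hspda", "umts"]}
print(coverage(truth, fake_neighbours))
```

With a loaded model, the lookup could be passed as lambda w, n: k.wv.most_similar(w, topn=n), so the same word list can be rerun after every retraining.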
@roberto-musmeci could you provide a few good words to help testing the model?
It's quite tricky of course as, e.g., specialising on kohesio does help with "ai":
>>> w.wv.most_similar("ai")
[('yobundaze', 0.6389608383178711), ('somnium_files', 0.6207020878791809), ('torimodose', 0.6126745343208313), ('tegotae', 0.6041201949119568), ('nikushimi', 0.5994518995285034), ('shitteita', 0.5938058495521545), ('tsutsumarete', 0.5897074937820435), ('afures', 0.5876941084861755), ('rukeichi', 0.5853627324104309), ('tsukarete', 0.585219144821167)]
>>> k.wv.most_similar("ai")
[('artificial_intelligence', 0.6192606091499329), ('machine_learning', 0.6147264838218689), ('big_data_analytics', 0.6080207228660583), ('passing_relays', 0.5986993312835693), ('machine_learning_techniques', 0.5877189040184021), ('edge_computing', 0.5799916982650757), ('predictive_algorithms', 0.5776605606079102), ('based_on_artificial_intelligence', 0.5766475200653076), ('expanded_reality', 0.5765467286109924), ('artificial_intelligence_ai', 0.5751122832298279)]
(Results in vanilla wiki refer to Japanese rock band Sambomaster, with song Sekai wa sore Ai to Yobunda ze which apparently is the theme song of anime Naruto... )
Just to understand, guys: would you like a list of words that work, or of words that do not work, in the current Kohesio model?
some that SHOULD work ;) just some interesting words that people are likely to use...
Ideally, a list of words that are very important to Kohesio as a whole and for which synonyms ought to be correct in vector space. I was thinking of your rehabilitation example, for instance. If you can provide some synonyms you expect for interesting words, that would be perfect, e.g.:
ai must/should be in the neighbourhood of artificial_intelligence, machine_learning, etc.
5g must/should be in the neighbourhood of hspda, umts
Ok, perfect. @faustusdotbe do you already have a list of the most frequent English words in the Kohesio corpus? Otherwise I'll create it and start from there.
The counter is running, I'll update this comment with the file and ping you once it's done :-)
I'll write some suggestions here and keep expanding them over time:
Transportation: bridge, tram, tramway, railway, highway, road, micro_mobility, green_mobility, railway station, terminal, bus, bikes, bike, sharing, tunnel
Technology: ai, machine learning, data science, quantum computing, research and development, smart grid, district heating, electricity grid, broadband, wi-fi, ITC, energy efficiency, Digital animation, 3D animation, web design
Industrial sectors and products: retail, telecommunication, wood, water management, waste management, sewage, utilities, hydrogen, gas, gas plant, carbon plant, solar panels, wind turbines, hotellerie, restaurants, cafe, tourism attractions, hotels, wellness centre, mining, quarrying, transportation, communication, ITC, refrigeration
Support to economic actors: entrepreneurship, internationalisation
Culture and art: museum, architecture, design, gallery, mosaics, frescos, restoration, castles, cultural heritages, old town, tourism animation
Labour markets (ESF mostly): internships, traineeship, mobile workers, training, migrant workers, jobs, vacancies, workers, young worker, job seeker, NEET, unemployed, employees, under employed, salary, part-time, full-time, maternity leave, parental leave, sick leave, vocational training, vocational centre, professional course, people with disability, disabled people, job opportunity, entry-level, education opportunity, re-skilling, up-skilling, risk of unemployment, staff, long-term unemployed, inactive, Public Employment Centre, public benefits, unemployment benefit, professional activation
Social policies: inclusion, over-indebtedness, financial difficulties, fraud, abuse, support network
@roberto-musmeci
kohesio_counts_min100.csv
Here's a tab-separated CSV, minimum frequency threshold is 100
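A file like this could be loaded with the standard csv module. The sketch below assumes two columns per row (word, then count) and an optional header line; the actual layout of kohesio_counts_min100.csv is not shown in this thread, and the sample rows are invented for illustration.

```python
import csv
import io


def read_counts(handle, min_count=100):
    """Read word/count pairs from a tab-separated file-like object."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip the header and any malformed row (assumed layout: word<TAB>count).
        if len(row) != 2 or not row[1].isdigit():
            continue
        if int(row[1]) >= min_count:
            counts[row[0]] = int(row[1])
    return counts


# Invented sample rows, not taken from the real file.
sample = io.StringIO("word\tcount\nproject\t5230\nenergy\t812\nrare\t12\n")
print(read_counts(sample))
```

For the real file, the same function could be called on open("kohesio_counts_min100.csv", newline="", encoding="utf-8").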
It would be interesting to have a matrix count of words by country.
It seems that certain words are used more frequently in one country or programme and tend to have a more specific meaning in that context. The word animation in French is more of a synonym for launching something (e.g. an initiative/project).
Similarly, warming is mostly used in Poland in the context of thermo modernisation and energy efficiency.
I can make this, but I wonder whether our assumptions about language will hold given everything is machine-translated.
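A word-by-country count matrix could be built along these lines. This is a rough sketch that assumes each project is available as a (country, text) pair and that whitespace tokenisation is good enough; the mini-corpus is invented and counts_by_country is a hypothetical helper.

```python
from collections import Counter, defaultdict


def counts_by_country(projects):
    """Map each word to a Counter of country -> number of occurrences."""
    matrix = defaultdict(Counter)
    for country, text in projects:
        for word in text.lower().split():
            matrix[word][country] += 1
    return matrix


# Invented mini-corpus, only to show the shape of the result.
projects = [
    ("PL", "warming of the school building"),
    ("PL", "warming and energy efficiency"),
    ("FR", "animation of the local network"),
]
matrix = counts_by_country(projects)
print(dict(matrix["warming"]))
print(dict(matrix["animation"]))
print(dict(matrix["of"]))
```

Words whose counts are concentrated in a single country would then be candidates for a country-specific meaning.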
Should be fixed now, please reopen if issues arise.
https://kohesio.eu/projects?keywords=youth seems strange... reopening just in case if there is a fix ;)
this seems related to the overwhelming number of small Italian projects... for instance https://kohesio.eu/projects?keywords=%22it%20lasts%206%20months%22 maybe when we have cleaned out Italy we should retrain the gensim model
The case of 'renewable' should also be investigated further later on. In this case, it seems related to a number of French projects. https://dev.kohesio.eu/projects?keywords=renewable
I was testing the semantic search with my kids and the model is not that bad :)
Remind me to bring them stickers for the next time we're at the office 🦕
But "ml" was working before and not anymore...
Perhaps we should play it safer and have a general-purpose model instead. It's hard to specialise without any test set to formally evaluate what we've done. In this case I don't really see why ml is now similar to the Romanian commune of Ohaba, for example. If we end up fixing this, we'll break something else.
@faustusdotbe 'Just transition' was in the H2020 model, and for me that model was working quite well so far. So I don't know what the state of play is on that; we can discuss tomorrow morning and take the safest approach.
@AThollard yes, but H2020 is a model specialised in research and it does not cover all the domains of Kohesio. For instance "guesthouse" does not exist in H2020, nor do concepts related to tourism, travel, etc. That's why we decided with @faustusdotbe and @roberto-musmeci to try to train a model more centred on Kohesio data, but since we don't have that much data (compared to H2020) the robustness is not perfect. It's a trade-off. We can indeed discuss tomorrow, but I would rather not switch models again just before the launch...
@AThollard @roberto-musmeci
Have you encountered anything drastically wrong, misleading, or even offensive/insulting in the model? I am building a general-purpose tool to fix as many issues as possible using existing lists of offensive words and Roberto's list of "kohesio gold truth" hereabove, but any additional input would be super useful.
Hi @faustusdotbe, nothing as bold as the 'woman' example last week. However, maybe we could fine-tune the model for minorities or religions, which are quite sensitive, e.g. 'roma', 'jews' or 'lgbt' (for example, why do we have 'extreme right' in the corpus for this last one?).
Likewise, if you search 'homosexuality' you end up with 'salafi'.
For 'sufism', I don't see the link with 'women writers'. We should definitely have something more neutral for all the religions.
Thanks for these @AThollard !
I should have said that the website would revert to the h2020 model no matter what option was selected, sorry. The h2020 model is still the one used by Kohesio.
I've now pushed a new model (test cleaning) on the machine and it is available through the web interface. The new model is based on the whole Wikipedia and fine-tuned on Kohesio, then cleaned (roughly 8.8k offensive or erroneous words/n-grams are removed).
Removing words is easier than modifying them, so for example when searching for woman we no longer get the prostitute hit, but we do still have disabled_person -- which is an n-gram I can't remove.
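For n-grams that cannot be removed from the model itself, a different option would be to filter results at query time. The sketch below illustrates that alternative technique, not the cleaning applied to the model; filter_neighbours is a hypothetical helper, and the tokens and scores are neutral, invented examples.

```python
def filter_neighbours(neighbours, blocklist):
    """Drop results that are blocked tokens or n-grams containing a blocked part."""
    blocked = {word.lower() for word in blocklist}
    kept = []
    for token, score in neighbours:
        parts = token.lower().split("_")  # n-grams are joined with underscores
        if token.lower() in blocked or any(part in blocked for part in parts):
            continue
        kept.append((token, score))
    return kept


# Invented results in the (token, score) shape returned by most_similar.
results = [("ham", 0.70), ("spam_filter", 0.65), ("Spam", 0.61), ("eggs", 0.59)]
print(filter_neighbours(results, {"spam"}))
```

The trade-off is that filtering on parts is blunt: it also hides harmless n-grams that merely contain a blocked word.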
eg: 'roma' 'jews' or 'lgbt' (for example why do we have 'extreme right' in the corpus for this last one'?
The algo is built on the notion of similarity, which can be paradigmatic or syntagmatic. I suspect that lgbt and extreme_right appear in the same contexts (probably more syntagmatic than paradigmatic ones), and are therefore considered to be "similar". In the new model, this issue disappears for lgbt.
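To make the "same contexts" point concrete, here is a toy illustration using raw co-occurrence counts and cosine similarity. This is deliberately not word2vec, only a simplified stand-in for the same distributional idea; the sentences are invented and the window size of 2 is an arbitrary assumption.

```python
import math
from collections import Counter


def context_vectors(sentences, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Invented sentences: "tram" and "bus" occur in identical contexts.
sentences = [
    "the tram stops near the station",
    "the bus stops near the station",
    "the museum shows old frescos",
]
vecs = context_vectors(sentences)
print(cosine(vecs["tram"], vecs["bus"]))     # shared contexts give a high score
print(cosine(vecs["tram"], vecs["museum"]))  # little overlap gives a lower score
```

Two words score as similar as soon as their surroundings overlap, whether or not a human would call them related, which is how unrelated or even opposed terms can end up as neighbours.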
Moving forward I will modify the model so that the woman -- disabled issue no longer occurs, and remove religions, then report back.
On https://dev.kohesio.eu/projects?keywords=ai we have a strange expanded keyword "( ai )-". Also, "artificial intelligent" is strange. Can we tweak the model to remove this noise?
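Malformed expansions such as "( ai )-" could be dropped before they are shown, for instance with a pattern check like the one sketched below. The rule used here (letters or digits joined by single underscores or hyphens) is an assumption about what a clean token looks like, and clean_expansions is a hypothetical helper; the sample tokens come from results quoted earlier in this thread, plus one invented hyphenated token.

```python
import re

# Assumed rule: runs of letters/digits joined by single "_" or "-".
CLEAN_TOKEN = re.compile(r"[^\W_]+(?:[_-][^\W_]+)*")


def clean_expansions(tokens):
    """Keep only the tokens that fully match the assumed clean-token pattern."""
    return [token for token in tokens if CLEAN_TOKEN.fullmatch(token)]


tokens = [
    "artificial_intelligence",
    "( ai )-",
    "machine_learning",
    "&_wellness",
    "mazovia_”.",
    "covid-19",
]
print(clean_expansions(tokens))
```

A check like this only removes tokenisation debris; a well-formed but odd expansion such as "artificial intelligent" would need a different fix, for example an explicit exclusion list.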