CAMeL-Lab / camel_tools

A suite of Arabic natural language processing tools developed by the CAMeL Lab at New York University Abu Dhabi.

[QUESTION] Why is the output from the dialect ID system different from the ADIDA online interface? #141

Open fadhleryani opened 7 months ago

fadhleryani commented 7 months ago

camel_tools 1.5.2 on macOS 14.1.1

Using one of the preloaded example sentences in the ADIDA interface, for instance "بدي دوب قلي قلي بجنون بحبك انا مجنون ما بنسى حبك يوم", I get a score of 95.9% for Beirut. When I predict the same sentence with camel_tools, I get a different result. For example, using model26, which I assume is the same model ADIDA uses:

```python
from camel_tools.dialectid import DIDModel26

did = DIDModel26.pretrained()
did.predict(['بدي دوب قلي قلي بجنون بحبك انا مجنون ما بنسى حبك يوم'])
```

I get the following scores:

```
[DIDPred(top='ALE', scores={'ALE': 0.2744463749182225, 'ALG': 0.0019964477414507772, 'ALX': 0.0017124356871910278, 'AMM': 0.04793813798943018, ...
```

Similarly, using model6, I also get different and lower scores than the online interface (though at least the top dialect is correct):

```python
from camel_tools.dialectid import DIDModel6

did = DIDModel6.pretrained()
did.predict(['بدي دوب قلي قلي بجنون بحبك انا مجنون ما بنسى حبك يوم'])
```

I get the following scores:

```
[DIDPred(top='BEI', scores={'BEI': 0.5475092868164938, 'CAI': 0.05423997031019218, 'DOH': 0.018378809169102468, 'MSA': 0.003793013408907513, 'RAB': 0.0018751946461352397, 'TUN': 0.37420372564916876})]
```
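
For easier side-by-side comparison with the ADIDA percentages, here is a minimal sketch that sorts and prints the prediction scores as percentages. It assumes only what the output above shows, namely that each `DIDPred` carries a `top` label and a `scores` dict mapping dialect codes to probabilities:

```python
# Minimal sketch: print DIDPred scores as sorted percentages so they can be
# compared directly with the ADIDA interface's 95.9% for Beirut.
# Assumes only what the output above shows: DIDPred has a `top` label and a
# `scores` dict of dialect-code -> probability.
from camel_tools.dialectid import DIDModel6

did = DIDModel6.pretrained()
pred = did.predict(['بدي دوب قلي قلي بجنون بحبك انا مجنون ما بنسى حبك يوم'])[0]

print(f'Top dialect: {pred.top}')
for dialect, score in sorted(pred.scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{dialect}: {score:.1%}')
```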