cisnlp / GlotLID

GlotLID: Language Identification with Support for More Than 2000 Labels -- EMNLP 2023
https://arxiv.org/abs/2310.16248
Apache License 2.0

Model Output to a 2 character output #1

Closed MokshitSurana closed 4 months ago

MokshitSurana commented 10 months ago

For language translation, most models need a 2-character source language code as input (e.g., en, hi). Is there a way to get that kind of output from the model?

kargaranamir commented 10 months ago

Hi @MokshitSurana,

To make sure I understand your question, could you provide me with an example pair?

kargaranamir commented 10 months ago

My understanding is that you have this as one sentence:

sent = "Hi this weather is good. हाय ये मौसम अच्छा है."

and you want to identify the two languages it contains.

If you are sure the sentence contains two languages, then you can use

model.predict(sent, k=2)

and then retrieve the first two labels. We can also find out which n-grams contributed to each label, because the model is linear. For example, the first set of n-grams contributed to eng_Latn, and the second set of n-grams contributed to hin_Deva, something like this:

[Image: eng-hin-sample]
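
On the original question about 2-character codes: one possible approach is to post-process the labels that predict returns. The following is only a minimal sketch. It assumes labels look like `__label__eng_Latn` (the fastText label prefix followed by a three-letter language code and a script code), and it uses a small hand-written mapping table, `ISO3_TO_ISO1`, which is hypothetical and not part of GlotLID. The helper name `to_two_letter` is also made up for illustration. Many of the languages the model covers have no 2-letter code at all, so the helper returns None for those.

```python
# Hypothetical, hand-written subset of three-letter -> two-letter language codes.
# This table is NOT shipped with GlotLID; extend it for the languages you need.
ISO3_TO_ISO1 = {
    "eng": "en",
    "hin": "hi",
    "deu": "de",
    "fra": "fr",
    "spa": "es",
}


def to_two_letter(label, prefix="__label__"):
    """Map a label such as '__label__eng_Latn' to a 2-letter code, or None."""
    # Drop the fastText label prefix if it is present.
    code = label[len(prefix):] if label.startswith(prefix) else label
    # Keep only the language part before the script suffix (e.g. 'eng' from 'eng_Latn').
    iso3 = code.split("_", 1)[0]
    return ISO3_TO_ISO1.get(iso3)


# Example labels in the shape assumed above (written by hand, not real model output).
labels = ("__label__eng_Latn", "__label__hin_Deva")
two_letter = [to_two_letter(label) for label in labels]
print(two_letter)
```

Labels that are missing from the table map to None, so the caller can decide whether to skip them or fall back to the three-letter code.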