EEXCESS / c4

C4 - Cultural and sCientific Content in Context - is the EEXCESS context detection framework written in JavaScript. It provides supporting functionality to enable easy user mining and querying for all EEXCESS clients. It supports, for example, Named Entity Recognition using the DOSeR Service, paragraph detection, Citation Building etc.
http://eexcess.eu/

detect language before NER #11

Open schloett opened 8 years ago

schloett commented 8 years ago

The NER-service supports NER for English and German (if no language definition is there, it assumes English). Detect the language client-side and inform the NER-service about it. To request NER for a specific language, the format looks like:

{
    "paragraphs" : [<the paragraphs>],
    "language" : "de"
}

However, the quality for German NER seems to be far below the English one.
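A minimal sketch of how a client might assemble this request once the page language is known. The function name `buildNerRequest` and the `SUPPORTED` list are assumptions for illustration; only the `paragraphs` and `language` fields come from the format above, and the paragraphs are passed through untouched because their exact structure is not specified here.

```javascript
// Hypothetical helper: the function name and the SUPPORTED list are
// assumptions for illustration, not part of the C4 or NER-service API.
const SUPPORTED = ["en", "de"];

function buildNerRequest(paragraphs, detectedLanguage) {
  // Unsupported language: return null so the caller can fall back
  // to some other keyword extraction strategy.
  if (!SUPPORTED.includes(detectedLanguage)) {
    return null;
  }
  // Same shape as the request format described above.
  return { paragraphs: paragraphs, language: detectedLanguage };
}

console.log(JSON.stringify(buildNerRequest(["Ein Absatz."], "de")));
```

Returning `null` instead of silently omitting the `language` field avoids sending, say, Dutch text that the service would then treat as English.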

schloett commented 8 years ago

language is detected in v4.3.0

chseifert commented 8 years ago

What languages are detected? Users reported receiving Dutch results (Rijksmuseum) while on a German page. Is there only "de", "en" and "others"? Re-opened, though this might just be a point for discussion.

schloett commented 8 years ago

Unfortunately, the piece of information relevant to this issue is missing: What is the language of the extracted keywords (i.e. the query)?

The language detection in C4 is used to select the strategy for keyword extraction (atm calling the NER-service for English and applying some heuristic for all other languages). Therefore, the extracted keywords must be in English (in case the language has been detected as English, the keywords returned from the NER-service are English) or in the language of the page. Otherwise we would have invented a magic heuristic which translates German to Dutch. Imho, the returned Dutch results are beyond the scope of C4 and are the responsibility of the federated recommender.

For the sake of completeness, the list of possibly detectable languages:

"ab": "Abkhazian", "af": "Afrikaans", "ar": "Arabic", "az": "Azeri", "be": "Belarusian", "bg": "Bulgarian", "bn": "Bengali", "bo": "Tibetan", "br": "Breton", "ca": "Catalan", "ceb": "Cebuano", "cs": "Czech", "cy": "Welsh", "da": "Danish", "de": "German", "el": "Greek", "en": "English", "eo": "Esperanto", "es": "Spanish", "et": "Estonian", "eu": "Basque", "fa": "Farsi", "fi": "Finnish", "fo": "Faroese", "fr": "French", "fy": "Frisian", "gd": "Scots Gaelic", "gl": "Galician", "gu": "Gujarati", "ha": "Hausa", "haw": "Hawaiian", "he": "Hebrew", "hi": "Hindi", "hmn": "Pahawh Hmong", "hr": "Croatian", "hu": "Hungarian", "hy": "Armenian", "id": "Indonesian", "is": "Icelandic", "it": "Italian", "ja": "Japanese", "ka": "Georgian", "kk": "Kazakh", "km": "Cambodian", "ko": "Korean", "ku": "Kurdish", "ky": "Kyrgyz", "la": "Latin", "lt": "Lithuanian", "lv": "Latvian", "mg": "Malagasy", "mk": "Macedonian", "ml": "Malayalam", "mn": "Mongolian", "mr": "Marathi", "ms": "Malay", "nd": "Ndebele", "ne": "Nepali", "nl": "Dutch", "nn": "Nynorsk", "no": "Norwegian", "nso": "Sepedi", "pa": "Punjabi", "pl": "Polish", "ps": "Pashto", "pt": "Portuguese", "pt-PT": "Portuguese (Portugal)", "pt-BR": "Portuguese (Brazil)", "ro": "Romanian", "ru": "Russian", "sa": "Sanskrit", "bs": "Serbo-Croatian", "sk": "Slovak", "sl": "Slovene", "so": "Somali", "sq": "Albanian", "sr": "Serbian", "sv": "Swedish", "sw": "Swahili", "ta": "Tamil", "te": "Telugu", "th": "Thai", "tl": "Tagalog", "tlh": "Klingon", "tn": "Setswana", "tr": "Turkish", "ts": "Tsonga", "tw": "Twi", "uk": "Ukrainian", "ur": "Urdu", "uz": "Uzbek", "ve": "Venda", "vi": "Vietnamese", "xh": "Xhosa", "zh": "Chinese", "zh-TW": "Traditional Chinese (Taiwan)", "zu": "Zulu"
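The strategy selection described in this comment could look roughly like the sketch below. All names (`extractKeywords`, `demoStrategies`) are hypothetical, the NER strategy is a stub that makes no network call, and the "long words" heuristic is only a placeholder, since the thread does not say which heuristic C4 actually applies.

```javascript
// Hypothetical sketch; the real C4 function names and heuristic may differ.
function extractKeywords(text, language, strategies) {
  // English goes to the NER-service, every other language to the heuristic.
  const strategy = language === "en" ? strategies.ner : strategies.heuristic;
  return strategy(text);
}

const demoStrategies = {
  // Stub standing in for the call to the NER-service.
  ner: (text) => ["<keywords from NER-service>"],
  // Placeholder heuristic (an assumption): unique words longer than 7 letters.
  heuristic: (text) => {
    const words = text.split(/[^\p{L}]+/u).filter((w) => w.length > 7);
    return [...new Set(words)];
  },
};

console.log(extractKeywords("Das Rijksmuseum in Amsterdam zeigt Malerei.", "de", demoStrategies));
```

Whichever heuristic is used, the keywords are taken from the page text itself, so they stay in the language of the page; nothing in this step translates them.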

chseifert commented 8 years ago

Just to be completely sure: The language is also sent in the user profile (or just used to parametrize the keyword selection)?

schloett commented 8 years ago

The language is only used to parametrize the keyword selection. Languages in the user profile are only sent, if the user specifies and discloses languages in the user profile settings.
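As a rough illustration of that rule, assuming hypothetical settings fields `languages` and `discloseLanguages` (the real profile format is not shown in this thread):

```javascript
// Hypothetical sketch: field names are assumptions, not the EEXCESS profile format.
function profileLanguagePart(settings) {
  const hasLanguages = Array.isArray(settings.languages) && settings.languages.length > 0;
  // Languages are sent only if the user specified them AND chose to disclose them.
  if (hasLanguages && settings.discloseLanguages === true) {
    return { languages: settings.languages.slice() };
  }
  // The detected page language is never added here.
  return {};
}

console.log(JSON.stringify(profileLanguagePart({ languages: ["de", "en"], discloseLanguages: true })));
console.log(JSON.stringify(profileLanguagePart({ languages: ["de", "en"], discloseLanguages: false })));
```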

chseifert commented 8 years ago

Hmm, this seems to be annoying for users - not to have the content mapped automatically to the web page. I am adding Thomas to discuss the privacy issues of sending the detected language. Programmatically this should not be a big issue, as far as I understand.

ThomasCerq commented 8 years ago

I'm not sure I understand the issue...

As far as I understand:

I guess we could try to combine both pieces of information: if the web page is in English, then the recommendations should be in English too. But it would also make sense to me if recommendations in Spanish were sent (if the user explicitly said she understands Spanish).

Does it make sense?

ThomasCerq commented 8 years ago

Just FYI, in the user profile UI, only European languages are included. But happy to change it if needed.

chseifert commented 8 years ago

Thanks for your answer.

Yes, it makes sense. Language in User Profile means "the languages I speak and understand" - The language we detect is the language of the page (also presumably "a language I understand"). Both can be used for source selection. My only concern is that

As for the process of "keywords in a specific language" to "results in the specific language" without an explicit language flag sent to the recommender - I don't think this works. What they have implemented (according to deliverable D3.3.) is the source selection based on language when they receive an explicit flag.

I am not sure how to solve this.

schloett commented 8 years ago

From my point of view, it is an issue of the recommender to provide the "correct" results. On the client side, we could only ease the burden of the recommender. "Correct" results would be either in one of the languages disclosed in the user profile (if any) or in the language of the page. If the user discloses the languages of the profile, we send them anyway. I agree with you that it is unlikely that a user provides this information, hence what remains is the language of the page. This information is already encoded in the query, and the recommender might detect the language of the query and filter / rank the results accordingly.

When such a mechanism is in place at the recommender, I am perfectly fine with providing language information about the current page, in order to skip the "detect language of the query" step. However, as long as no such mechanism is in place, I would not change the profile again to send information which is never used.

Furthermore, to provide correct results, imho there are other modifications necessary between federated recommender and partner recommenders:

[image: lang]

In this example, the language is unknown for all items from DDB and Mendeley, while the language of all items from ZBW is German, which is obviously not correct for the first result.
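To make the "filter / rank the results accordingly" idea concrete, here is a minimal sketch of what such a mechanism might do. It is not the federated recommender's implementation; the function names, the `language` field on result items and the sample items are all made up for illustration.

```javascript
// Hypothetical sketch of language-aware ranking; not the recommender's actual code.
function acceptableLanguages(profileLanguages, pageLanguage) {
  // Disclosed profile languages win; otherwise fall back to the page language.
  if (Array.isArray(profileLanguages) && profileLanguages.length > 0) {
    return profileLanguages;
  }
  return pageLanguage ? [pageLanguage] : [];
}

function rankByLanguage(results, languages) {
  // 0 = acceptable language, 1 = language unknown, 2 = other language.
  const score = (item) => {
    if (!item.language) return 1;
    return languages.includes(item.language) ? 0 : 2;
  };
  // Keep the original order within each group.
  return results
    .map((item, index) => ({ item, index }))
    .sort((a, b) => score(a.item) - score(b.item) || a.index - b.index)
    .map((entry) => entry.item);
}

// Made-up sample items: one Dutch, one without language information, one German.
const sample = [{ id: "a", language: "nl" }, { id: "b" }, { id: "c", language: "de" }];
console.log(rankByLanguage(sample, acceptableLanguages([], "de")).map((r) => r.id));
```

Items with unknown language are ranked between matching and non-matching ones rather than dropped, since partners often do not report a language at all.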

chseifert commented 8 years ago

Jörg, you are right, with a) we can only ease the burden of the recommender and b) should not employ API changes unless we are sure the information will be used. I am just taking up user comments and trying to find out i) whether and ii) how we could tackle them.

Following the discussion I come to the conclusion that

Potential (simple) improvements would be

ThomasCerq commented 8 years ago

Following the discussion I come to the conclusion that

Potential (simple) improvements would be

schloett commented 8 years ago

"Could we have a kinda of screen tip when there are several languages in the results? It would say something like "Results are in english, german and spanish. If you don't understand some of them, you can set the languages you understand in the user profile UI". (Should be displayed only the first time or from time to time.)"

In principle, this should be possible, but I am not sure about the benefits. In particular, I don't know how strict the source selection based on languages in the user profile is. In addition, as far as I know, many items in Europeana are "multi-lingual", and hence, even if the source selection is strict, strange results (from the user's POV) might be returned. For example, English is set in the user profile and a multi-lingual result is returned, but it has a title in Portuguese. @hziak could you shed some light on this?

ThomasCerq commented 8 years ago

If we can't link users' languages, then maybe there's no point at all in including this feature in the user profile UI...

schloett commented 8 years ago

Sry, I'm not saying that we cannot link them at all, but I assume that it won't work in each and every case (depending on the partner). Maybe we should just phrase "Results are in english, german and spanish. If you don't understand some of them, you can set the languages you understand in the user profile UI" a bit more loosely and make it transparent to the user that we will do our best, based on the information provided, but cannot give guarantees. (When reading the aforementioned notification, as a user I would expect to get only results in the languages I've specified.)

ThomasCerq commented 8 years ago

I wasn't saying that either ;-) But it would be very confusing for a user who properly set the languages to get results in a language she doesn't understand. But you're right, the solution would be to say "give us your preferences and we'll do our best"
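For reference, the screen tip with the softer wording could be sketched like this. The function name, the `language` field on result items and the `tipAlreadyShown` flag are assumptions; where that flag would be persisted is left open.

```javascript
// Hypothetical sketch of the screen tip logic; names and fields are assumptions.
function languageTip(results, tipAlreadyShown) {
  // Distinct languages among the results, ignoring items without language info.
  const languages = [...new Set(results.map((r) => r.language).filter(Boolean))];
  // Show the tip only once, and only when results span several languages.
  if (tipAlreadyShown || languages.length < 2) {
    return null;
  }
  return "Results are in " + languages.join(", ") + ". " +
    "You can set the languages you understand in the user profile, " +
    "and we will do our best to respect them.";
}

console.log(languageTip([{ language: "en" }, { language: "de" }, {}], false));
```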

schloett commented 8 years ago

I've included a hint on the post-install help page ("On the profile settings page, you can provide information about you, in order to tailor the retrieval process of EEXCESS towards your needs."). I would postpone the screen tip when there are results in several languages to a later point in time, because currently, providing a language in the profile seems to produce results opposite to the intended behavior (see https://github.com/EEXCESS/recommender/issues/29).