Indexing non-English parliament corpora raises error

lukavdplas commented 4 months ago

What went wrong?

Not sure if I did something wrong here. I tried indexing parliament-sweden-old locally and got this error:

    elasticsearch.BadRequestError: BadRequestError(400, 'mapper_parsing_exception', 'Failed to parse mapping: analyzer [stemmed_en] has not been configured in mappings')

I fixed it by changing the definition of the speech field:

https://github.com/UUDigitalHumanitieslab/I-analyzer/blob/d040118c4cdc477044011bf649e326a83c017315/backend/corpora/parliament/sweden-old.py#L88-L89

To:

    speech = field_defaults.speech()
    speech.extractor = CSV(field='text')
    speech.language = 'sv'
    speech.es_mapping = main_content_mapping(
        token_counts=True, stopword_analysis=True,
        stemming_analysis=True, language=speech.language,
    )

So I think this corpus is just missing its own definition for the mapping (and language) of the speech field? This seems to be true for other parliament corpora too.
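
To make the error concrete, here is a minimal sketch of the mismatch Elasticsearch is complaining about: a field in the mapping names an analyzer that the index settings never define. The helper `missing_analyzers` is hypothetical and not part of I-analyzer, and the analyzer names other than `stemmed_en` are assumptions chosen for illustration.

```python
# A few built-in Elasticsearch analyzers that need no configuration (not exhaustive).
BUILTIN_ANALYZERS = {"standard", "simple", "whitespace", "stop", "keyword"}


def referenced_analyzers(properties):
    """Collect analyzer names used by fields, multi-fields and nested objects."""
    names = set()
    for field in properties.values():
        for key in ("analyzer", "search_analyzer"):
            if key in field:
                names.add(field[key])
        # Multi-fields live under "fields", object fields under "properties".
        names |= referenced_analyzers(field.get("fields", {}))
        names |= referenced_analyzers(field.get("properties", {}))
    return names


def missing_analyzers(settings, mappings):
    """Return analyzers the mapping refers to but the settings do not configure."""
    configured = set(settings.get("analysis", {}).get("analyzer", {}))
    used = referenced_analyzers(mappings.get("properties", {}))
    return used - configured - BUILTIN_ANALYZERS


# Illustrative index body (assumed): Swedish analyzers are configured,
# but the speech field still points at an English one.
settings = {"analysis": {"analyzer": {
    "clean_sv": {"tokenizer": "standard", "filter": ["lowercase"]},
    "stemmed_sv": {"tokenizer": "standard", "filter": ["lowercase"]},
}}}
mappings = {"properties": {"speech": {
    "type": "text",
    "fields": {
        "clean": {"type": "text", "analyzer": "clean_sv"},
        "stemmed": {"type": "text", "analyzer": "stemmed_en"},
    },
}}}

print(missing_analyzers(settings, mappings))  # {'stemmed_en'}
```

A check like this could run before the index is created, so that the mismatch shows up as a readable message rather than a 400 from Elasticsearch.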

What did you expect to happen?

The index operation should run without exceptions.

Where did you find the bug?

a local server

Version

develop (~5.4.0)

Steps to reproduce

1. Configure the backend settings to include the parliament-sweden-old corpus. Add the corpus definition to CORPORA and add any string value for PP_SWEDEN_OLD_DATA.
2. Run yarn django index parliament-sweden-old
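
For the settings step, a local settings fragment might look roughly like the sketch below. It assumes that CORPORA maps a corpus name to the path of its definition file; the base directory and the data value are placeholders, not values from the repository.

```python
import os

# Placeholder: point this at the local backend directory.
BASE_DIR = os.getcwd()

# Assumption: CORPORA maps a corpus name to the path of its definition file.
CORPORA = {
    'parliament-sweden-old': os.path.join(
        BASE_DIR, 'corpora', 'parliament', 'sweden-old.py'
    ),
}

# Any string will do: the error is raised when the mapping is created.
PP_SWEDEN_OLD_DATA = 'placeholder'
```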

BeritJanssen commented 4 months ago

Yes, this is indeed still a to-do on which I got stuck. I have a branch somewhere that applies the new mapping style (with language suffix) to all corpora, but I realized that we can't deploy it unless we reindex all corpora first. I did not know the best solution at the time, and then forgot to flag the problem.

What we could do:

- Apply the new mapping style to all non-English corpora and reindex them.
- Overhaul the mapping style so that only corpora with multiple values in the languages array get the new mapping style.

The second option will be harder for outside developers to understand, I think, but so will the language suffix for the (majority of) corpora that aren't multilingual.
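
The difference between the two options can be sketched as a single naming rule. This is only an illustration: `analyzer_name` and its parameters are hypothetical and do not reflect how I-analyzer actually builds its mappings.

```python
def analyzer_name(base, language, corpus_languages, suffix_only_if_multilingual=False):
    """Return the analyzer name a field would refer to.

    Option 1 (default): always append the language suffix.
    Option 2 (flag set): append it only when the corpus lists several languages.
    """
    if suffix_only_if_multilingual and len(corpus_languages) < 2:
        return base
    return f"{base}_{language}"


# Option 1: every corpus gets a suffix, including monolingual ones.
print(analyzer_name("stemmed", "sv", ["sv"]))  # stemmed_sv

# Option 2: a monolingual corpus keeps the plain name ...
print(analyzer_name("stemmed", "sv", ["sv"], suffix_only_if_multilingual=True))  # stemmed

# ... and only a multilingual corpus gets suffixed names.
print(analyzer_name("stemmed", "de", ["de", "fr"], suffix_only_if_multilingual=True))  # stemmed_de
```

Under option 1 the rule is uniform but every existing index has to be rebuilt; under option 2 monolingual indices stay valid, at the cost of a conditional that readers of the corpus definitions need to know about.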

lukavdplas commented 4 months ago

Ah, I see. I don't think it's high-priority right now, but perhaps we can add a comment in the corpus definitions?

Do you think that choice would have an effect on #992 ?

BeritJanssen commented 4 months ago

No, I don't think so, as the analyzers are defined per corpus. The different language analyzers won't affect the query syntax, as far as I can foresee. Visualizations, however, may be affected. I will have to look at this again and will comment on that issue if I spot any problems.

lukavdplas commented 3 months ago

Hm, actually, I would prefer it if this were fixed sooner rather than later. I index these corpora quite regularly on my local machine for testing. They're now in a weird state where the code does not work but is still supposed to be maintained.
