CUNY-CL / wikipron

Massively multilingual pronunciation mining
Apache License 2.0

Dialect specifier breakage #511

Closed: kylebgorman closed this issue 9 months ago

kylebgorman commented 1 year ago

Investigate this breakage:

    @pytest.mark.skipif(not can_connect_to_wiktionary(), reason="need Internet")
    def test_american_english_dialect_selection():
        # Pick a word for which Wiktionary has dialect-specified pronunciations
        # for both US and non-US English.
        word = "mocha"
        html_session = requests_html.HTMLSession()
        response = html_session.get(
            _PAGE_TEMPLATE.format(word=word), headers=HTTP_HEADERS
        )
        # Construct two configs to demonstrate the US dialect (non-)selection.
        config_only_us = config_factory(key="en", dialect="US | American English")
        config_any_dialect = config_factory(key="en")
        # Apply each config's XPath selector.
        results_only_us = response.html.xpath(config_only_us.pron_xpath_selector)
        results_any_dialect = response.html.xpath(
            config_any_dialect.pron_xpath_selector
        )
>       assert (
            len(results_any_dialect)  # containing both US and non-US results
            > len(results_only_us)  # containing only the US result
            > 0
        )
E       AssertionError: assert 2 > 2
E        +  where 2 = len([<Element 'li' >, <Element 'li' >])
E        +  and   2 = len([<Element 'li' >, <Element 'li' >])

    tests/test_wikipron/test_config.py:202: AssertionError

kylebgorman commented 1 year ago

The failure shows that even with dialect selection set to US | American English, the selector returns all pronunciations. For example, on the page used in the test, we grab both li elements under the Pronunciation header, even though the second does not match the dialect specification.

kylebgorman commented 1 year ago

This is currently blocking #509.

kylebgorman commented 1 year ago

Hi @jacksonllee, sorry to bother you. Any intuitions about what's going on here? I suspect the failure of Latin to grab anything in #509 is related too.

kylebgorman commented 9 months ago

The issue seems to be that the dialect selector expects @class = "ib-content qualifier-content", but the page markup now uses just @class = "ib-content". I'll try updating the selector and report back in a few days.
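
For anyone following along, here is a minimal sketch of how an exact @class comparison could produce the "2 > 2" failure above. It is not wikipron's actual selector. The markup is a hypothetical, simplified stand-in for a Wiktionary pronunciation list, the helper names select_exact and select_by_token are made up for illustration, and the rule that items with no dialect qualifier are kept is an assumption made here to explain why everything came back instead of nothing.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a Wiktionary pronunciation list.
# The pronunciation strings are placeholders, not real transcriptions.
OLD_MARKUP = (
    "<ul>"
    '<li><span class="ib-content qualifier-content">US</span>'
    '<span class="IPA">pron-us</span></li>'
    '<li><span class="ib-content qualifier-content">UK</span>'
    '<span class="IPA">pron-uk</span></li>'
    "</ul>"
)
# Same list after the assumed markup change: the qualifier class is shortened.
NEW_MARKUP = OLD_MARKUP.replace("ib-content qualifier-content", "ib-content")

OLD_CLASS = "ib-content qualifier-content"


def select_exact(markup, class_value, dialect):
    """Keep <li> items using an exact comparison on the whole @class string.

    Assumption: items with no qualifier span at all are kept as well.
    """
    kept = []
    for li in ET.fromstring(markup).findall("li"):
        qualifiers = li.findall(f"span[@class='{class_value}']")
        if not qualifiers or any(q.text == dialect for q in qualifiers):
            kept.append(li)
    return kept


def select_by_token(markup, token, dialect):
    """Keep <li> items by testing for one class token, not the whole string."""
    kept = []
    for li in ET.fromstring(markup).findall("li"):
        qualifiers = [
            s for s in li.findall("span") if token in s.get("class", "").split()
        ]
        if not qualifiers or any(q.text == dialect for q in qualifiers):
            kept.append(li)
    return kept


# Old markup, old selector: only the US item is kept.
print(len(select_exact(OLD_MARKUP, OLD_CLASS, "US")))
# New markup, old selector: no qualifier is recognised, so both items are kept.
print(len(select_exact(NEW_MARKUP, OLD_CLASS, "US")))
# New markup, token-based check: only the US item is kept again.
print(len(select_by_token(NEW_MARKUP, "ib-content", "US")))
```

Under these assumptions, the exact match stops recognising any qualifier once the class string changes, so every item looks unqualified and passes the filter. Matching on the single ib-content token handles both the old and the new class strings, though whether that is the right fix for the real selector still needs checking against live pages.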