CUNY-CL / wikipron

Massively multilingual pronunciation mining
Apache License 2.0
315 stars · 71 forks

Fix dialect xpath selector #548

Closed · mmcauliffe closed 2 months ago

mmcauliffe commented 2 months ago

It looks like Wiktionary changed their formatting from

<span class="ib-content qualifier-content">
        <a href="https://en.wikipedia.org/wiki/American_English" class="extiw" title="w:American English">US</a>
</span>

to

<span class="ib-content qualifier-content">
    <span class="usage-label-accent">
        <a href="https://en.wikipedia.org/wiki/American_English" class="extiw" title="w:American English">US</a>
    </span>
</span>

i.e., they added another span with the class "usage-label-accent" around the link. I've changed the dialect xpath selector so that it is still based on the "ib-content" span but allows intermediate nodes between that span and the link to the dialect. Alternatively, we could change it to use the new "usage-label-accent" class, but I don't know how reliably that class is used across languages if it's a new thing.
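The difference between the two selector strategies can be sketched with lxml. (The HTML snippets below are the two shapes from this PR description; the XPath expressions are simplified illustrations, not wikipron's actual, longer selector.)

```python
from lxml import etree

# Old markup: the dialect link is a direct child of the "ib-content" span.
OLD = """<span class="ib-content qualifier-content">
  <a href="https://en.wikipedia.org/wiki/American_English">US</a>
</span>"""

# New markup: an extra "usage-label-accent" span now wraps the link.
NEW = """<span class="ib-content qualifier-content">
  <span class="usage-label-accent">
    <a href="https://en.wikipedia.org/wiki/American_English">US</a>
  </span>
</span>"""

# The child axis (/) only matches a link that is an immediate child...
child = '//span[contains(@class, "ib-content")]/a'
# ...whereas the descendant axis (//) tolerates intermediate wrapper nodes.
descendant = '//span[contains(@class, "ib-content")]//a'

for label, html in (("old", OLD), ("new", NEW)):
    tree = etree.fromstring(html)
    print(label, len(tree.xpath(child)), len(tree.xpath(descendant)))
```

The child-axis selector finds the link only in the old markup, while the descendant-axis selector finds it in both, which is why loosening the axis fixes the scrape without depending on the possibly-unstable "usage-label-accent" class.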

Resolves #545

kylebgorman commented 2 months ago

This test isn't passing for me. Here's what I did:

git clone git@github.com:mmcauliffe/wikipron.git
cd wikipron
git checkout fix-ipa-selector
git pull origin fix-ipa-selector
pip install .
pytest -v -k test_american_english_dialect_selection tests/test_wikipron/test_config.py

This prints out (slightly truncated for readability):

tests/test_wikipron/test_config.py::test_american_english_dialect_selection FAILED                                                    [100%]

================================================================= FAILURES ==================================================================
__________________________________________________ test_american_english_dialect_selection __________________________________________________

    @pytest.mark.skipif(not can_connect_to_wiktionary(), reason="need Internet")
    def test_american_english_dialect_selection():
        # Pick a word for which Wiktionary has dialect-specified pronunciations
        # for both US and non-US English.
        word = "mocha"
        html_session = requests_html.HTMLSession()
        response = html_session.get(
            _PAGE_TEMPLATE.format(word=word), headers=HTTP_HEADERS
        )
        # Construct two configs to demonstrate the US dialect (non-)selection.
        config_only_us = config_factory(key="en", dialect="US | American English")
        config_any_dialect = config_factory(key="en")
        # Apply each config's XPath selector.
        results_only_us = response.html.xpath(config_only_us.pron_xpath_selector)
        results_any_dialect = response.html.xpath(
            config_any_dialect.pron_xpath_selector
        )
>       assert (
            len(results_any_dialect)  # containing both US and non-US results
            > len(results_only_us)  # containing only the US result
            > 0
        )
E       AssertionError: assert 2 > 2
E        +  where 2 = len([<Element 'li' >, <Element 'li' >])
E        +  and   2 = len([<Element 'li' >, <Element 'li' >])

tests/test_wikipron/test_config.py:202: AssertionError
kylebgorman commented 2 months ago

After installing I also tried this, which didn't work:

$ wikipron --dialect "US | General American" English | grep aboard
INFO: Language: 'English'
INFO: No cut-off date specified
INFO: Dialect(s): 'US | General American'
INFO: 100 pronunciations scraped
INFO: 200 pronunciations scraped
INFO: 300 pronunciations scraped
INFO: 400 pronunciations scraped
INFO: 500 pronunciations scraped
INFO: 600 pronunciations scraped
INFO: 700 pronunciations scraped
INFO: 800 pronunciations scraped
INFO: 900 pronunciations scraped
aboard  ə ˈb ɔː d
aboard  ə ˈb ɔ ɹ d
...

This is wrong: the former is clearly an RP pronunciation and the latter MAE, but only the latter should have been yielded here.

mmcauliffe commented 2 months ago

Ok, it should be fixed now. I fixed a bug introduced in #513, where selectors were changed to @class = "ib-content qualifier-content" but needed to be contains(@class, "ib-content"). Since = does an exact match on the full class attribute, the count(span[@class = "ib-content"]) = 0 condition was always true and did not filter out any dialects.
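The exact-match vs. contains() distinction can be checked with a toy lxml example (hypothetical markup mimicking Wiktionary's multi-class dialect span; not wikipron's actual selector):

```python
from lxml import etree

# A span carrying two classes, as in Wiktionary's dialect markup.
li = etree.fromstring(
    '<li><span class="ib-content qualifier-content">US</span></li>'
)

# XPath's = operator compares the entire attribute string, so it never
# matches when the element carries additional classes: count is 0.
print(li.xpath('count(span[@class = "ib-content"])'))

# contains() does a substring test and therefore matches: count is 1.
print(li.xpath('count(span[contains(@class, "ib-content")])'))
```

One caveat: contains() is a plain substring test, so it would also match an unrelated class such as "xib-content". The fully robust XPath 1.0 idiom is contains(concat(" ", normalize-space(@class), " "), " ib-content "), though the shorter form is fine when no such colliding class names exist.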

kylebgorman commented 2 months ago

That's it! Both the unit test and my informal "aboard" test are now working.

I believe this is the issue with Twine. I'm going to test this by updating the requirements in your PR...this should allow us to get to the unit test phase of the CI...

kylebgorman commented 2 months ago

> Ok, it should be fixed now. I fixed a bug introduced in #513, where selectors were changed to @class = "ib-content qualifier-content" but needed to be contains(@class, "ib-content"). Since = does an exact match on the full class attribute, the count(span[@class = "ib-content"]) = 0 condition was always true and did not filter out any dialects.

Thanks for figuring out the blame there; I figured it was probably going to be me :)