concepticon / concepticon-data

The curation repository for the data behind Concepticon.
https://concepticon.clld.org

Nurse-1975-? #272

Closed xrotwang closed 3 years ago

xrotwang commented 7 years ago

The concept list for the TLS.NursePhillipson1975 dataset published on CBOLD.

xrotwang commented 7 years ago

The result of mapping the raw list of all distinct glosses:

Nurse-1975-x.tsv.txt

xrotwang commented 7 years ago

Maybe the list of concepts is in this book.

LinguList commented 7 years ago

They have only 1133 glosses. We should check this: mapping 1133 glosses is straightforward, mapping the 22xx less so.

Nurse-1975-1137.txt

LinguList commented 7 years ago

Important note, they say on the comparalex homepage:

The “Tanzanian Language Survey” 1,000-word word list was used to elicit data for 100 Eastern Bantu languages. Those data are currently being made available through the Comparative Bantu Online Dictionary project (CBOLD project).

LinguList commented 7 years ago

So we have our list here.

xrotwang commented 7 years ago

@LinguList This is what I parsed:

Nurse-1975-1133.tsv.txt

xrotwang commented 7 years ago

Here's the code:

from clldutils.dsv import UnicodeWriter
from clldutils.misc import nfilter
from bs4 import BeautifulSoup as bs


def main(html):
    """Yield the cell texts of each row of the table with class "scroll"."""
    html = bs(html, 'html5lib')
    for tr in html.find('table', class_='scroll').find_all('tr'):
        yield [td.text for td in tr.find_all('td')]


if __name__ == '__main__':
    import io
    import sys

    # Usage: python <script> <path/to/html> <prefix>
    fname, prefix = sys.argv[1:3]
    with io.open(fname, encoding='utf8') as fp:
        # nfilter drops empty rows, e.g. rows without any <td> cells.
        rows = nfilter(main(fp.read()))
    # Append the number of parsed rows to the prefix, e.g. Nurse-1975-1133.
    prefix += '-{0}'.format(len(rows))
    with UnicodeWriter(prefix + '.tsv', delimiter='\t') as writer:
        writer.writerow(['ID', 'NUMBER', 'ENGLISH', 'FRENCH', 'CONCEPTICON_ID', 'CONCEPTICON_GLOSS'])
        for row in rows:
            # Prefix the first cell to form the ID; leave the two mapping columns empty.
            row[0] = prefix + '-' + row[0]
            writer.writerow(row + ['', ''])

run on the HTML of a saved page as

python <script> <path/to/html> Nurse-1975

LinguList commented 7 years ago

We have the same list; I just gave the file the wrong name, with "1137" instead of "1133". I did my parse manually back then; yours is much more consistent.

xrotwang commented 7 years ago

This still leaves us with the question of how to map the actual glosses found in the CBOLD data to the condensed list of 1133. In the data we have "(biting)worms" and "(black)pepper", both of which can't be found in this 1133 item list, although they probably could be matched to Concepticon.
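
One way to approach glosses like these might be to try a few normalized variants of each gloss before giving up on a match. The following is only a rough sketch, not what was actually run: the function names `candidate_keys` and `match_gloss` are hypothetical, and the example list entries are made up for illustration (I have not checked which forms the 1133 list really contains).

```python
import re


def candidate_keys(gloss):
    """Return lookup keys for a gloss, from most to least specific."""
    gloss = gloss.strip().lower()
    keys = [gloss]
    # "(black)pepper" -> "black pepper" (keep the qualifier as a word)
    with_qualifier = re.sub(r'\(([^)]*)\)\s*', r'\1 ', gloss).strip()
    # "(black)pepper" -> "pepper" (drop the qualifier entirely)
    without_qualifier = re.sub(r'\([^)]*\)\s*', '', gloss).strip()
    for key in (with_qualifier, without_qualifier):
        if key and key not in keys:
            keys.append(key)
    return keys


def match_gloss(gloss, concept_list):
    """Return the first candidate key found in concept_list, or None."""
    known = {entry.strip().lower() for entry in concept_list}
    for key in candidate_keys(gloss):
        if key in known:
            return key
    return None


# Hypothetical excerpt of a condensed list, for illustration only.
example_list = ['pepper', 'worms']
print(match_gloss('(black)pepper', example_list))
print(match_gloss('(biting)worms', example_list))
```

Anything that only matches after dropping the qualifier would still need a manual look, since the qualifier may change which Concepticon concept is meant.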

LinguList commented 7 years ago

This is again the typical mess we always encounter. I'd say: we NEED the PDFs of the original data to compare. Then we can decide: do we treat it as two sources, do we ignore it, do we deal with CBOLD as an indirect copy of the source, or as a source in itself, etc. But we need to find the PDF of the TLS, I'd say, in order to advance on this.

xrotwang commented 7 years ago

Yeah, I was just thinking the same: as simple as Concepticon is, it's amazing that people could do without it and didn't come up with something consistent before.

LinguList commented 7 years ago

and we are still far from being consistent with concepticon...

xrotwang commented 7 years ago

Pragmatically, I'd say we treat CBOLD's TLS as a separate source, because AFAICT it's the only digital version of the actual word lists. So unless we want to retro-digitize the TLS again, this is what we'd work with in lexibank, right?

xrotwang commented 7 years ago

Attached is the result of automatically mapping the 1574 glosses found in the FoxPro database:

Nurse-1975-1574.tsv.txt
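
To see how much manual work is left after an automatic pass, one could list the rows whose mapping is still empty. This is a minimal sketch, assuming the file uses the same tab-separated columns as the header written by the script above (`CONCEPTICON_ID`, `CONCEPTICON_GLOSS`); the helper name `unmapped_rows` and the sample rows are made up for illustration.

```python
import csv
import io


def unmapped_rows(tsv_text):
    """Return the rows of a tab-separated list that have no CONCEPTICON_ID."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t')
    return [row for row in reader if not (row.get('CONCEPTICON_ID') or '').strip()]


# Made-up sample; the IDs and mapping values are placeholders, not real data.
sample = (
    'ID\tNUMBER\tENGLISH\tFRENCH\tCONCEPTICON_ID\tCONCEPTICON_GLOSS\n'
    'X-1\t1\tfirst gloss\tpremiere glose\t123\tSOME CONCEPT\n'
    'X-2\t2\tsecond gloss\tdeuxieme glose\t\t\n'
)
for row in unmapped_rows(sample):
    print(row['ID'], row['ENGLISH'])
```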

LinguList commented 7 years ago

fine with me!

LinguList commented 7 years ago

That "fine with me" was regarding an earlier post. Mapping this will take time, and I'd give other data priority. I just spotted that they have things like brother/sister, where the automatic link is to "brother" but should be "sibling", etc. Also, the pronouns are messed up. How can people create so many Zipfian distributions in working on something?
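
Cases like brother/sister could at least be flagged automatically for review. A minimal sketch, assuming compound glosses are separated by "/" and that the automatic mapping returns a gloss that can be compared case-insensitively; the name `needs_review` is hypothetical:

```python
def needs_review(gloss, mapped_gloss):
    """Flag a compound gloss that was linked to only one of its parts."""
    parts = [part.strip().lower() for part in gloss.split('/') if part.strip()]
    return len(parts) > 1 and mapped_gloss.strip().lower() in parts


print(needs_review('brother/sister', 'BROTHER'))  # linked to one part only
print(needs_review('brother/sister', 'SIBLING'))  # linked to a covering concept
```

This only catches the slash pattern; it says nothing about the pronouns, which would need to be checked by hand.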

LinguList commented 3 years ago

The TLS data is still without a concept list, although we have a rudimentary list added there. Could we map the current list to Concepticon under the name of Nurse etc.? The problem is that if we do not do so, any changes in Concepticon may require us to check the lexibank dataset again. So I'd say adding this for Concepticon 2.5 is something we really must do.