Closed xrotwang closed 3 years ago
The result of mapping the raw list of all distinct glosses:
Maybe the list of concepts is in this book.
They have only 1133 glosses; we should check this. Mapping 1133 glosses is straightforward, 22xx less so.
Important note, they say on the comparalex homepage:
The “Tanzanian Language Survey” 1,000-word word list was used to elicit data for 100 Eastern Bantu languages. Those data are currently being made available through the Comparative Bantu Online Dictionary project (CBOLD project).
So we have our list here.
@LinguList This is what I parsed:
Here's the code:
from clldutils.dsv import UnicodeWriter
from clldutils.misc import nfilter
from bs4 import BeautifulSoup as bs


def main(html):
    # Yield the cell texts of every row in the concept table.
    html = bs(html, 'html5lib')
    for tr in html.find('table', class_='scroll').find_all('tr'):
        yield [td.text for td in tr.find_all('td')]


if __name__ == '__main__':
    import io
    import sys

    fname, prefix = sys.argv[1:3]
    with io.open(fname, encoding='utf8') as fp:
        # nfilter drops empty rows, i.e. rows without any <td> cells.
        rows = nfilter(main(fp.read()))
    # Append the number of parsed rows to the output name, e.g. Nurse-1975-1133.
    prefix += '-{0}'.format(len(rows))
    with UnicodeWriter(prefix + '.tsv', delimiter='\t') as writer:
        writer.writerow(['ID', 'NUMBER', 'ENGLISH', 'FRENCH', 'CONCEPTICON_ID', 'CONCEPTICON_GLOSS'])
        for row in rows:
            # Turn the first cell into a list-specific ID.
            row[0] = prefix + '-' + row[0]
            writer.writerow(row + ['', ''])
run on the HTML of a saved page as
python <script> <path/to/html> Nurse-1975
We have the same; I just gave the file the wrong name, with "1137" instead of "1133". I did my parse manually back then; yours is much more consistent.
This still leaves us with the question of how to map the actual glosses found in the CBOLD data to the condensed list of 1133. In the data we have "(biting)worms" and "(black)pepper", neither of which can be found in this 1133-item list, although they could probably be matched to Concepticon.
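A rough sketch of how such raw glosses could be compared against the condensed list, in case it helps. The function names `candidates` and `match` are made up for illustration, the plural stripping is deliberately naive, and the two-item list in the usage note is a toy stand-in, not the actual 1133-item list:

```python
import re


def candidates(gloss):
    """Return progressively looser variants of a raw gloss (hypothetical helper)."""
    gloss = gloss.strip().lower()
    # Keep the parenthesised modifier: "(black)pepper" -> "black pepper".
    with_modifier = re.sub(r'\(([^)]*)\)\s*', r'\1 ', gloss).strip()
    # Drop the parenthesised modifier: "(black)pepper" -> "pepper".
    without_modifier = re.sub(r'\([^)]*\)\s*', '', gloss).strip()
    seen = []
    for variant in (gloss, with_modifier, without_modifier):
        # Naive plural handling: also try the form without a final "s".
        singular = variant[:-1] if variant.endswith('s') else variant
        for form in (variant, singular):
            if form and form not in seen:
                seen.append(form)
    return seen


def match(gloss, condensed):
    """Return the first entry of `condensed` matching a variant of `gloss`, else None."""
    lookup = {entry.strip().lower(): entry for entry in condensed}
    for cand in candidates(gloss):
        if cand in lookup:
            return lookup[cand]
    return None
```

With a toy list such as `['worm', 'pepper']`, `match('(biting)worms', ...)` falls back to the bare noun. Anything that returns `None` would still need to be checked by hand or mapped to Concepticon directly.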
This is again the typical mess we always encounter. I'd say: we NEED the PDFs of the original data to compare. Then we can decide: do we treat it as two sources, do we ignore it, do we deal with CBOLD as an indirect copy of the source, or as a source in itself, etc. But we need to find the PDF of the TLS, I'd say, in order to advance on this.
Yeah, I was just thinking the same: as simple as Concepticon is, it's amazing how people could do without it, and why they didn't come up with something consistent before.
and we are still far from being consistent with concepticon...
Pragmatically, I'd say we treat CBOLD's TLS as a separate source, because AFAICT it's the only digital version of the actual word lists. So unless we want to retro-digitize the TLS again, this is what we'd work with in lexibank, right?
Attached is the result of automatically mapping the 1574 glosses found in the FoxPro database:
fine with me!
That "fine with me" referred to an earlier post. Mapping this will take time, and I'd give other data priority. I just spotted that they have things like brother/sister, where the automatic link is to "brother" but should be "sibling", etc. Also, the pronouns are messed up. How can people create so many zipfian distributions in working on something?
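One way to catch cases like brother/sister before refining by hand would be to flag every slash-separated gloss whose automatic link matches only one of the alternatives. This is only a sketch: `alternatives` and `needs_review` are hypothetical names, and the comparison is a plain case-insensitive string match rather than anything Concepticon itself does:

```python
def alternatives(gloss):
    """Split a gloss like "brother/sister" into its lower-cased alternatives."""
    return [part.strip().lower() for part in gloss.split('/') if part.strip()]


def needs_review(gloss, mapped_gloss):
    """Flag a mapping that picked just one alternative of a slash-separated gloss."""
    alts = alternatives(gloss)
    return len(alts) > 1 and mapped_gloss.strip().lower() in alts
```

For example, `needs_review('brother/sister', 'BROTHER')` is true, while a link to `'SIBLING'` or a plain gloss like `'brother'` is left alone.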
The TLS data is still without a concept list, although we have a rudimentarily added list there. Could we map the current list to Concepticon in the name of Nurse etc? The problem is that if we do not do so, any changes in concepticon may require us to check again in the lexibank dataset. So I'd say, adding this for 2.5 of concepticon is something we really must do.
The concept list for the TLS.NursePhillipson1975 dataset published on CBOLD.