D-PLACE / dplace-data

The data repository for the D-PLACE Project (Database of Places, Language, Culture and Environment)
https://d-place.org
Creative Commons Attribution 4.0 International

Consider shifting to lexibank style format #284

Closed SimonGreenhill closed 11 months ago

SimonGreenhill commented 4 years ago

i.e. with one repo per dataset. This seems to be much more maintainable and extensible.

xrotwang commented 4 years ago

Yes, I agree. And now, with pydplace in its own repo and the aggregated data being pushed to dplace-cldf, this will be a lot simpler. So dplace-data would be, much like CLICS, just a requirements.txt listing the datasets to be aggregated.

xrotwang commented 4 years ago

Oh, actually, dplace-data should just be superseded by dplace-cldf which will include the requirements.txt.

xrotwang commented 4 years ago

hm. Hold on. Where would the society sets live? Should this be the contribution from dplace-data - which would then take on a role like Concepticon - mapping partial society lists to D-PLACE's xd_id?

SimonGreenhill commented 4 years ago

Yeah, I was thinking of this as I was wondering about adding a new dataset. It just seemed wrong to add it to this one.

We could keep this as something like a "DPLACE core" and map via xd_id?

xrotwang commented 4 years ago

We were discussing "Society sets" as a separate entity in D-PLACE anyway - including some description, etc. So although it may seem over-engineered, I quite like the idea of an "Ethnicon" (?). This would also make the somewhat implicit xd_id more transparent. And I think the analogy with Concepticon also carries further: People actually did use society sets, e.g. the Binford one, to collect more datapoints.

SimonGreenhill commented 4 years ago

I guess it'd be more like the equiv. of glottolog so .. ethnilog :) But yes, I think that system makes sense (and we have a well-working exemplar in concepticon/glottolog/lexibank).

kirbykat commented 4 years ago

Hmm, I don't totally understand the proposal, maybe because I am not familiar with Concepticon's structure. Could we have a call to discuss this, before any decisions are made? Here are my thoughts.

I still like the idea of society sets. I am close to being able to contribute two more major society sets: the eHRAF and Global Jukebox society sets. My plan has been to assign cultures/ethnic groups in these new society sets unique society ids (soc_id), and to link them to existing society sets using the xd_id method. I admit that one problem with the new society sets (especially eHRAF) is the existence of a large number of "general" cases (lumped versions of societies that are split in other datasets). This means that there are instances where a single soc_id in these datasets could potentially be mapped to two or more existing xd_ids (i.e., there is a 1:many relationship for soc_ids to xd_ids for the first time). I have developed a system for dealing with this, where basically I split the eHRAF society units into subcases that each map to a single xd_id before importing to D-PLACE, with clear documentation/links back to the lumped society id, but it would definitely be good to discuss this with you.
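As a rough illustration of the splitting approach described above (not the actual D-PLACE import code), a lumped society could be expanded into subcases that each carry exactly one xd_id plus a link back to the lumped unit. The `Subcase` class, the `split_lumped` helper, the id suffix scheme, and all ids below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subcase:
    soc_id: str         # new, unique id for the split unit (hypothetical scheme)
    lumped_soc_id: str  # link back to the original lumped society
    xd_id: str          # the single cross-dataset id this subcase maps to


def split_lumped(lumped_soc_id, xd_ids):
    """Create one subcase per xd_id so that every soc_id maps to exactly one xd_id."""
    return [
        Subcase(f"{lumped_soc_id}-{i}", lumped_soc_id, xd_id)
        for i, xd_id in enumerate(xd_ids, start=1)
    ]


# Placeholder ids, not real eHRAF or D-PLACE identifiers.
subcases = split_lumped("EHRAF_X", ["xd_a", "xd_b"])
```

After splitting, the soc_id to xd_id relationship is 1:1 again, and the lumped unit remains recoverable through `lumped_soc_id`.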

Re: the idea of a cross-dataset ethnic group identifier (if that is what is being proposed?) - the D-PLACE group discussed on multiple occasions the pros and cons of having a sort of "ethnic group" identifier, and decided against it. It may be that this is now a semantic argument, because in a way this is what 'xd_id' does/is, except that we define 'xd_id' as an identifier that can help users identify cultural units across datasets that in SOME CONTEXTS can be considered equivalent. "Context" here depends on the research question, and its sensitivity to differences in time focus, small differences in geographic focus (i.e., location/settlement in which cultural data were collected), and/or differences in ethnographer/observer.

@xrotwang, in cases like the one you describe, where someone has coded more data for societies in the Binford society set, my temptation would be to give these new observations the Binford society id (soc_id) as their unique society identifier. The new data would be considered a new 'dataset', but would be mapped to an existing 'society set'.

One change I would like to see to the structure (and again, maybe this is being suggested), is a single file for linking xd_ids to glottocodes. Individual datasets would list soc_ids and the xd_ids they refer to, but would not link soc_ids to language directly. The advantage of this is that updates to language matches would only have to be applied once (to the master xd_id --> glottocode file), instead of having to be applied within each dataset (i.e., see this recent case: https://github.com/D-PLACE/dplace-data/pull/277, in which xd1396 was changed from leng1262 to nort2971). Because xd1396 appears in two society sets, the change had to be applied separately to the EA and SCCS repos (xd1396 maps to EA: Sh9, SCCS: SCCS182).
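A minimal sketch of that proposed structure, using the ids from the case mentioned above: each dataset only maps soc_id to xd_id, and one master table maps xd_id to glottocode. The dictionary names and the `glottocode_for` helper are illustrative assumptions, not existing D-PLACE code or file formats.

```python
# Per-dataset mappings: soc_id -> xd_id only, no language information.
EA_SOCIETIES = {"Sh9": "xd1396"}
SCCS_SOCIETIES = {"SCCS182": "xd1396"}

# Single master mapping: xd_id -> glottocode (state before the fix).
XD_TO_GLOTTOCODE = {"xd1396": "leng1262"}


def glottocode_for(soc_id, societies, xd_to_glottocode):
    """Resolve a society's language via its xd_id and the master mapping."""
    return xd_to_glottocode[societies[soc_id]]


# The language match is corrected once, in the master mapping ...
XD_TO_GLOTTOCODE["xd1396"] = "nort2971"

# ... and both society sets pick up the change without being edited.
ea_code = glottocode_for("Sh9", EA_SOCIETIES, XD_TO_GLOTTOCODE)
sccs_code = glottocode_for("SCCS182", SCCS_SOCIETIES, XD_TO_GLOTTOCODE)
```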

Final comment: I realize the current system is complicated and a bit opaque to outside users. So, I'm not set against a total re-org, but I am wary of giving up the system without discussion!

xrotwang commented 4 years ago

A call about these questions would be good, yes!

But I'll just illustrate a bit how Concepticon approaches similar issues: There actually is an analogous discussion to the "do we want to create ethnic group identifiers"-issue for Concepticon. Often, people mistake Concepticon for an ontology - i.e. a set of all the things/names/concepts possible. But that's not what Concepticon is. Instead, it is a collection of all concepts that have actually been used to elicit and collect lexical data. So Concepticon is a purely pragmatic device. And this focus on pragmatic usefulness makes some problems a lot easier. E.g. there are wordlists that contain words for FOOT OR LEG and others that were more specific. Now rather than having to decide whether "foot or leg" actually is a thing, Concepticon will just include the concept set "FOOT OR LEG", as well as the more specific ones (and add a relation between these). So "Ethnicon" could include "ethnonyms" for all of the things in eHRAF as well as all the more fine-grained societies and add "narrower/broader" relations. Then

Of course, as soon as ethnic entities get big enough to span multiple languages, mapping ethnonyms to languages gets a bit more difficult. The corresponding data structure in Concepticon is called "concept set metadata" - things like mappings from Concepticon to Wikipedia. But then, such a mapping needn't be defined for all concept sets, so "FOOT OR LEG" probably won't have a match in Wikipedia.
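To make the analogy concrete, here is a small sketch of what an "Ethnicon"-like structure with broader/narrower relations and optional language metadata might look like. This is neither Concepticon's nor D-PLACE's actual data model; the `Ethnicon` class, its methods, and all ids and language codes are hypothetical placeholders.

```python
from collections import defaultdict


class Ethnicon:
    """Hypothetical registry of society entries with broader/narrower links."""

    def __init__(self):
        self._narrower = defaultdict(set)
        # Optional metadata: entries spanning several languages may have none.
        self.glottocodes = {}

    def add_relation(self, broader_id, narrower_id):
        """Record that narrower_id is a more fine-grained unit of broader_id."""
        self._narrower[broader_id].add(narrower_id)

    def narrower(self, entry_id):
        """Return all entries below entry_id, following links transitively."""
        seen = set()
        stack = [entry_id]
        while stack:
            current = stack.pop()
            for child in self._narrower.get(current, ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen


ethnicon = Ethnicon()
# A lumped "general" case gets its own id and is linked to finer-grained
# societies instead of being defined as their union.
ethnicon.add_relation("general_case", "society_a")
ethnicon.add_relation("general_case", "society_b")
ethnicon.glottocodes["society_a"] = "lang_a"  # placeholder language code
```

Here the lumped entry simply has no language mapping, in the same way that "FOOT OR LEG" needn't have a Wikipedia match.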

xrotwang commented 4 years ago

A final note on what Concepticon helps us with: It takes away the burden for each individual dataset to define its relations to the rest of the world of words. Of course, the price for this is that these relations are also somewhat outside of the control of individual datasets. But it turns out that often know-how of what concepts have been used by others trumps know-how about the particular dataset. E.g. often data collectors will say they used the "Swadesh list" but looking at the concepts reveals that they actually used the variant where "burn" is transitive, rather than intransitive ...

kirbykat commented 4 years ago

@xrotwang thanks - the 'FOOT OR LEG' case is a helpful analogy. Your follow-up comment also resonates with my experience. Do I understand right that an Ethnicon identifier would effectively 'collapse' soc_id and xd_id?

In any case, I will think about this a bit more. How about a call next week sometime? Or sooner, if this is something your team is currently working on?

xrotwang commented 4 years ago

There would be one Ethnicon ID for each society described in any dataset. Sometimes it would be possible to reuse IDs, e.g. if a dataset explicitly uses the Binford societies. But the big eHRAF groups would get their own ID rather than being defined as a set of smaller groups. This would make remapping and reevaluation of mappings easier.

SimonGreenhill commented 4 years ago

@kirbykat: No decisions are being made -- just thinking long term here (i.e. I don't want to deal with this until 2021). We've found that the system described above works well and is clearer and easier to maintain/add things (especially as it gets rid of the multiple mappings like the soc_id -> xd_id in many places you mention).

@xrotwang, given the stakeholders here I suspect we should write up a "DPLACE 2.0" structure document?

xrotwang commented 4 years ago

Yes, laying out the ideas in a document would make sense. So maybe a call next week, and a doc then to kick things off?

SimonGreenhill commented 4 years ago

sounds good -- when suits you both?

kirbykat commented 4 years ago

how about Wed. Oct. 7th afternoon (German time) - say, 3:15 pm? (feel free to suggest another time, throwing this out there).

xrotwang commented 3 years ago

@SimonGreenhill so it's 2021 now (for a couple more weeks). Should we deal with this now?

xrotwang commented 11 months ago

D-PLACE datasets now live in their own individual repositories. The ones with a file cldf/societies.csv are the ones which also come with a "society set". Phylogenies have moved to Phlorest.
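Assuming the dataset repositories have been cloned side by side into one local directory (a hypothetical layout), the ones that come with a society set could be listed by checking for cldf/societies.csv. The `datasets_with_society_sets` helper is illustrative only.

```python
from pathlib import Path


def datasets_with_society_sets(root):
    """Return the names of dataset directories under root that contain cldf/societies.csv."""
    root = Path(root)
    return sorted(
        d.name
        for d in root.iterdir()
        if d.is_dir() and (d / "cldf" / "societies.csv").is_file()
    )
```

Calling `datasets_with_society_sets("path/to/clones")` would then return a sorted list of directory names.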