concepticon / concepticon-data

The curation repository for the data behind Concepticon.
https://concepticon.clld.org

Some URLs (parts) in Zalizniak-2020-2590 seem to be incorrect #1217

Closed xrotwang closed 5 months ago

xrotwang commented 2 years ago

Triggered by my investigation of the URLs in the Bulakh list (see #1216), I looked at the URLs in Zalizniak-2020-2590. The first one I looked up was shift328, given for concept taste (n.), which seems to be an error. I checked a couple more, and these seemed to be fine.

So, I guess, what's to be done here is

xrotwang commented 2 years ago

We may also want to change the data in this list to a more meaningful format, e.g. modeling semantic shifts as a JSON array of objects.
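A minimal sketch of what such a cell value could look like, and how it would be read back. The field names (`target`, `shift_id`, `direction`) and the values are made up for illustration; they are not the schema used in Concepticon or NoRaRe.

```python
import json

# Hypothetical cell value for one concept: each object describes one semantic
# shift. Field names and values are illustrative assumptions only.
cell = json.dumps([
    {"target": "CONCEPT_B", "shift_id": "shift0001", "direction": "->"},
    {"target": "CONCEPT_C", "shift_id": "shift0002", "direction": "<-"},
])

# Reading the cell back gives a list of dicts, one per shift.
shifts = json.loads(cell)
targets = [shift["target"] for shift in shifts]
outgoing = [shift["shift_id"] for shift in shifts if shift["direction"] == "->"]
```

Compared with a delimiter-separated string, each shift keeps its attributes together, so no positional parsing is needed on the consumer side.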

LinguList commented 2 years ago

How awful, I just saw that the datatype was somehow messed up: https://concepticon.clld.org/values/Zalizniak-2020-2590-1

The idea was to model this (as other network-like datatypes with relations) along these lines: https://calc.hypotheses.org/2617

I am open to JSON, but we'd then need to update some of the tutorials (or write a new one). Specifically, use cases showing how to get a graph out of a dataset like the one by Zalizniak or MultiSimlex (Vulic) would be important.

LinguList commented 2 years ago

I now see: the list representation follows from the fact that I specified the data format in CSVW. It is not an error on the side of the original data, which follows the format I describe in the blog post.

xrotwang commented 2 years ago

Ah, good point. I'll keep an eye on this blog post when proposing a new data model. We'd have to rewrite such material anyway, considering the split between Concepticon and NoRaRe.

LinguList commented 2 years ago

Yes. In fact, given that this was some work that I did in a fashion you could consider "without peer review", as I just set it up and proposed it without alternatives, it is time to get this peer reviewed ;)

LinguList commented 2 years ago

And accordingly enhanced.

xrotwang commented 2 years ago

So, I'd say the data in Concepticon can stay as it is (and hence the blog post stays valid), but a NoRaRe extract of the Zalizniak data would have a JSON representation, making some of the aggregation explained in the blog post unnecessary.

xrotwang commented 2 years ago

@LinguList what do you think about errors in the data? Could these be systematic? Is it worth fixing?

xrotwang commented 2 years ago

@LinguList @AnnikaTjuka and shouldn't the graph of shifts be added to NoRaRe, too?

LinguList commented 2 years ago

That would of course be beautiful. Then we could add this graph, the Bulakh graph, the CLICS graphs, etc. as well!

LinguList commented 2 years ago

We could zip (to save storage space) and use JSON or GML format.
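As a rough illustration of the zipping idea, the sketch below writes a toy graph as JSON into a zip archive and reads it back. The node-link layout and the file name `graph.json` are assumptions, and the archive is kept in memory only to keep the example self-contained.

```python
import io
import json
import zipfile

# Made-up toy graph; the "nodes"/"edges" layout is an assumption.
graph = {
    "nodes": [{"id": "A"}, {"id": "B"}],
    "edges": [{"source": "A", "target": "B"}],
}

# Write the JSON serialization into a compressed archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("graph.json", json.dumps(graph))

# Read it back to confirm that the round trip preserves the data.
with zipfile.ZipFile(buffer) as archive:
    restored = json.loads(archive.read("graph.json"))
```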

LinguList commented 2 years ago

As to the errors: I am afraid they are unsystematic, I don't know what drives their decisions.

LinguList commented 2 years ago

This is how shift 328 looked in December 2020.

LinguList commented 2 years ago

Wayback Machine confirms this.

LinguList commented 2 years ago

As to the format: GML -- though differently interpreted currently -- has the advantage of working with Cytoscape, one of the more important visualization tools for networks (which exports to stable HTML with d3, as well).

xrotwang commented 2 years ago

To construct the full graph from the tabular data, we'd need some assembly anyway, so I'd vote for simple JSON columns in the tabular data, plus a custom cldfbench command (in norare-cldf) which assembles GML graphs for the relevant datasets.
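The assembly step could look roughly like the sketch below. It is not the cldfbench command itself: the column names (`CONCEPT`, `SHIFTS`), the JSON keys, and the rows are hypothetical, and the GML is written by hand with the standard library rather than with a graph package.

```python
import json

# Made-up rows, shaped as csv.DictReader would yield them from a table
# with one JSON column. Column names are assumptions.
ROWS = [
    {"CONCEPT": "A", "SHIFTS": '[{"target": "B", "shift_id": "s1"}]'},
    {"CONCEPT": "B", "SHIFTS": '[{"target": "C", "shift_id": "s2"}]'},
    {"CONCEPT": "C", "SHIFTS": "[]"},
]


def rows_to_gml(rows):
    """Assemble a directed graph from JSON columns and serialize it as GML."""
    ids = {}    # concept label -> integer node id
    edges = []  # (source id, target id, shift id)
    for row in rows:
        src = ids.setdefault(row["CONCEPT"], len(ids))
        for shift in json.loads(row["SHIFTS"]):
            tgt = ids.setdefault(shift["target"], len(ids))
            edges.append((src, tgt, shift["shift_id"]))

    # Labels are written as-is; real data would need quote escaping.
    lines = ["graph [", "  directed 1"]
    for label, idx in ids.items():
        lines += ["  node [", f"    id {idx}", f'    label "{label}"', "  ]"]
    for src, tgt, shift_id in edges:
        lines += [
            "  edge [",
            f"    source {src}",
            f"    target {tgt}",
            f'    label "{shift_id}"',
            "  ]",
        ]
    lines.append("]")
    return "\n".join(lines)


gml = rows_to_gml(ROWS)
```

Keeping the tabular data as the source of truth and generating GML on demand means the graph files never need to be edited by hand.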

chrzyki commented 7 months ago

I've been periodically trying to automatically verify the links and investigate the URL issue, but DatSemShift has consistently been giving me HTTP 504 in the last couple of days. Should this be postponed until later?
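One way to keep such a check from misreporting links while the server is struggling is to separate client errors (the link is wrong) from server errors (try again later). A minimal sketch, assuming HEAD requests are acceptable to the server; the function names, the retry count, and the three result labels are made up here:

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 or 504
    except OSError:
        return None  # DNS failure, timeout, refused connection, ...


def classify(url, fetch=http_status, retries=3, delay=1.0):
    """Classify url as 'ok', 'broken' (4xx) or 'unknown' (5xx or no response)."""
    for attempt in range(retries):
        status = fetch(url)
        if status is not None and status < 400:
            return "ok"
        if status is not None and status < 500:
            return "broken"
        # Server-side trouble such as HTTP 504: wait and retry.
        if attempt < retries - 1:
            time.sleep(delay)
    return "unknown"
```

With this split, a run during an outage would mark every URL as `unknown` instead of `broken`, so the results can be discarded and the check repeated later.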

The other question regarding JSON columns is more or less settled, right?

xrotwang commented 5 months ago

URLs have been removed from the 2020 list.