cldf-datasets / doreco

CLDF dataset derived from DoReCo's core corpus
https://doreco.info/

Create languages.tsv #2

Closed FredericBlum closed 2 years ago

FredericBlum commented 2 years ago

What information can or should we put in the languages.tsv file?

From the DoReCo mainpage, we have the following options:

I would like to add at least the citation key for the individual corpora and the information about glossing. This could make it easier to filter for specific studies, and the citation key (hopefully) ensures that people who use the corpus cite the individual corpus creators.

xrotwang commented 2 years ago

I think all we need is the Glottocode and the download link(s) plus license. The other information should be read from the metadata files in the downloads (and the license info therein should be checked against the one we keep in languages.tsv).
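
The license check mentioned above could be sketched roughly as follows. This is only an illustration: the column names `Glottocode` and `License`, the sample rows, and the `license_mismatches` helper are assumptions, not part of the dataset or of any library.

```python
import csv
import io

# Hypothetical content of languages.tsv (invented Glottocodes and licenses).
LANGUAGES_TSV = (
    "Glottocode\tLicense\n"
    "abcd1234\tCC BY 4.0\n"
    "efgh5678\tCC BY-NC 4.0\n"
)

# Hypothetical licenses as read from the metadata files in the downloads.
DOWNLOADED = {"abcd1234": "CC BY 4.0", "efgh5678": "CC BY 4.0"}


def license_mismatches(languages_tsv, downloaded):
    """Map each Glottocode whose licenses differ to (expected, found)."""
    mismatches = {}
    for row in csv.DictReader(io.StringIO(languages_tsv), delimiter="\t"):
        found = downloaded.get(row["Glottocode"])
        if found != row["License"]:
            mismatches[row["Glottocode"]] = (row["License"], found)
    return mismatches


mismatches = license_mismatches(LANGUAGES_TSV, DOWNLOADED)
```

An empty result would mean that the licenses kept in languages.tsv agree with the downloaded metadata.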

FredericBlum commented 2 years ago

The following information is now provided within a single metadata file in Version 1.1:

Language | Glottocode | iso-639-3 | Family | fam_glottocode | Area | Creator | Latitude | Longitude | Archive | Archive_link | Translation | Annotation license | Audio license | DOI | Gloss | Extended speakers | Extended word tokens | Extended texts | Core speakers | core word tokens | Core texts | Years of recordings in core set

We could either go for a reduced set (Name, Glottocode, License) or for the full information. The latter would mean a lot of custom columns, but I think it would be reasonable to include this data. It would also mean that we no longer need to read the information from the individual metadata files.
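
To make the two options concrete, here is a minimal sketch using only the standard library. The sample CSV content is invented, and mapping "Annotation license" to a single `License` column is an assumption made for the example; `language_rows` is a hypothetical helper.

```python
import csv
import io

# Hypothetical excerpt of the single metadata file. The column names are taken
# from the list above; the row values are invented for illustration.
METADATA_CSV = (
    "Language,Glottocode,Annotation license,Audio license,Gloss\n"
    "Example language,abcd1234,CC BY 4.0,CC BY 4.0,yes\n"
)

# Reduced set: source column -> column name in languages.tsv.
# Using the annotation license as "License" is an assumption.
REDUCED = {
    "Language": "Name",
    "Glottocode": "Glottocode",
    "Annotation license": "License",
}


def language_rows(text, reduced=True):
    """Return one dict per language, with either the reduced or the full set."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        if reduced:
            rows.append({new: record[old] for old, new in REDUCED.items()})
        else:
            rows.append(dict(record))
    return rows


reduced_rows = language_rows(METADATA_CSV)
full_rows = language_rows(METADATA_CSV, reduced=False)
```

With the full set, every additional source column becomes a custom column in languages.tsv.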

xrotwang commented 2 years ago

Yes, adding all this info seems reasonable. Potentially, some of it would go into a ContributionTable, though.

FredericBlum commented 2 years ago

I added the languages successfully, and the ContributionTable as well. What I am currently failing at, however, is adding a new MetadataTable. I added the component, but I cannot create the necessary metadata JSON. Could you point me to an example from some other repository, or to documentation, where I can find this? I've looked in several and couldn't identify what is missing.

Code: https://github.com/cldf-datasets/doreco/blob/main/cldfbench_doreco.py#L62-L288

xrotwang commented 2 years ago

MetadataTable is not a CLDF component, i.e. this type of data isn't standardized in CLDF. So you'd just add another custom table, passing the column specifications as positional arguments, via

cldf.add_table('metadata.csv', *columns)

and populate it, with one dict per row, via

args.writer.objects['metadata.csv'].append(row)
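
As a rough sketch of what `columns` could contain, the snippet below builds CSVW-style column descriptions from some of the header names listed earlier in this thread. The `column_specs` helper, the choice of datatypes, and the name sanitizing are assumptions for illustration; it also assumes that `add_table` accepts such dicts as column specifications.

```python
import re

# A subset of the metadata header quoted earlier in this thread.
HEADER = "Language | Glottocode | Latitude | Longitude | Core speakers"

# Assumed datatypes for non-string columns; everything else stays a string.
DATATYPES = {
    "Latitude": "decimal",
    "Longitude": "decimal",
    "Core speakers": "integer",
}


def column_specs(header):
    """Turn a pipe-separated header into a list of column description dicts."""
    specs = []
    for title in (part.strip() for part in header.split("|")):
        # Replace runs of non-word characters so the name is a plain identifier.
        name = re.sub(r"\W+", "_", title).strip("_")
        specs.append({"name": name, "datatype": DATATYPES.get(title, "string")})
    return specs


columns = column_specs(HEADER)

# Inside cmd_makecldf, the list would then be unpacked positionally, e.g.:
#   cldf.add_table('metadata.csv', *columns)
# and rows would be appended under the same file name:
#   args.writer.objects['metadata.csv'].append({'Language': ..., ...})
```

The key used in `objects` is the CSV file name given to `add_table`, since a custom table has no component name.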

FredericBlum commented 2 years ago

Thank you, so the main problem was that I had `objects['MetadataTable']` instead of the name of the CSV file. Now everything works fine.

FredericBlum commented 2 years ago

Then we can probably close this issue as well.