concepticon / concepticon-data

The curation repository for the data behind Concepticon.
https://concepticon.clld.org
35 stars 37 forks source link

Bates-1985-XXXX list on Australian languages #273

Open LinguList opened 7 years ago

LinguList commented 7 years ago

This list is currently digitized in this project: http://www.bates.org.au/

we should definitely link to the website and the concepts (a bit cryptic), but including the original source would also be nice. It is still not entirely clear how many concepts they used.

xrotwang commented 7 years ago

This site should be a clld app :)

LinguList commented 7 years ago

:-)

LinguList commented 7 years ago

Just realized: they have a wordlist there, and they link to scans of the original data, which facilitates our work (they even include pagenames).

nthieberger commented 6 years ago

I'm happy to talk about ways of moving the bates data to another platform!

LinguList commented 6 years ago

The natural way is to convert to cldf format, from where it is rather easy for our team to help in turning this into a clld application (similar to WOLD, the World Atlas of language Structures, etc.). If you are interested, @nthieberger, I suggest we pursue the discussion via email, including @xrotwang and some other people from Jena, who will gladly help/share their opinions. You find my email here

xrotwang commented 6 years ago

@nthieberger just looked at the site again, and yes, it would be a good exercise to figure out how much of this data can be captured in CLDF right away. Presumably the link into the scans could be stored using the IIIF image API - maybe using a valueUrl on a column.

Conal-Tuohy commented 6 years ago

I'm the main developer on the Bates project.

I've had a rather quick look at the CLDF ontology and I agree it seems fairly straightforwardly compatible with the Bates data model.

If I understand correctly, the CLDF spec uses an RDF data model, but mandates the use of CSV on the Web as the only allowable serialization. So we would need to express the data in CSV form (and possibly also BibTeX for linking to source data?)

My main question I think would be about now to link the words in a word-list to an electronic facsimile of a primary source document. Is the https://cldf.clld.org/v1.0/terms.rdf#source property (along with a ancillary BibTeX document) the solution?

NB the Bates project primarily models the data as a corpus of TEI XML files. The individual words in the TEI files are marked up with language codes (though in fact most of them are still marked with the generic "aus" code -- this tagging in a work in progress). Each of the original word-lists was originally written by hand on paper forms which had been pre-printed with English terms. Years later these word-lists were typed up and in some cases the typescripts were collated from more than one of the word-list manuscripts. Each TEI file contains a manifest of page images, and these images are associated with the words in the word-lists. So a given word will have two TEI XML files and two image files as its sources. These sources are all aggregated together in the website's user interface; it's possible to link to a given word in the context of both of its source documents, e.g. here is a link to "magpie" in the Mirning language in a word-list from the Eucla district:

http://bates.org.au/text/39-078T.html#magpie

Incidentally, we have an RDF dataset already, which uses the SKOS vocabulary to model the words in Bates' original vocabulary as skos:Concept resources, and using skos:altLabel to record the various translations, using RDF's native language tags to record the language of each string literal. Each skos:Concept belongs to a skos:ConceptScheme (i.e. a thesaurus) which represents one particular word-list, and the equivalent concepts in all the concept schemes are related with skos:exactMatch (if I recall correctly) to the corresponding original concept in Bates' (English language) vocabulary. We use skos:note to refer from a concept to the corresponding resource fragment on the website. We use the http://www.w3.org/2003/01/geo/wgs84_pos ontology for geolocations. Plus we have a couple of ad hoc terms which are not formally defined anywhere. Not perhaps the ideal ontology but it does the job, more or less. You can browse the entire RDF dataset here: http://bates.org.au/rdf/

The RDF dataset is currently used only to generate the JSON data files which underlie the map viewer. Each of these JSON files is created by querying a SPARQL graph store containing the dataset, and then formatting the result as JSON using an XSLT. Here's the SPARQL query:

# Each word map represents a top level English word, and includes language equivalents
prefix bates: <http://bates.conaltuohy.com/rdf/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
select 
    ?aboriginalConcept 
    ?aboriginalTerm 
    ?narrowerEnglishTerm ?narrowerAboriginalTerm 
    ?broaderEnglishTerm ?broaderAboriginalTerm 
    ?latitude ?longitude 
    ?note       
where {
    # aboriginal concept matches the English word
    ?aboriginalConcept skos:exactMatch <{$bates-concept-uri}>.
    # aboriginal concept comes from a vocabulary with latitude and longitude
    ?aboriginalLexicon skos:hasTopConcept ?aboriginalConcept.
    ?aboriginalLexicon geo:lat ?latitude.
    ?aboriginalLexicon geo:long ?longitude.
    # the aboriginal concept has a note referring to the HTML transcript
    ?aboriginalConcept skos:note ?note.
    # Then either this concept has an altLabel itself,
    # or it is connected to a narrower or broader concept which in turn has 
    # an English prefLabel and one or more aboriginal altLabels
    # each aboriginalConcept has a label (which is the aboriginal word) 
    {
        ?aboriginalConcept skos:altLabel ?aboriginalTerm.
    }
    union
    {
        ?aboriginalConcept skos:narrower ?narrowerConcept.
        ?narrowerConcept skos:prefLabel ?narrowerEnglishTerm.
        ?narrowerConcept skos:altLabel ?narrowerAboriginalTerm.
    }
    union
    {
        ?aboriginalConcept skos:broader ?broaderConcept.
        ?broaderConcept skos:prefLabel ?broaderEnglishTerm.
        ?broaderConcept skos:altLabel ?broaderAboriginalTerm.
    }
}
xrotwang commented 6 years ago

@Conal-Tuohy thanks for this info! Regarding links to the electronic facsimiles: I think I'd add two columns to the forms table:

{
    "name": "typedSource",
    "propertyUrl": "http://purl.org/dc/terms/source",
    "valueUrl": "http://staging.bates.org.au/images/{typedSource}"
}

and link to a BibTeX record for the facsimile only if this exists in some published form.

So given a value of 39/39-085T.jpg for the typedSource column, we can get to the image at http://staging.bates.org.au/images/39/39-085T.jpg

xrotwang commented 6 years ago

@Conal-Tuohy you don't have pixel-coordinates for the location of words within the facsimile pages, do you?

xrotwang commented 6 years ago

Thanks again @Conal-Tuohy for pointing me to the RDF data. Starting from there, I'll have a go at extracting a concept list suitable for adding to Concepticon. AFAIU, this list would comprise every skos:Concept in http://bates.org.au/rdf/bates.xml plus every skos:Concept in any of the individual wordlists, which doesn't have an exactMatch.

Conal-Tuohy commented 6 years ago

@xrotwang we don't have pixel co-ords for the location of words in the facsimile images, no.

We do have identifiers for the words in the TEI XML though, so we could point directly to the words in the transcript (i.e. in the web pages). That why I'd think it preferable to point to the words as they appear in the web pages rather than to the images. I agree that dcterms:source is a reasonable predicate to use.

xrotwang commented 6 years ago

@Conal-Tuohy ok, I see. The image URLs - being hosted on staging.* - look like they are not meant to be persistent anyway, right?

Conal-Tuohy commented 6 years ago

Regarding your comment about the "concepticon", @xrotwang, we do also have a task to organise the entire set of concepts into a single thesaurus. So far we've concentrated on the original Bates vocabulary of terms, but @nthieberger has also done some work on semantically aligning terms beyond that scope, though it isn't yet reflected in the RDF dataset and hence in the website.

xrotwang commented 6 years ago

Yes, putting everything (also what seems to be whole sentences) into a single thesaurus looks like a project - I counted 14,300 different skos:Concept with either no skos:exactMatch or with a skos:exactMatch not yet in bates.xml.

xrotwang commented 6 years ago

If I restrict the concepts missing in bates.xml to ones which are represented in at least 5 wordlists, I'm down to 1217 distinct glosses/labels/ids. Maybe this would be a good starting point for a concepticon list. What do you think @LinguList ?

xrotwang commented 6 years ago

Actually, sticking to the 1829 concepts in the questionnaire may be the better contribution to Concepticon, since it comes with a bit of documentation.

LinguList commented 6 years ago

Yes, this sounds good!

LinguList commented 6 years ago

We may end up linking only half of it now, but if you can provide a first list with ID and GLOSS, here, we can run the check in automatic concepticon mapping...

xrotwang commented 6 years ago

Ok, I'll pipe this through concept linking, and see where this gets us.

LinguList commented 6 years ago

I'd be happy with > 70%, which could then break down to 50% in the end, which would still be some 1000 or so concepts...

xrotwang commented 6 years ago

Ok, here's the output of concepticon map_concepts. Bates-1904-1829.txt

Whoever has time to check the mappings, feel free to take a stab.

xrotwang commented 6 years ago

Oh, and yes, @LinguList , your ballpark numbers seem about right. 73% now, will come down due to very detailed body parts, etc.

nthieberger commented 6 years ago

Some very strange mappings in this list,

This presumably refers to the verb 'to wind' and not to the noun 'wind' Bates-1904-1829-639 639 Wind (East) #wind-east 1113 WRAP 4

From line 1668 it goes very wrong, with occasional correct mappings (e.g. 14928, 13499) Bates-1904-1829-859 859 Cut, to, with native hammer

cut-to-with-native-hammer 203 MARRY 4

On Mon, 5 Nov 2018 at 22:19, Robert Forkel notifications@github.com wrote:

Ok, here's the output of concepticon map_concepts. Bates-1904-1829.txt https://github.com/clld/concepticon-data/files/2548138/Bates-1904-1829.txt

Whoever has time to check the mappings, feel free to take a stab.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/clld/concepticon-data/issues/273#issuecomment-435840231, or mute the thread https://github.com/notifications/unsubscribe-auth/AAlUoOfShkTbozhYZ7nVffP1DBr9UAfZks5usB7KgaJpZM4LKkNG .

xrotwang commented 6 years ago

@nthieberger yes, as soon as it gets to multi-word glosses, the automatic mapping isn't very useful anymore. For rather unusual lists like this one, the main advantage of the automatic mapping is that it provides you with a syntactically correct concept list - most of the actual mappings require human inspection ...

LinguList commented 6 years ago

Yes, exactly. The algorithm is useful, but we are discussing how to revise it in a principled manner. It takes info from past concepticon mappings, but it is lost in cases where the glosses become idiosyncratic, or have just not been mapped before.

LinguList commented 6 years ago

BTW: if you, @nthieberger or @Conal-Tuohy want to look into these mappings: this tutorial explains the output format, and what to do with it. Furthermore, here you find an interface that allows for quick search, including fuzzy matches, in case you want to help us mapping the glosses.

Conal-Tuohy commented 6 years ago

I've done a draft mapping to CLDF, available here: http://staging.bates.org.au/cldf/

So far this is a FormTable only. I've got a Parameter_ID column but no corresponding ParameterTable just yet.

It's also "metadata-free" still, but I understand I will need metadata in order to meaningfully add a column containing links from each form to the corresponding part of the bates.org website.

For now I have excluded forms which are marked up with only the generic language code "aus" (i.e. some Australian indigenous language), which actually excludes the bulk of the corpus. I'm looking now to see if I can add some language codes to a few more texts, quickly, and update the CSV.

I'd appreciate any and all feedback on this.

xrotwang commented 5 years ago

Looks good as a first step. And as you say, adding a short metadata file would be the next step. This would allow you to refactor the identifier into more readable local IDs, e.g. replace http%3A%2F%2Fbates.org.au%2Frdf%2F40-042T.xml%23abduct-to with 40-042T.xml#abduct-to and keep the URL info as valueUrl property in the metadata:

        {
            "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#FormTable",
            "tableSchema": {
                "columns": [
                    {
                        "datatype": "string",
                        "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#id",
                        "valueUrl": "http://bates.org.au/rdf/{ID}",
                        "required": true,
                        "name": "ID"
                    },
...
xrotwang commented 5 years ago

Then, adding in a ParameterTable would be nice, because this is where the linking to Concepticon would go in. You probably should be using pycldf to cobble together the CLDF dataset. This would allow you to focus on the data, while the metadata would mostly come for free - i.e. be generated by the package appropriately.

Conal-Tuohy commented 5 years ago

Thanks! So, what would pycldf do for me that my SPARQL store doesn't?

xrotwang commented 5 years ago

pycldf would write the Wordlist-metadata.json and do that in a way that is fairly easy to customize, e.g. to add the valueUrl property for the ID column. But I guess, if you start from examples, e.g. the metadata for IDS, adding in a handful of custom properties by hand wouldn't be much of a problem either.