LinguList opened this issue 7 years ago
This site should be a clld app :)
:-)
Just realized: they have a wordlist there, and they link to scans of the original data, which facilitates our work (they even include pagenames).
I'm happy to talk about ways of moving the bates data to another platform!
The natural way is to convert the data to CLDF format, from where it is rather easy for our team to help turn this into a CLLD application (similar to WOLD, WALS (the World Atlas of Language Structures), etc.). If you are interested, @nthieberger, I suggest we pursue the discussion via email, including @xrotwang and some other people from Jena, who will gladly help and share their opinions. You can find my email here.
@nthieberger just looked at the site again, and yes, it would be a good exercise to figure out how much of this data can be captured in CLDF right away. Presumably the link into the scans could be stored using the IIIF Image API, maybe via a valueUrl on a column.
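For illustration, such a column could look roughly like the following in the CSVW metadata; the IIIF server URL, the column name and the region/size/rotation/quality segments are made up here, just to show where a page identifier would slot into an Image API URL template:
{
    "name": "Page_Image",
    "valueUrl": "https://iiif.example.org/iiif/{Page_Image}/full/full/0/default.jpg"
}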
I'm the main developer on the Bates project.
I've had a rather quick look at the CLDF ontology and I agree it seems fairly straightforwardly compatible with the Bates data model.
If I understand correctly, the CLDF spec uses an RDF data model but mandates CSV on the Web as the only allowable serialization. So we would need to express the data in CSV form (and possibly also BibTeX for linking to source data?).
My main question, I think, is how to link the words in a word-list to an electronic facsimile of a primary source document. Is the https://cldf.clld.org/v1.0/terms.rdf#source property (along with an ancillary BibTeX document) the solution?
NB the Bates project primarily models the data as a corpus of TEI XML files. The individual words in the TEI files are marked up with language codes (though in fact most of them are still marked with the generic "aus" code; this tagging is a work in progress). Each of the original word-lists was originally written by hand on paper forms which had been pre-printed with English terms. Years later these word-lists were typed up, and in some cases the typescripts were collated from more than one of the word-list manuscripts. Each TEI file contains a manifest of page images, and these images are associated with the words in the word-lists. So a given word will have two TEI XML files and two image files as its sources. These sources are all aggregated together in the website's user interface; it's possible to link to a given word in the context of both of its source documents, e.g. here is a link to "magpie" in the Mirning language in a word-list from the Eucla district:
http://bates.org.au/text/39-078T.html#magpie
Incidentally, we have an RDF dataset already, which uses the SKOS vocabulary to model the words in Bates' original vocabulary as skos:Concept resources, using skos:altLabel to record the various translations, with RDF's native language tags recording the language of each string literal. Each skos:Concept belongs to a skos:ConceptScheme (i.e. a thesaurus) which represents one particular word-list, and the equivalent concepts in all the concept schemes are related with skos:exactMatch (if I recall correctly) to the corresponding original concept in Bates' (English-language) vocabulary. We use skos:note to refer from a concept to the corresponding resource fragment on the website. We use the http://www.w3.org/2003/01/geo/wgs84_pos ontology for geolocations. Plus we have a couple of ad hoc terms which are not formally defined anywhere. Not perhaps the ideal ontology, but it does the job, more or less. You can browse the entire RDF dataset here: http://bates.org.au/rdf/
The RDF dataset is currently used only to generate the JSON data files which underlie the map viewer. Each of these JSON files is created by querying a SPARQL graph store containing the dataset, and then formatting the result as JSON using an XSLT. Here's the SPARQL query:
# Each word map represents a top level English word, and includes language equivalents
prefix bates: <http://bates.conaltuohy.com/rdf/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
select
  ?aboriginalConcept
  ?aboriginalTerm
  ?narrowerEnglishTerm ?narrowerAboriginalTerm
  ?broaderEnglishTerm ?broaderAboriginalTerm
  ?latitude ?longitude
  ?note
where {
  # aboriginal concept matches the English word
  ?aboriginalConcept skos:exactMatch <{$bates-concept-uri}>.
  # aboriginal concept comes from a vocabulary with latitude and longitude
  ?aboriginalLexicon skos:hasTopConcept ?aboriginalConcept.
  ?aboriginalLexicon geo:lat ?latitude.
  ?aboriginalLexicon geo:long ?longitude.
  # the aboriginal concept has a note referring to the HTML transcript
  ?aboriginalConcept skos:note ?note.
  # Then either this concept has an altLabel itself,
  # or it is connected to a narrower or broader concept which in turn has
  # an English prefLabel and one or more aboriginal altLabels.
  # Each aboriginalConcept has a label (which is the aboriginal word).
  {
    ?aboriginalConcept skos:altLabel ?aboriginalTerm.
  }
  union
  {
    ?aboriginalConcept skos:narrower ?narrowerConcept.
    ?narrowerConcept skos:prefLabel ?narrowerEnglishTerm.
    ?narrowerConcept skos:altLabel ?narrowerAboriginalTerm.
  }
  union
  {
    ?aboriginalConcept skos:broader ?broaderConcept.
    ?broaderConcept skos:prefLabel ?broaderEnglishTerm.
    ?broaderConcept skos:altLabel ?broaderAboriginalTerm.
  }
}
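(As an aside for readers: the {$bates-concept-uri} placeholder is presumably filled in per concept by the XSLT pipeline described above. The equivalent step could be sketched in Python against a SPARQL endpoint roughly as follows; the endpoint URL is hypothetical:)

from SPARQLWrapper import SPARQLWrapper, JSON

def word_map(concept_uri, query_template, endpoint_url="http://localhost:8080/sparql"):
    """Run the word-map query for one English concept and return the JSON result bindings."""
    endpoint = SPARQLWrapper(endpoint_url)
    # substitute the XSLT-style placeholder with the concrete concept URI
    endpoint.setQuery(query_template.replace("{$bates-concept-uri}", concept_uri))
    endpoint.setReturnFormat(JSON)
    return endpoint.query().convert()["results"]["bindings"]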
@Conal-Tuohy thanks for this info! Regarding links to the electronic facsimiles: I think I'd add two columns to the forms table:
{
    "name": "typedSource",
    "propertyUrl": "http://purl.org/dc/terms/source",
    "valueUrl": "http://staging.bates.org.au/images/{typedSource}"
}
and link to a BibTeX record for the facsimile only if this exists in some published form.
So given a value of 39/39-085T.jpg for the typedSource column, we can get to the image at http://staging.bates.org.au/images/39/39-085T.jpg
@Conal-Tuohy you don't have pixel-coordinates for the location of words within the facsimile pages, do you?
Thanks again @Conal-Tuohy for pointing me to the RDF data. Starting from there, I'll have a go at extracting a concept list suitable for adding to Concepticon. AFAIU, this list would comprise every skos:Concept in http://bates.org.au/rdf/bates.xml plus every skos:Concept in any of the individual wordlists that doesn't have an exactMatch.
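A rough sketch of that extraction with rdflib (the word-list file name is a placeholder based on what is browsable at http://bates.org.au/rdf/, and I'm assuming the concepts are explicitly typed as skos:Concept and that the top-level concepts in bates.xml carry no exactMatch of their own):

from rdflib import Graph
from rdflib.namespace import RDF, SKOS

g = Graph()
# load the top-level vocabulary plus the individual word-list graphs
for path in ["bates.xml", "40-042T.xml"]:
    g.parse(path)

# every skos:Concept that lacks a skos:exactMatch link is a candidate for the new list
unmatched = [
    concept for concept in g.subjects(RDF.type, SKOS.Concept)
    if (concept, SKOS.exactMatch, None) not in g
]
print(len(unmatched), "concepts without an exactMatch")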
@xrotwang we don't have pixel co-ords for the location of words in the facsimile images, no.
We do have identifiers for the words in the TEI XML though, so we could point directly to the words in the transcript (i.e. in the web pages). That's why I'd think it preferable to point to the words as they appear in the web pages rather than to the images. I agree that dcterms:source is a reasonable predicate to use.
@Conal-Tuohy ok, I see. The image URLs, being hosted on staging.*, look like they are not meant to be persistent anyway, right?
Regarding your comment about the "concepticon", @xrotwang, we do also have a task to organise the entire set of concepts into a single thesaurus. So far we've concentrated on the original Bates vocabulary of terms, but @nthieberger has also done some work on semantically aligning terms beyond that scope, though it isn't yet reflected in the RDF dataset and hence in the website.
Yes, putting everything (also what seems to be whole sentences) into a single thesaurus looks like a project - I counted 14,300 distinct skos:Concepts with either no skos:exactMatch or with a skos:exactMatch not yet in bates.xml.
If I restrict the concepts missing in bates.xml to ones which are represented in at least 5 wordlists, I'm down to 1217 distinct glosses/labels/ids. Maybe this would be a good starting point for a concepticon list. What do you think @LinguList ?
Actually, sticking to the 1829 concepts in the questionnaire may be the better contribution to Concepticon, since it comes with a bit of documentation.
Yes, this sounds good!
We may end up linking only half of it now, but if you can provide a first list with ID and GLOSS here, we can run it through the automatic Concepticon mapping...
Ok, I'll pipe this through concept linking, and see where this gets us.
I'd be happy with > 70%, which could then break down to 50% in the end, which would still be some 1000 or so concepts...
Ok, here's the output of concepticon map_concepts: Bates-1904-1829.txt
Whoever has time to check the mappings, feel free to take a stab.
Oh, and yes, @LinguList , your ballpark numbers seem about right. 73% now, will come down due to very detailed body parts, etc.
Some very strange mappings in this list. For example, the mapping to WRAP presumably refers to the verb 'to wind' and not to the noun 'wind':
Bates-1904-1829-639 639 Wind (East) #wind-east 1113 WRAP 4
From line 1668 it goes very wrong, with occasional correct mappings (e.g. 14928, 13499):
Bates-1904-1829-859 859 Cut, to, with native hammer
@nthieberger yes, as soon as it gets to multi-word glosses, the automatic mapping isn't very useful anymore. For rather unusual lists like this one, the main advantage of the automatic mapping is that it provides you with a syntactically correct concept list - most of the actual mappings require human inspection ...
Yes, exactly. The algorithm is useful, but we are discussing how to revise it in a principled manner. It takes its information from past Concepticon mappings, but this is of little help where the glosses become idiosyncratic or have simply not been mapped before.
I've done a draft mapping to CLDF, available here: http://staging.bates.org.au/cldf/
So far this is a FormTable only. I've got a Parameter_ID column but no corresponding ParameterTable just yet.
It's also "metadata-free" still, but I understand I will need metadata in order to meaningfully add a column containing links from each form to the corresponding part of the bates.org website.
For now I have excluded forms which are marked up with only the generic language code "aus" (i.e. some Australian indigenous language), which actually excludes the bulk of the corpus. I'm now looking to see if I can quickly add language codes to a few more texts and update the CSV.
I'd appreciate any and all feedback on this.
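One quick way to sanity-check such a draft with pycldf (assuming it is a metadata-free forms.csv downloaded locally from http://staging.bates.org.au/cldf/) might be something like:

from pycldf import Dataset

# metadata-free CLDF: module and defaults are inferred from the file name
ds = Dataset.from_data("forms.csv")
ds.validate()

for row in ds["FormTable"].iterdicts():
    print(row["ID"], row["Parameter_ID"], row["Form"])
    break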
Looks good as a first step. And as you say, adding a short metadata file would be the next step. This would allow you to refactor the identifiers into more readable local IDs, e.g. replace http%3A%2F%2Fbates.org.au%2Frdf%2F40-042T.xml%23abduct-to with 40-042T.xml#abduct-to and keep the URL info as a valueUrl property in the metadata:
{
    "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#FormTable",
    "tableSchema": {
        "columns": [
            {
                "datatype": "string",
                "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#id",
                "valueUrl": "http://bates.org.au/rdf/{ID}",
                "required": true,
                "name": "ID"
            },
            ...
Then, adding in a ParameterTable would be nice, because this is where the linking to Concepticon would go. You probably should be using pycldf to cobble together the CLDF dataset. This would allow you to focus on the data, while the metadata would mostly come for free - i.e. be generated by the package appropriately.
Thanks! So, what would pycldf do for me that my SPARQL store doesn't?
pycldf would write the Wordlist-metadata.json and do so in a way that is fairly easy to customize, e.g. to add the valueUrl property for the ID column. But I guess, if you start from examples, e.g. the metadata for IDS, adding in a handful of custom properties by hand wouldn't be much of a problem either.
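A minimal sketch of that workflow with pycldf (the output directory, the source entry, the placeholder language code and the example rows are invented for illustration; the local ID and gloss reuse the abduct-to example above):

from csvw.metadata import URITemplate
from pycldf import Wordlist
from pycldf.sources import Source

ds = Wordlist.in_dir("cldf")        # creates Wordlist-metadata.json plus the default FormTable
ds.add_component("ParameterTable")  # concepts live here; Concepticon links would be added later

# keep readable local IDs and recover the full URL via valueUrl, as suggested above
# (the default ID column only allows alphanumerics, "_" and "-", so its datatype may need
#  to be relaxed to plain "string", as in the metadata snippet above)
ds["FormTable", "ID"].valueUrl = URITemplate("http://bates.org.au/rdf/{ID}")

# one source entry per facsimile/typescript; key and title are invented here
ds.add_sources(Source("misc", "bates-40-042T", title="Typescript 40-042T"))

ds.write(
    FormTable=[{
        "ID": "40-042T.xml#abduct-to",
        "Language_ID": "aus",        # the generic code mentioned earlier, as a placeholder
        "Parameter_ID": "abduct-to",
        "Form": "...",               # the transcribed word itself would go here
        "Source": ["bates-40-042T"],
    }],
    ParameterTable=[{
        "ID": "abduct-to",
        "Name": "Abduct, to",        # assumed English gloss for the concept
    }],
)

pycldf also ships a cldf command-line tool, so running cldf validate over the resulting metadata file should then flag anything that does not conform.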
This list is currently digitized in this project: http://www.bates.org.au/
We should definitely link to the website and to the concepts (a bit cryptic), but including the original source would also be nice. It is still not entirely clear how many concepts they used.