HughP / olac

Automatically exported from code.google.com/p/olac

Proposed: LanguageCommons #192

Closed: GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
http://www.archive.org/details/LanguageCommons

This repository is hosted at the Internet Archive.  We need to harvest its
metadata, as we already do for the Rosetta Project:
http://language-archives.org/archive/rosettaproject.org

Original issue reported on code.google.com by StevenBird1 on 2 Sep 2010 at 12:16

GoogleCodeExporter commented 9 years ago
Sorry, I don't know how it was done for the Rosetta Project. Are we supposed to 
develop and maintain the transformation service that turns the site into an 
OLAC data provider or static repository? Or, will the data provider/static 
repository be given to us?

Original comment by haepal on 2 Sep 2010 at 1:24

GoogleCodeExporter commented 9 years ago
In issue 193 I've explained how we did the Rosetta Project repository.  With 
the result from issue 193 in hand, we'll be able to capture a Language Commons 
repository. However, note that Language Commons submitters must do more when 
entering metadata.  There are currently two records:

http://www.archive.org/services/oai2.php?verb=ListRecords&metadataPrefix=oai_dc&set=collection:LanguageCommons

One has <dc:language>en</dc:language> and the other has no language at all, so 
we can't put the latter on a language index page, which makes that resource 
invisible to those who might care about it.

What the Rosetta Project collection has done is to use both <dc:language> and 
<dc:subject> with three-letter ISO language codes, which gives us rich metadata 
for OLAC purposes.  We will have to find a way to get Language Commons 
submitters to supply the appropriate language codes for both <dc:language> and 
<dc:subject>.

Original comment by garyfsim...@gmail.com on 2 Sep 2010 at 10:35
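
As an illustration of the check described in this comment, here is a minimal sketch that scans an OAI-PMH ListRecords response in oai_dc format and reports the records with no non-empty <dc:language>. It works on an inline sample rather than the live Internet Archive endpoint; the sample identifiers and titles, and the function name records_missing_language, are invented for this example and are not part of the OLAC harvester.

```python
import xml.etree.ElementTree as ET

# Standard OAI-PMH and Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical sample modelled on the two records described above:
# one with <dc:language>en</dc:language>, one with no language.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example:item-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>First item</dc:title>
          <dc:language>en</dc:language>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example:item-2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Second item</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def records_missing_language(xml_text):
    """Return identifiers of records that have no non-empty dc:language."""
    root = ET.fromstring(xml_text)
    missing = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        languages = [
            el.text.strip()
            for el in record.iterfind(".//dc:language", NS)
            if el.text and el.text.strip()
        ]
        if not languages:
            missing.append(identifier)
    return missing


print(records_missing_language(SAMPLE))
```

A report like this could be used during submission review to spot items that would not appear on any language index page.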

GoogleCodeExporter commented 9 years ago
The process of approving submissions can ensure we have language codes in the 
dc:language and dc:subject elements.  Where there are multiple languages, they 
will be comma-separated.

Original comment by StevenBird1 on 6 Sep 2010 at 4:34

GoogleCodeExporter commented 9 years ago
Based on what has been done for the Rosetta Project repository, a harvester 
and an XSL stylesheet have been written.

Using those, a Language Commons static repository has been created and 
registered.

Original comment by haepal on 25 Oct 2010 at 2:43

GoogleCodeExporter commented 9 years ago
I've just accepted the registration since it all looks valid.  However, the 
metadata could be improved.  The biggest thing missing is that both of the 
resources in the collection really want to have:

   <dc:type xsi:type="olac:linguistic-type" olac:code="primary_text"/>

so that they will emerge from the thousands of "other resources" for English 
and be findable as text corpora in the faceted search.

Is there a clue in the OAI metadata that is coming out of the Internet Archive, 
or does something need to be added to the guidelines that the Language Commons 
gives to data providers?

Original comment by garyfsim...@gmail.com on 25 Oct 2010 at 4:44
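
For illustration only, the following sketch shows one way such a <dc:type> element could be added to a record with Python's xml.etree.ElementTree. The actual conversion in this issue is done by an XSL stylesheet, so this is not the project's code; the function name add_linguistic_type and the sample record are hypothetical, and the OLAC 1.1 namespace URI is stated here as an assumption.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OLAC = "http://www.language-archives.org/OLAC/1.1/"  # assumed OLAC 1.1 namespace
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Bind the conventional prefixes so that the serialized output uses
# "olac:", which the literal xsi:type value below relies on.
ET.register_namespace("dc", DC)
ET.register_namespace("olac", OLAC)
ET.register_namespace("xsi", XSI)


def add_linguistic_type(record, code="primary_text"):
    """Append a dc:type in the OLAC linguistic-type scheme unless one exists.

    Returns True if an element was added, False otherwise.
    """
    for el in record.iterfind("{%s}type" % DC):
        if el.get("{%s}type" % XSI) == "olac:linguistic-type":
            return False
    ET.SubElement(
        record,
        "{%s}type" % DC,
        {
            "{%s}type" % XSI: "olac:linguistic-type",
            "{%s}code" % OLAC: code,
        },
    )
    return True


# Hypothetical record with only a title.
record = ET.fromstring(
    '<olac:olac xmlns:olac="%s" xmlns:dc="%s">'
    "<dc:title>Sample text</dc:title>"
    "</olac:olac>" % (OLAC, DC)
)
added = add_linguistic_type(record)
print(added)
print(ET.tostring(record, encoding="unicode"))
```

Whether primary_text is the right code for a given item still has to come from the submitter or from a clue in the source metadata, which is the open question in this comment.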

GoogleCodeExporter commented 9 years ago
[deleted comment]

GoogleCodeExporter commented 9 years ago
If the language or subject consists of exactly 2 or 3 letters, specify 
olac-language scheme.  Permit comma-separated values for multiple languages.

Original comment by StevenBird1 on 27 Oct 2010 at 9:25
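
The rule in this comment can be sketched as follows. This is an assumption-laden illustration, not the stylesheet change made for this issue: the function name to_olac_elements is hypothetical, the output form xsi:type="olac:language" with an olac:code attribute is modelled on the <dc:type> example earlier in the thread, and two-letter codes are passed through unchanged rather than mapped to three-letter codes.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# A value counts as a language code if it is exactly 2 or 3 ASCII letters.
CODE = re.compile(r"[A-Za-z]{2,3}")


def to_olac_elements(element, value):
    """Turn one dc:language or dc:subject value into a list of XML strings.

    `element` is "language" or "subject". Comma-separated values are split;
    parts that look like language codes get the OLAC language scheme, and
    anything else is kept as plain text content.
    """
    out = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        if CODE.fullmatch(part):
            out.append(
                '<dc:%s xsi:type="olac:language" olac:code=%s/>'
                % (element, quoteattr(part.lower()))
            )
        else:
            out.append("<dc:%s>%s</dc:%s>" % (element, escape(part), element))
    return out


for line in to_olac_elements("language", "en, fra"):
    print(line)
for line in to_olac_elements("subject", "Folk tales"):
    print(line)
```

One limitation of splitting on commas is that a free-text subject containing a comma is also split into separate elements, so real data may need a stricter rule.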

GoogleCodeExporter commented 9 years ago
Fixed (see revision 1528).

Original comment by haepal on 17 Nov 2010 at 4:54

GoogleCodeExporter commented 9 years ago
This point from comment 7 does not appear to be implemented yet: "If the 
language or subject consists of exactly 2 or 3 letters, specify olac-language 
scheme."

Original comment by garyfsim...@gmail.com on 28 Nov 2010 at 2:48

GoogleCodeExporter commented 9 years ago
The static repository XML file itself hadn't been updated. I have just updated 
the file, which will be re-harvested soon.

Original comment by haepal on 6 Dec 2010 at 3:39

GoogleCodeExporter commented 9 years ago
The repository was purged from the database because of an incorrect base URL. 
The stylesheet has been fixed to correct this, and the repository XML file has 
been fixed as well.

Original comment by haepal on 7 Dec 2010 at 5:38

GoogleCodeExporter commented 9 years ago

Original comment by haepal on 8 Dec 2010 at 2:58

GoogleCodeExporter commented 9 years ago
The base URL for the Language Commons was submitted and approved some days 
ago, but the new records are not showing up in OLAC search.  The new URL, shown 
in archive_review.php, is http://upload.languagecommons.org/sr .  However, the 
registered URL, shown in 
http://www.language-archives.org/archive/languagecommons.org, still seems to be 
http://www.language-archives.org/hosted/languagecommons.org.xml . 

Original comment by StevenBird1 on 22 Feb 2011 at 12:05

GoogleCodeExporter commented 9 years ago
I found that the harvester process had been blocked since Feb 6. I killed the 
process, and will check tomorrow whether it has run and whether the new URL has 
been harvested.

Original comment by haepal on 22 Feb 2011 at 9:32

GoogleCodeExporter commented 9 years ago
The harvester cron job harvested the new URL successfully.

Original comment by haepal on 23 Feb 2011 at 2:43

GoogleCodeExporter commented 9 years ago

Original comment by StevenBird1 on 28 Feb 2011 at 3:04