CopticScriptorium / cts

Coptic Scriptorium's website for reading digitized Coptic texts and CTS URN resolution
http://data.copticscriptorium.org
Apache License 2.0

Filter for corpus sahidica.mark produces inaccurate results #71

Closed by ctschroeder 9 years ago

ctschroeder commented 9 years ago

Filtering for the corpus sahidica.mark should list chapters 1-6 and 12-16 of the Gospel of Mark from the manually edited Mark Corpus. It does not. It erroneously also lists one chapter of 1Cor from the 1cor manually edited corpus and Mark chapters 12-16 of the unedited sahidica.nt corpus.

There is a basic problem with the ingest of metadata or something going on here.

lukehollis commented 9 years ago

Can you double check the ANNIS metadata here? I'm seeing 1Cor_01 returning a corpus metadata value of "sahidica.mark": http://corpling.uis.georgetown.edu/annis-service/annis/meta/doc/sahidica.1corinthians/1Cor_01

ctschroeder commented 9 years ago

oh yeah lookie lookie there

ctschroeder commented 9 years ago

wait -- reassigned too soon -- what's going on with the Mark chapters 12-16 of the unedited sahidica.nt corpus? Why are they there?

amir-zeldes commented 9 years ago

What I think is happening is that multiple corpora contain documents with the same name, and then they both get each other's metadata. For example, both sahidica.nt and sahidica.mark contain a document called Mark_16. But since they come from different corpora, these documents should be kept distinct, and their metadata is not the same (the sahidica.nt version was not annotated by Rebecca Krawiec, but sahidica.mark was).

The ingested documents should be stored with the entire path they were brought from (sahidica.mark > Mark > Mark_16)
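As a rough illustration of the collision described above, here is a minimal Python sketch. It is not the project's ingest code: the function names, the record layout, and the `annotated` field are hypothetical, and the intermediate path component for the sahidica.nt document is assumed to be `Mark` for the sake of the example.

```python
# Hypothetical ingest records: (full path, metadata).
# The middle path element for sahidica.nt is an assumption.
DOCS = [
    (("sahidica.nt", "Mark", "Mark_16"),
     {"corpus": "sahidica.nt", "annotated": False}),
    (("sahidica.mark", "Mark", "Mark_16"),
     {"corpus": "sahidica.mark", "annotated": True}),
]


def index_by_name(docs):
    """Collision-prone: key on the bare document name only."""
    index = {}
    for path, meta in docs:
        # Both Mark_16 records land in the same slot and their
        # metadata is merged, so one corpus overwrites the other.
        index.setdefault(path[-1], {}).update(meta)
    return index


def index_by_path(docs):
    """Key on the entire path, so same-named documents stay distinct."""
    return {tuple(path): dict(meta) for path, meta in docs}


def filter_by_corpus(index, corpus):
    """Return the sorted keys whose corpus metadata matches."""
    return sorted(k for k, m in index.items() if m.get("corpus") == corpus)


by_name = index_by_name(DOCS)
by_path = index_by_path(DOCS)
```

With name-only keys the two `Mark_16` records collapse into a single entry carrying whichever metadata was written last. With full-path keys both survive, and `filter_by_corpus(by_path, "sahidica.mark")` returns only the document that came from sahidica.mark.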

lukehollis commented 9 years ago

I have a fix in place for the Mark issues--rerunning the ingest now.

ctschroeder commented 9 years ago

@amir-zeldes 1Cor chapters 1-9 are ready for republication; I fixed the metadata field in the 1 Cor 1 document. Oddly, the incorrect value was in the corpus field of the corpus-level metadata, not the document-level metadata.