erc-releven / .github


Import death factoids to RDF data #1

Open tla opened 1 year ago

tla commented 1 year ago

Factoid data attached

c11deaths-AA.xlsx c11deaths-MR.xlsx

tla commented 1 year ago

The deaths themselves are already recorded in the database; here we are parsing and adding the date information. We can discuss the details further on Wednesday, and make notes in this issue.

Aaleks93 commented 10 months ago

Also related to issue #2: the revised version of the death factoids is complete. You can access the updated spreadsheet through this link

Aaleks93 commented 8 months ago

The spreadsheet with death records now includes the sources on which I based the datings where my name is the authority. The file from 21.11.2023 has therefore been replaced by the file named "C11 PBW Death records, AA_revised version_09.01.2024." xlsx, accessible here https://ucloud.univie.ac.at/index.php/f/797833040

tla commented 7 months ago

Report from @lu-pl 💯 I implemented the table conversion for the editor rows, see example output. The P14 assertion for assigning Aleks or Marton is still missing, will add it today (+ some minor fixes).

Note that some SPARQL queries return empty, in which case no RDF is generated. See the logs. I haven't really looked into that (yet) because I think you said you would like to investigate the empty queries yourself.

lu-pl commented 7 months ago

Update: Implemented the missing P14 assertions, see output.

tla commented 7 months ago

Quoting @lu-pl: "Note that some SPARQL queries return empty, in which case no RDF is generated. See the logs. I haven't really looked into that (yet) because I think you said you would like to investigate the empty queries yourself."

Some of these are expected (where they are based on sources that we ended up not using), but others have to do with the fact that the Name column has something added in parentheses. So for example Ioannes (Smbat) 106 should just be queried as Ioannes 106. I don't know where the parenthetical text came from, but it needs to be stripped / ignored in all cases.
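A minimal sketch of the stripping step described above, assuming the Name cell arrives as a plain string. The function name `normalize_name` is hypothetical and not taken from the actual conversion script.

```python
import re

# Matches a parenthetical such as " (Smbat)" together with the whitespace before it.
_PARENTHETICAL = re.compile(r"\s*\([^)]*\)")


def normalize_name(raw: str) -> str:
    """Return the Name cell with parenthetical text removed and whitespace collapsed."""
    without_parens = _PARENTHETICAL.sub("", raw)
    return " ".join(without_parens.split())


print(normalize_name("Ioannes (Smbat) 106"))  # prints "Ioannes 106"
```

Names without a parenthetical pass through unchanged, so the same function could be applied to every row before the query is built.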

For sanity-checking purposes, it might be helpful to keep a list of the sources we aren't using; these include at least: Council of 1157; Italikos; Niketas Choniates, Historia; Pantokrator Typikon; Prodromos, Historische Gedichte; Tzetzes, Letters. If you could implement these as exclusions (i.e. if the Source canonical name is one of these, just skip the row) and output in the log what the source was every time a query returns nothing, this would help me audit a new run.
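The exclusion and logging request could be sketched roughly as follows. This is an illustration under assumptions, not the script's actual code: the row is assumed to be a dict with `Name` and `Source` keys, the logger name is made up, and the source list is read as six canonical names (three of which contain a comma).

```python
import logging

logger = logging.getLogger("deaths_import")  # hypothetical logger name

# Sources named in this thread as not being used (assumed reading of the list).
UNUSED_SOURCES = frozenset({
    "Council of 1157",
    "Italikos",
    "Niketas Choniates, Historia",
    "Pantokrator Typikon",
    "Prodromos, Historische Gedichte",
    "Tzetzes, Letters",
})


def should_skip(row: dict) -> bool:
    """True if the row's source is on the exclusion list."""
    return row.get("Source", "").strip() in UNUSED_SOURCES


def report_empty(row: dict) -> str:
    """Log and return a message naming the source of a row whose query returned nothing."""
    message = (
        f"empty query result: name={row.get('Name')!r} "
        f"source={row.get('Source')!r}"
    )
    logger.warning(message)
    return message
```

Calling `should_skip(row)` before building the query and `report_empty(row)` whenever the result set is empty would give one log line per unexplained miss, which is what an audit of a new run needs.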

lu-pl commented 7 months ago

Update:

Parenthetical text in Name fields gets ignored now and unused Source values are skipped (see the log).

The script now generates a TriG file, deaths.trig, with a named graph for every table partition.

I also investigated the empty queries. Some of those were caused by typos or incomplete PBW strings in the tables. I queried the store for the correct PBW strings and manually updated the tables in the r11tab/tables/xlsx folder.

For the remaining empty queries, the PBW data is in most cases missing from the triplestore, so I don't really know what to do about that.

lu-pl commented 7 months ago

Note: I would like to/will port the metadata schema used in the r11cli application to the table conversion at some point, if that is alright.

tla commented 6 months ago

I've now looked at the empty queries, which have three causes:

PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX star: <https://r11.eu/ns/star/>

select ?pub ?d ?a4 ?e
where { 
    ?a1 a star:E13_crm_P3 ;
        crm:P140_assigned_attribute_to ?d ;
        crm:P141_assigned """She died on a November 1 [shortly after 1100, a year before <Isaakios 61>]"""@en ;
        crm:P14_carried_out_by ?authority ;
        crm:P17_was_motivated_by ?source .
    ?d a crm:E69_Death .
    ?a2 a star:E13_crm_P100 ;
        crm:P140_assigned_attribute_to ?d ;
        crm:P141_assigned ?p .
    ?p a crm:E21_Person .
    ?id a crm:E15_Identifier_Assignment ;
        crm:P140_assigned_attribute_to ?p ;
        crm:P37_assigned ?e42 .
    ?e42 a crm:E42_Identifier ;
         crm:P190_has_symbolic_content "Anna 61" .
    ?a3 a star:E13_lrmoo_R15 ;
        crm:P140_assigned_attribute_to ?pub ;
        crm:P141_assigned ?source .
    ?a4 a star:E13_lrmoo_R24 ;
        crm:P140_assigned_attribute_to ?pubcreation ;
        crm:P141_assigned ?pub ;
        crm:P14_carried_out_by ?e .
    ?e crm:P3_has_note ?editor . 
} limit 1

tla commented 6 months ago

I forgot the fourth case, which was a death record for Symbatios 101 from Iveron 2.178.5; this is from a document in the Iveron archive that was produced in 1098, which is past our cutoff point of 1095.

lu-pl commented 6 months ago

All empty query cases are handled now (see logs), and I updated the script to the new metadata schema.

The way this is implemented now, a named graph + metadata is generated for every table partition, see deaths.trig. Another option would be to merge all graphs into a single named graph and generate metadata only for that graph.

lu-pl commented 6 months ago

note: Metadata of course gets generated only once for every software execution, but every named graph is registered as being an output of that software execution, see the metadata graph.

lu-pl commented 6 months ago

The script now produces a single turtle file with all subgraphs merged, see deaths.ttl.

I had to slightly modify the metadata schema: metadata assertions now point to E13 subject nodes instead of named graphs along L11_had_output. Since the range of L11 is D1_Digital_Object, this implies (and a reasoner would infer) that E13 assertions are D1s, i.e. E73_Information_Objects, which is not wrong but maybe worth pointing out.

laletuver1 commented 5 months ago

Meeting notes: Lukas has changed the metadata schema, which Tara will put on the Graph database. A new issue might be necessary for converting all old metadata into the new metadata schema.

lu-pl commented 3 months ago

Ingested deaths data to https://r11.eu/rdf/resource/deaths.

lu-pl commented 3 months ago

Note: Consolidation/merging of named graphs into another named graph can be automated using SPARQL update (INSERT) requests.

This should be implemented in r11cli.
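As a rough sketch of what such an automated merge could look like, the helper below only builds the SPARQL update string; sending it to the store is left out. The function name and the source graph IRI are hypothetical, and nothing here is taken from r11cli.

```python
def merge_update(source_graph: str, target_graph: str) -> str:
    """Build a SPARQL update that copies all triples of source_graph into target_graph."""
    return (
        f"INSERT {{ GRAPH <{target_graph}> {{ ?s ?p ?o . }} }}\n"
        f"WHERE {{ GRAPH <{source_graph}> {{ ?s ?p ?o . }} }}"
    )


# Hypothetical partition graph merged into the deaths graph named in this thread.
print(merge_update(
    "https://example.org/graph/partition-1",
    "https://r11.eu/rdf/resource/deaths",
))
```

SPARQL 1.1 Update also defines `ADD <source> TO <target>`, which has the same effect for whole graphs; the INSERT form is shown because it is the one mentioned above.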

edit: DROPing a named graph would not be reflected in the merged graph though, so one would need to remove the merged triples from the target graph with a SPARQL update before deleting the named graph!

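# Remove from the target (merged) graph every triple that is in the named graph.
# <target_graph> is a placeholder for the merged graph's IRI.
delete { graph <target_graph> { ?s ?p ?o . } }
where {
    graph <named_graph> {
        ?s ?p ?o .
    }
} ;

# Then drop the named graph itself.
drop graph <named_graph>

tla commented 2 months ago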

Hi @lu-pl , concerning the metadata schema, I've just noticed a problem with the timestamps...

star:cd81994d8e a crmdig:D10_Software_Execution ;
    crm:P82_begin_of_the_begin "2024-03-25T08:07:23.267077"^^xsd:dateTime ;

The first issue is that begin_of_the_begin is actually P82a, not P82 itself; the second issue is that crmdig:D10_Software_Execution is a subclass of E7 (Activity), not E52 (Time-Span), which is what the domain of P82* is supposed to be. So this would need to be rewritten to something like

star:cd81994d8e a crmdig:D10_Software_Execution ;
    crm:P4_has_time-span [ crm:P82a_begin_of_the_begin "2024-03-25T08:07:23.267077"^^xsd:dateTime ] ;

lu-pl commented 2 months ago

hi @tla, the metadata issue should be fixed, see deaths.ttl.

LODKit now has a feature for Ontology derived ClosedNamespaces, so at least typos won't be an issue anymore.
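To illustrate why a closed namespace catches this kind of typo, here is a small stdlib-only sketch of the general idea. It is not LODKit's or rdflib's actual API; the class name `ClosedVocabulary` and the hand-written term list are assumptions made for the example.

```python
class ClosedVocabulary:
    """A namespace that only hands out IRIs for a fixed set of terms."""

    def __init__(self, base: str, terms: set[str]) -> None:
        self._base = base
        self._terms = frozenset(terms)

    def __getitem__(self, name: str) -> str:
        # Item access also covers terms that are not valid Python
        # identifiers, such as "P4_has_time-span".
        if name not in self._terms:
            raise KeyError(f"{name!r} is not a term in <{self._base}>")
        return self._base + name

    def __getattr__(self, name: str) -> str:
        # Only reached for attributes not found normally, i.e. vocabulary terms.
        if name.startswith("_"):
            raise AttributeError(name)
        if name not in self._terms:
            raise AttributeError(f"{name!r} is not a term in <{self._base}>")
        return self._base + name


# Hand-picked subset of CIDOC CRM terms, for illustration only.
crm = ClosedVocabulary(
    "http://www.cidoc-crm.org/cidoc-crm/",
    {"P4_has_time-span", "P82a_begin_of_the_begin"},
)
```

With this in place, `crm.P82a_begin_of_the_begin` returns the full IRI, while the misspelled `crm.P82_begin_of_the_begin` raises `AttributeError` at generation time instead of ending up in the output.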