buda-base / xmltoldmigration

App to migrate from TBRC XML files to BDRC RDF LD
Apache License 2.0

migrating to new log entries #127

Closed eroux closed 3 years ago

eroux commented 4 years ago

@xristy I'm going to need your help on this... Here are a few interesting things to look at:

related to https://github.com/buda-base/owl-schema/issues/170

xristy commented 4 years ago

I0886 has a qc date, I think we could give it its own log entry, which could be a new adm:ContentQCLogEntry?

That can be done; however, that info was not collected after the earliest volumes were scanned since formal attempts at qc'ing were abandoned

I0886 and W22084 have a first log entry that clearly doesn't represent their creation... I'm thinking maybe we should add the graph creation log entry only on the RIDs that are in the new form, like W4CZ309048 (or perhaps the dlms has a default commit message for the creation?)

The dlms editor doesn't have a default creation message.

We could use :notAfter for the old-format RIDs. Notice for example that I0886 ... were marked as qc'd in 2003. The records in eXist-db were migrated from MySql in 2005 or so, and there was no creation date kept in MySql.

still in I0886, I think the line Satluj Siti Enterprises</ig:scanning> could be transformed into a content creation log entry with a log message (and no date)?

You could do that. I don't know how helpful it would be.

which user does the sync script use? Am I right to say that all the entries with who="Imagegroups Updater" are sync entries?

Yes those are sync entries

Has the user ever changed?

Yes. I've changed the user several times. Most recently it seemed a good idea to use the name of the script when applicable so I'd have some idea what code did the work.

I don't think the message Updated total pages is an unambiguous indication of the sync process as some other scripts have the same message, like in I0886

The phrase "Updated total pages" is only used for sync'ing. It sometimes happens that a sync has run at two different times. Usually to correct for some error - image reordering or duplicate deletion, etc.
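That rule can be sketched in a few lines of stdlib Python; this is an illustration of the classification discussed here, not the actual migration code, and the function name and sample XML are made up:

```python
# Hedged sketch: classify an XML log <entry> as a sync entry, per the
# rule above. is_sync_entry and the sample log are illustrative only.
import xml.etree.ElementTree as ET

def is_sync_entry(entry):
    """True if the <entry> element looks like a sync log entry."""
    who = entry.get("who", "")
    message = (entry.text or "").strip()
    # who="Imagegroups Updater" is always a sync; the message
    # "Updated total pages" is, per the discussion, only used by sync scripts
    return who == "Imagegroups Updater" or message.startswith("Updated total pages")

log = ET.fromstring(
    '<log>'
    '<entry when="2020-01-08T22:37:32.318Z" who="Imagegroups Updater">Updated total pages</entry>'
    '<entry when="2016-03-31T17:27:09.458-04:00" who="Code Ferret">Added ondisk</entry>'
    '</log>'
)
print([is_sync_entry(e) for e in log.findall("entry")])  # [True, False]
```

As the follow-up comments show, the message-based half of this rule later needs date-based exceptions for the 2016 batch runs.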

eroux commented 4 years ago

That can be done; however, that info was not collected after the earliest volumes were scanned since formal attempts at qc'ing were abandoned

ok yes, let's do that then

Let's give up on the :notAfter for the creation... if we don't know, we don't know, and that's fine

You could do that. I don't know how helpful it would be.

Well, we would know that the scanning was made by Satluj Siti Enterprises. But actually this info is more commonly put in the scanInfo field... maybe I should just put it there...?

The phrase "Updated total pages" is only used for sync'ing. It sometimes happens that a sync has run at two different times. Usually to correct for some error - image reordering or duplicate deletion, etc.

Oh ok! So in that case there were two syncs, one in 2016 and one in 2020? Or three counting Updated total pages and added page list?

Finally it seems to me that these entries:

<entry when="2016-03-30T12:20:30.571-04:00" who="Code Ferret">Updated total pages</entry>
<entry when="2016-03-31T17:27:09.458-04:00" who="Code Ferret">Added ondisk</entry>

are common to all the works that were on tbrc.org around 2016 right? (I'm not sure what to do with that info, but... that's something)

xristy commented 4 years ago

I think putting the scanning text from the imagegroup in the bdo:scanInfo on Instances makes most sense.

Oh ok! So in that case there were two syncs, one in 2016 and one in 2020? Or three counting Updated total pages and added page list?

The phrase "added page list" was transient in the imagegroup-updater scripts, since we were still modifying the sync flow.

Initially in late March of 2016 there was a batch run just to add total page info for some works from a csv. Then just onDisk info was added with no page list:

<ig:description type="ondisk"/>

And a bit later imagegroup-updater was modified to have:

<ig:description type="ondisk">{$pageList}</ig:description>

Along with updating the total pages as in:

<ig:images total="624" tbrcintro="2" intro="0" text="620"/>
eroux commented 4 years ago

Ok! So, to summarize and check that I understood correctly:

<entry when="2016-03-30T12:20:30.571-04:00" who="Code Ferret">Updated total pages</entry>

is sync,

<entry when="2016-03-31T17:27:09.458-04:00" who="Code Ferret">Added ondisk</entry>

is graph modification,

<entry when="2016-04-28T23:50:58.855Z" who="Code Ferret">Updated total pages and added page list</entry>

is graph modification

<entry when="2020-01-08T22:37:32.318Z" who="Imagegroups Updater">Updated total pages</entry>

is sync.

Does it sound right?

xristy commented 4 years ago

<entry when="2016-03-30T12:20:30.571-04:00" who="Code Ferret">Updated total pages</entry> is sync,

it was an early updating of some bookkeeping in the imagegroup record. The actual sync'ing of images to the archive had already happened sometime in the past. The March 2016 timeframe was when we started encoding info in the imagegroup record to support better S3 usage. In the days before using AWS, the app server simply interrogated the local disk for whether images were present or not for a given volume. There was a relatively short period of developing the info that would be recorded in the imagegroup records to avoid treating S3 as a local disk, which is why there are a bunch of odd records around that time.

<entry when="2016-03-31T17:27:09.458-04:00" who="Code Ferret">Added ondisk</entry> is graph modification,

This and the above can be considered graph modification or an upper bound on the sync date (which may or may not be useful)

<entry when="2016-04-28T23:50:58.855Z" who="Code Ferret">Updated total pages and added page list</entry> is graph modification

by this point these should be syncs. The imagegroup-updater sets the onDisk, page list, and total page info at the time of sync'ing

<entry when="2020-01-08T22:37:32.318Z" who="Imagegroups Updater">Updated total pages</entry> is sync.

This is the same as the just-previous record, except that I substituted the script name for my username and truncated the message.

I'm sorry this is confusing, but there was a bit of implementing parts of the system while it was in production. It's taken a bit of today for me to dredge up the memories.

eroux commented 4 years ago

Excellent, thanks!

eroux commented 4 years ago

What about the following algorithm:

type = "graphupdate"
if message.startswith("Updated total pages"):
    if date not in ["2016-03-31T17:27:09.458-04:00", "2016-04-28T23:50:58.855Z"]:
        type = "sync"

?

I think the "2016-04-28T23:50:58.855Z" entry is not really a sync either, as it seems to have occurred in all image groups, for instance I3207
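Folding that exclusion in, the proposed algorithm can be written as runnable Python; the exact set of excluded batch timestamps is assembled from the entries mentioned in this thread and is an assumption, not a confirmed list:

```python
# Sketch of the classifier proposed above. BATCH_DATES is an assumed
# exclusion list drawn from the 2016 batch runs discussed in this thread.
BATCH_DATES = {
    "2016-03-30T12:20:30.571-04:00",  # early bookkeeping batch run
    "2016-03-31T17:27:09.458-04:00",  # "Added ondisk" batch run
    "2016-04-28T23:50:58.855Z",       # page-list batch run seen in all image groups
}

def entry_type(message, date):
    """Classify a log entry message/date pair as "sync" or "graphupdate"."""
    if message.startswith("Updated total pages") and date not in BATCH_DATES:
        return "sync"
    return "graphupdate"

print(entry_type("Updated total pages", "2020-01-08T22:37:32.318Z"))  # sync
print(entry_type("Updated total pages and added page list",
                 "2016-04-28T23:50:58.855Z"))                         # graphupdate
```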

xristy commented 4 years ago

Yes that's apparently part of the attempt to get the eXist-db consistent with the S3 archive

eroux commented 4 years ago

ok, I'll implement that!

eroux commented 4 years ago

@xristy I think we should have a better detection of the moment of creation of a graph/file... I'm wondering if this algorithm would work:

if re.match(r"^[A-Z]+\d+$", rid):
    # example: W22084, denoting a pre-dlms record?
    if first_log_entry.startswith(("new ", "create", "add new ")):
        first_log_entry_type = "creation"
else:
    # example: W2PD17468, denoting a dlms record?
    first_log_entry_type = "creation"

would that work?
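For concreteness, the heuristic can be packaged as a small runnable function; the creation-message prefixes are the ones proposed above, and `first_entry_is_creation` is a hypothetical helper name, not anything in the migration code:

```python
# Runnable sketch of the creation-detection heuristic above.
import re

OLD_STYLE_RID = re.compile(r"^[A-Z]+\d+$")  # e.g. W22084, P1583 (pre-dlms)

def first_entry_is_creation(rid, first_log_message):
    if OLD_STYLE_RID.match(rid):
        # old-style RID: trust the first entry only if it reads like a creation
        return first_log_message.lower().startswith(("new ", "create", "add new "))
    # new-style dlms RID, e.g. W2PD17468: take the first entry as the creation
    return True

print(first_entry_is_creation("W22084", "created record"))  # True
print(first_entry_is_creation("W22084", "fixed typo"))      # False
print(first_entry_is_creation("W2PD17468", "fixed typo"))   # True
```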

xristy commented 4 years ago

I really don't know if that works. Do you have some examples? I don't think we have any graph creation info for pre-dlms records. It seems to me we can just use the adm:UpdateGraph and remain silent on creation of those records unless you want to use :notAfter based on the earliest dated info.

eroux commented 4 years ago

hmmm I think that's what I'm doing in my algorithm? I guess another way to ask the same question is:

Am I right to say that:

eroux commented 4 years ago

just to clarify, ^[A-Z]+\d+$ matches RIDs with some letters at the beginning, and then only numbers, such as W22084 or P1583

xristy commented 4 years ago

Yes I understand the regex.

I was not agreeing that the earliest log entry represents creation of the record.

If there must be a creation record then it should read like

  "created before adm:LogDate of this adm:CreateGraph entry"
eroux commented 4 years ago

Here are some interesting exceptions to our model / migration:

eroux commented 4 years ago

Also, in order to have better data we should be able to distinguish between a batch update / import and a manual modification... I think having an adm:BatchUpdateGraph would be enough... wdyt?

xristy commented 4 years ago

Distinctions:

Since I am not clear on the extent of what you envision as applications of recording various sorts of distinctions about the evolution of the metadata (not to mention the contents: image, etext, or perhaps pictorial contents, e.g., tsakle), I'm not sure to what degree distinguishing among representations is helpful or not.

We can approximate the representations by saying that old-style RIDs were originally recorded as tabular data and the new-style RIDs as xml records. There are as yet no metadata entities recorded directly as rdf graphs by our staff, although there are some rdf graph representations imported from third parties. But do we know whether those were initially recorded as rdf graphs or in some other representation and then re-represented as rdf graphs?

We could eschew graph in favor of metadata: so adm:MetadataInitiallyRecorded instead of adm:CreateGraph, and indicate adm:MetadataInitiallyRecorded at a particular date-time by some agent (person or script) in some representation (an sql table or a spreadsheet can both count as tabular) with some comment.

For imagegroups or other embedded entities such as occur w/ outlines we could use adm:EmbeddedMetadataInitiallyRecorded or adopt a convention that adm:MetadataInitiallyRecorded is used for entities whether embedded or not.

There may be a useful distinction between the creation/importation/recording of a single, manually entered cluster of metadata versus a batch of mechanically derived clusters. This suggests an additional property in a log entry to distinguish between the user and the agent. So we would be able to express that:

Previously, the pubinfo metadata was usually recorded together with a work's metadata and often published at the same time, for example:

2017-01-04T14:35:41.680Z for pubinfo
2017-01-04T14:35:41.955Z for work

but for completeness it seems reasonable to include the creation/importation/recording of the metadata in the (graph) representation.

Since the pubinfo is the basis of Instance metadata versus Work metadata and the two sorts of metadata are generally in distinct graphs it seems appropriate to provide information about the initial recording in each of the graphs

xristy commented 4 years ago

I should have added things like adm:MetadataUpdated, adm:MetadataWithdrawn and so on.

xristy commented 4 years ago

The batch versus single entity distinction is established via the agent

xristy commented 4 years ago

There are various observables regarding an entity, such as weight and linear dimensions. There are also items of convention such as birth-date, name and title, students, authors and so on. We can agree to refer to all of this as data and then bits such as log-entries and repo-id and version-id are metadata. So only adm: is metadata.

eroux commented 4 years ago

I'm almost done with the changes but there is still one case that we should clarify: outlines. They have their own log entries and they are currently merged with the instance (because the two entities merged). It's not totally unreasonable, but I'm thinking we should distinguish the log entries in order to attribute things properly... I think if we have something like

adm:UpdateOutlineData rdfs:subClassOf adm:UpdateData .
adm:CreateOutlineData rdfs:subClassOf adm:UpdateData .

(but the second could also be a subclass of adm:InitialDataCreation). wdyt?

xristy commented 4 years ago

I'll add

adm:UpdateOutlineData rdfs:subClassOf adm:UpdateData .
adm:InitialOutlineData rdfs:subClassOf adm:InitialDataCreation .
eroux commented 4 years ago

While writing the code I realized that adm:InitialOutlineDataImport is also necessary

xristy commented 4 years ago

I was thinking that also

adm:InitialOutlineDataImport rdfs:subClassOf adm:InitialDataImport .
xristy commented 4 years ago

Just checking the migration to testrw I see at least one problem. In bdg:W1 I see:

bda:LGIGS001  a         adm:UpdateData ;
        adm:logDate     "2016-03-30T16:20:30.571Z"^^xsd:dateTime ;
        adm:logMessage  "Updated total pages"@en ;
        adm:logMethod   bda:BatchMethod ;
        adm:logWho      <http://purl.bdrc.io/resource-nc/user/U00006> .
bda:LGIGS002  a         adm:UpdateData ;
        adm:logDate     "2016-03-31T21:27:09.458Z"^^xsd:dateTime ;
        adm:logMessage  "Added ondisk"@en ;
        adm:logMethod   bda:BatchMethod ;
        adm:logWho      <http://purl.bdrc.io/resource-nc/user/U00006> .
bda:LGIGS003  a         adm:UpdateData ;
        adm:logDate     "2016-04-28T23:50:58.855Z"^^xsd:dateTime ;
        adm:logMessage  "Updated total pages and added page list"@en ;
        adm:logMethod   bda:BatchMethod ;
        adm:logWho      <http://purl.bdrc.io/resource-nc/user/U00006> .

and in W01CT0060

bda:LGIGS001  a  adm:UpdateData .

bda:LGIGS002  a         adm:UpdateData ;
        adm:logDate     "2016-03-31T21:27:09.458Z"^^xsd:dateTime ;
        adm:logMessage  "Added ondisk"@en ;
        adm:logMethod   bda:BatchMethod ;
        adm:logWho      <http://purl.bdrc.io/resource-nc/user/U00006> .

This is going to make a bit of a mess I fear.

eroux commented 4 years ago

Hmm it's on purpose... what kind of mess are you anticipating?

xristy commented 4 years ago

I guess it's fine if you intend it. You'll know what sort of queries to write and so on. Most of the log entries in instances and persons and so on use a 16 hex-digit id along with many other internal ids. So these caught my attention, since they obviously aren't trying to be unique except within a given graph.

What are you encoding in the ids this way?

eroux commented 4 years ago

In the id of the log entries? Nothing, I just generate one id per batch change on the same timestamp
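A sketch of that id scheme, assuming local ids of the form LGIGSnnn as in the trig excerpts above: one sequential id per distinct batch timestamp, reused across graphs. The function is illustrative, not the actual migration code:

```python
# Illustrative sketch: assign one local log-entry id per distinct
# batch timestamp, as described above. Naming (LGIGSnnn) is assumed
# from the trig examples in this thread.
def batch_log_ids(timestamps):
    """Map each distinct timestamp to a sequential local id."""
    ids = {}
    for ts in timestamps:
        if ts not in ids:
            ids[ts] = "LGIGS%03d" % (len(ids) + 1)
    return ids

print(batch_log_ids([
    "2016-03-30T16:20:30.571Z",
    "2016-03-31T21:27:09.458Z",
    "2016-03-30T16:20:30.571Z",  # same batch, same id
]))
```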

xristy commented 4 years ago

ok. That doesn't seem true for bda:LGIGS001. I'll drop it. It was just unexpected to me.

eroux commented 4 years ago

Oh I see what you mean now... sorry, I'm on my phone so things are not very clear, I shall inspect

eroux commented 4 years ago

hmmm I just checked and the file on the git repo on buda2 seems fine (/bdrc-git-repos/iinstances/36/W01CT0060.trig)... and I see the expected

bda:LGIGS001  a         adm:UpdateData ;
        adm:logDate     "2016-03-30T16:20:30.571Z"^^xsd:dateTime ;
        adm:logMessage  "Updated total pages"@en ;
        adm:logMethod   bda:BatchMethod ;
        adm:logWho      bdu:U00006 .

when I query

construct {?s ?p ?o} where {
  graph <http://purl.bdrc.io/graph/W01CT0060> { ?s ?p ?o }
}

... how did you get the empty

bda:LGIGS001  a  adm:UpdateData .

?

xristy commented 4 years ago

I ran the same query

xristy commented 4 years ago

Ah! I had the limit set too low and didn't get the whole graph.

eroux commented 4 years ago

aaaaaah I see! interesting

eroux commented 4 years ago

There are a few other patterns that could be handled better by the code (although they're mostly fine now):

@xristy do you see other cases of batch imports?

eroux commented 4 years ago

Another small bug: due to inconsistencies in the date format of the qc element, the current code sometimes migrates days as months, as in https://www.tbrc.org/xmldoc?rid=I4138
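To illustrate the day/month ambiguity with a stdlib example (the sample date string is hypothetical, not taken from I4138):

```python
# The same string parses to different months depending on the assumed
# format, which is the kind of swap described above.
from datetime import datetime

raw = "08/04/2003"
day_first = datetime.strptime(raw, "%d/%m/%Y")    # 8 April 2003
month_first = datetime.strptime(raw, "%m/%d/%Y")  # 4 August 2003
print(day_first.month, month_first.month)  # 4 8
```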

eroux commented 4 years ago

I wonder also how the NLM records should be attributed...

xristy commented 4 years ago

All of the NLM records have been imported from csv via

/db/modules/admin/works-nlm-import.xql

and marked as such like:

<entry when="2018-10-28T18:29:33.12Z" who="NLM Importer">imported W1NLM1</entry>

You may refer to nlm for the logs from various test and import runs.

The attribution is in the catalogInfo.

eroux commented 4 years ago

Yes, I meant something prior to that... like which librarian created the line of the csv that we imported... but maybe we don't have this data? or is the csv produced in Mongolia?

xristy commented 4 years ago

There may be some columns in the master google sheets recording info about staff at the NLM that created particular entries.

@TBRC-Travis is the keeper of the master google sheets that he uses to set up batches for import. The master sheets do have a column for a Mongolian person and a date of entry, but in the old versions that I have, the date was rarely if ever filled out, and the person info was deemed, at the time the importing was designed, not to be included in our records.

I can no longer find the big giant master sheet on google drive. I'm sure @TBRC-Travis can point to it.

xristy commented 4 years ago

As for other batch importing, I really don't recall others. From the info I see in the L1RKLnnn records, I had quite forgotten that Ralf's work was originally dealt with in 2007 by Jeff. I think there was more work on the schema done later.

eroux commented 4 years ago

ok perfect! I just wasn't sure who was producing the csv, but it's fine this way