ffdev-info / wikidp-issues

An issues repository for resolving issues in Wikidata around the records relating to Digital Preservation
GNU General Public License v3.0

TrID reference details have no retrieved by date #23

Open ross-spencer opened 3 years ago

ross-spencer commented 3 years ago

Description of problem

The new TrID patterns in Wikidata have no provenance date, which results in a large number of linting messages (~9000) in the Siegfried Wikidata identifier. Example below. As this affects the quality of the linting messages for Wikidata, a decision should be reached about what we can do here.

Options seem to be:


ross-spencer commented 3 years ago

@emulatingkat I suspect that, with the new data, I just need to create a new heuristic that says "if TrID, then ignore referenceDate", but what do you think? I presume this information isn't available because we don't know when each record was added to TrID? On PRONOM, though, I believe the reference date is usually the date the record was added to PRONOM, so I am wondering if there is still a window to add this information to Wikidata for TrID.
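A minimal sketch of that heuristic, for discussion only. The `Reference` shape, the `DATE_OPTIONAL_SOURCES` set, and the message text are all hypothetical and are not taken from Siegfried's code; the idea is simply that a missing date only produces a linting message when the source is not on an allow-list.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical record shape for this sketch; not Siegfried's data model.
@dataclass
class Reference:
    source: str                    # e.g. "PRONOM" or "TrID"
    reference_date: Optional[str]  # ISO date string, or None when absent


# Assumption: sources for which a missing date is tolerated.
DATE_OPTIONAL_SOURCES = {"TrID"}


def lint_reference(ref: Reference) -> List[str]:
    """Return linting messages for a single reference."""
    messages = []
    # Only complain about a missing date when the source is expected to have one.
    if ref.reference_date is None and ref.source not in DATE_OPTIONAL_SOURCES:
        messages.append(f"{ref.source}: reference has no date")
    return messages
```

The trade-off is that an allow-list like this hides the gap rather than fixing it, so it only makes sense if the date really cannot be recovered for TrID.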

ross-spencer commented 3 years ago

Related to https://github.com/richardlehane/siegfried/issues/160

ross-spencer commented 3 years ago

Although looking again @emulatingkat it seems like date might be available for formats in TrID, e.g. Progress V9 (2017):

<TrID ver="2.00">
    <Info>
        <FileType>PROGRESS Procedure Library (v9)</FileType>
        <Ext>PL</Ext>
        <Mime>application/octet-stream</Mime>
        <ExtraInfo>
            <Rem></Rem>
            <RefURL>http://progress-tools.x10.mx/winpl.html</RefURL>
        </ExtraInfo>
        <User>Marco Pontello</User>
        <E-Mail>marcopon@gmail.com</E-Mail>
        <Home>http://mark0.net</Home>
    </Info>
    <General>
        <FileNum>5</FileNum>
        <Date>
            <Year>2017</Year>
            <Month>3</Month>
            <Day>19</Day>
        </Date>
        <Time>
            <Hour>1</Hour>
            <Min>37</Min>
            <Sec>23</Sec>
        </Time>
        <Creator>TrIDScan/Py v2.02</Creator>
    </General>
    <FrontBlock>
        <Pattern>
            <Bytes>D7077</Bytes>
            <Pos>0</Pos>
        </Pattern>
    </FrontBlock>
</TrID>

Or, picking one at random, zCalc (2007):

<TrID ver="2.00">
    <Info>
        <FileType>zCalc data</FileType>
        <Ext></Ext>
        <ExtraInfo>
            <Rem></Rem>
            <RefURL>http://www.zcalc.com</RefURL>
        </ExtraInfo>
        <User>Marco Pontello</User>
        <E-Mail>marcopon@gmail.com</E-Mail>
        <Home>http://mark0.net</Home>
    </Info>
    <General>
        <FileNum>12</FileNum>
        <CheckStrings>True</CheckStrings>
        <Date>
            <Year>2007</Year>
            <Month>09</Month>
            <Day>21</Day>
        </Date>
        <Time>
            <Hour>16</Hour>
            <Min>36</Min>
            <Sec>26</Sec>
        </Time>
        <Creator>TrIDScan32 v1.56</Creator>
    </General>
    <FrontBlock>
        <Pattern>
            <Bytes>C045</Bytes>
            <ASCII> . E</ASCII>
            <Pos>0</Pos>
        </Pattern>
        <Pattern>
            <Bytes>64</Bytes>
            <ASCII> d</ASCII>
            <Pos>3</Pos>
        </Pattern>
    </FrontBlock>
    <GlobalStrings>
        <String>ZCALCPERSISTENTDEPENDENCIES</String>
        <String>FORM</String>
        <String>DATE</String>
        <String>TEXT</String>
        <String>TYPE</String>
        <String>MAIN</String>
    </GlobalStrings>
</TrID>

Is there room still to think about adding this information?

By the way, it's really cool to see all the TrID signatures in Wikidata. Nice work! This is just the first time I'm seeing it in Siegfried.
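As a rough illustration of how the date could be pulled out of definitions like the two above, here is a sketch using Python's standard `xml.etree.ElementTree`. The function name `trid_definition_date` and the trimmed `SAMPLE` string are made up for this example; it assumes the date always lives under `General/Date` with `Year`, `Month`, and `Day` children, as in the samples, and returns `None` otherwise.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import Optional


def trid_definition_date(xml_text: str) -> Optional[date]:
    """Extract the General/Date value from a TrID definition, if present."""
    root = ET.fromstring(xml_text)
    node = root.find("./General/Date")
    if node is None:
        return None
    try:
        # int() accepts zero-padded values such as "09".
        return date(
            int(node.findtext("Year", "")),
            int(node.findtext("Month", "")),
            int(node.findtext("Day", "")),
        )
    except ValueError:
        # Missing, empty, or out-of-range components.
        return None


# Trimmed-down copy of the zCalc example, kept only for demonstration.
SAMPLE = """<TrID ver="2.00">
    <General>
        <Date><Year>2007</Year><Month>09</Month><Day>21</Day></Date>
    </General>
</TrID>"""

print(trid_definition_date(SAMPLE).isoformat())  # prints 2007-09-21
```

Whether this date reflects creation or the last edit of the definition is not something the XML itself states, so that would need confirming before it is recorded in Wikidata.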

emulatingkat commented 3 years ago

There is still room to add this information. I'd be happy to. Would it be possible for you to provide an example of provenance that works for you? I want to be sure I express the provenance as expected.

ross-spencer commented 3 years ago

Thanks @emulatingkat, I describe the provenance in a little more detail here: https://github.com/richardlehane/siegfried/issues/160

So, we're anticipating stated in and retrieved.

So we're just missing the date for TrID and, per the issue above, I am thinking about removing the restriction of requiring both, but I'm not sure what you think? Adding the date is especially helpful for TrID because the number of records is so high: if the date is important to us, we wouldn't want a large number of warnings about it being missing.

There are one or two other formats that have slightly different reference information like SLOB: https://www.wikidata.org/wiki/Q98923420 but I am thinking about adding a stated in property for that one. That being said, we haven't really discussed it in more detail - stated in and retrieved were simply the most consistent values across Wikidata for the first iteration of this.

emulatingkat commented 3 years ago

I think we will need to reflect on the set of possible ways for Wikidata to express provenance. It is likely that there is not one single pattern. For example, retrieved (P813) is used to indicate the day the Wikidata editor visited TrID to find this information. If we want to express the date provided in the TrID record, I think "last update" (P5017) would be more suitable. I recognize that having multiple patterns makes work on your end more complicated. Happy to discuss additional possibilities.

ross-spencer commented 3 years ago

That makes sense @emulatingkat. Both properties seem really useful. I'm not sure how much it affects what we want from Siegfried. A discussion would be good. My feeling is last update is what we really want to report on and retrieved is helpful for other provenance reasons but might not be the date shown in the Siegfried report.

Unfortunately, that does mean that most of the dates we have aren't what we need, so making a decision, documenting it, and figuring out how to record that info for the legacy records might be good next steps. For TrID, if we have access to that data for last update, then it does seem like a good no-regrets option, and retrieved, I imagine, could be added as part of any script that does that?
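To make the preference concrete, here is a small sketch of how a report could choose between the two properties. The flat dictionary keyed by property ID is a simplification invented for this example and is not the Wikidata JSON format; `report_date` is likewise a hypothetical helper.

```python
from typing import Optional, Tuple

LAST_UPDATE = "P5017"  # Wikidata "last update"
RETRIEVED = "P813"     # Wikidata "retrieved"


def report_date(reference: dict) -> Optional[Tuple[str, str]]:
    """Pick the date to report: last update if present, otherwise retrieved."""
    for prop in (LAST_UPDATE, RETRIEVED):
        value = reference.get(prop)
        if value:
            # Return the property too, so the report can say which date it is.
            return prop, value
    return None


# A TrID-style reference carrying both dates, and a legacy one with only retrieved.
print(report_date({"P5017": "2017-03-19", "P813": "2020-09-01"}))
print(report_date({"P813": "2020-09-01"}))
```

Returning the property alongside the value keeps the legacy records usable while making it visible that their date is a retrieval date rather than a last-update date.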

emulatingkat commented 3 years ago

Good points. Sounds good to me.