Open ross-spencer opened 3 years ago
@emulatingkat I suspect with the new data I just need to create a new heuristic that says, if TrID then ignore referenceDate, but what do you think? I presume this information isn't available because we don't know when it was added to TrID? Though on PRONOM I believe reference date is usually the date it was added to PRONOM, so I am wondering if there is still a window to add this information to Wikidata for TrID?
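For illustration only, the "if TrID then ignore referenceDate" heuristic could look something like the sketch below. This is not Siegfried's actual linting code (Siegfried is written in Go); the function name, the "TrID" source label, and the message strings are all hypothetical.

```python
from typing import List, Optional

# Hypothetical label for records whose provenance source is TrID.
TRID = "TrID"


def lint_provenance(source: Optional[str], reference_date: Optional[str]) -> List[str]:
    """Return linting messages for one record's provenance.

    A missing date is only reported for non-TrID sources, which is the
    heuristic floated in this comment, not a confirmed design.
    """
    messages = []
    if not source:
        messages.append("no source recorded")
    if not reference_date and source != TRID:
        messages.append("no reference date recorded")
    return messages
```

With this shape a TrID record without a date lints clean, while a record from any other source still produces the date warning.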
Although looking again @emulatingkat it seems like date might be available for formats in TrID, e.g. Progress V9 (2017):
<TrID ver="2.00">
<Info>
<FileType>PROGRESS Procedure Library (v9)</FileType>
<Ext>PL</Ext>
<Mime>application/octet-stream</Mime>
<ExtraInfo>
<Rem></Rem>
<RefURL>http://progress-tools.x10.mx/winpl.html</RefURL>
</ExtraInfo>
<User>Marco Pontello</User>
<E-Mail>marcopon@gmail.com</E-Mail>
<Home>http://mark0.net</Home>
</Info>
<General>
<FileNum>5</FileNum>
<Date>
<Year>2017</Year>
<Month>3</Month>
<Day>19</Day>
</Date>
<Time>
<Hour>1</Hour>
<Min>37</Min>
<Sec>23</Sec>
</Time>
<Creator>TrIDScan/Py v2.02</Creator>
</General>
<FrontBlock>
<Pattern>
<Bytes>D7077</Bytes>
<Pos>0</Pos>
</Pattern>
</FrontBlock>
</TrID>
Or randomly zcalc (2007):
<TrID ver="2.00">
<Info>
<FileType>zCalc data</FileType>
<Ext></Ext>
<ExtraInfo>
<Rem></Rem>
<RefURL>http://www.zcalc.com</RefURL>
</ExtraInfo>
<User>Marco Pontello</User>
<E-Mail>marcopon@gmail.com</E-Mail>
<Home>http://mark0.net</Home>
</Info>
<General>
<FileNum>12</FileNum>
<CheckStrings>True</CheckStrings>
<Date>
<Year>2007</Year>
<Month>09</Month>
<Day>21</Day>
</Date>
<Time>
<Hour>16</Hour>
<Min>36</Min>
<Sec>26</Sec>
</Time>
<Creator>TrIDScan32 v1.56</Creator>
</General>
<FrontBlock>
<Pattern>
<Bytes>C045</Bytes>
<ASCII> . E</ASCII>
<Pos>0</Pos>
</Pattern>
<Pattern>
<Bytes>64</Bytes>
<ASCII> d</ASCII>
<Pos>3</Pos>
</Pattern>
</FrontBlock>
<GlobalStrings>
<String>ZCALCPERSISTENTDEPENDENCIES</String>
<String>FORM</String>
<String>DATE</String>
<String>TEXT</String>
<String>TYPE</String>
<String>MAIN</String>
</GlobalStrings>
</TrID>
Is there room still to think about adding this information?
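As a rough sketch of how that date could be pulled out of a definition like the two above, using only the Python standard library: the function name is made up for this example, and it assumes the Date/Time elements under General record when the definition was generated, which the thread has not confirmed.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from typing import Optional


def trid_definition_datetime(xml_text: str) -> Optional[datetime]:
    """Read General/Date (and General/Time, if present) from a TrID definition.

    Returns None when the Date element or any of its parts is missing.
    """
    root = ET.fromstring(xml_text)
    date = root.find("General/Date")
    if date is None:
        return None
    parts = [date.findtext(tag) for tag in ("Year", "Month", "Day")]
    if any(p is None for p in parts):
        return None
    # int() copes with zero-padded values such as <Month>09</Month>.
    year, month, day = (int(p) for p in parts)
    time = root.find("General/Time")
    if time is None:
        return datetime(year, month, day)
    clock = [int(time.findtext(tag, default="0")) for tag in ("Hour", "Min", "Sec")]
    return datetime(year, month, day, *clock)


# Trimmed-down copy of the zCalc definition shown above.
SAMPLE = """<TrID ver="2.00">
<General>
<Date><Year>2007</Year><Month>09</Month><Day>21</Day></Date>
<Time><Hour>16</Hour><Min>36</Min><Sec>26</Sec></Time>
</General>
</TrID>"""

print(trid_definition_datetime(SAMPLE).date().isoformat())  # 2007-09-21
```

The ISO date string is what would then be carried over to Wikidata, whichever property ends up holding it.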
By the way, it's really cool to see all the TrID signatures in Wikidata. Nice work! This is just the first time I'm seeing it in Siegfried.
There is still room to add this information. I'd be happy to. Would it be possible for you to provide an example of provenance that works for you? I want to be sure I express the provenance as expected.
Thanks @emulatingkat. I describe the provenance in a little more detail here: https://github.com/richardlehane/siegfried/issues/160
So, we're anticipating stated in and retrieved.
So, we're just missing the date for TrID, and per the issue above, I am thinking about removing the restriction of having both, but I'm not sure what you think? Adding the date is especially helpful for TrID because the number of records is so high, and if date is important to us, then we wouldn't want the large number of date warnings for it not being there.
There are one or two other formats that have slightly different reference information, like SLOB (https://www.wikidata.org/wiki/Q98923420), but I am thinking about adding a stated in property for that one. That being said, we haven't really discussed it in more detail; stated in and retrieved were simply the most consistent values across Wikidata for the first iteration of this.
I think we will need to reflect on the set of possible ways for Wikidata to express provenance. It is likely that there is not one single pattern. For example, retrieved (P813) is used to indicate the day the Wikidata editor visited TrID to find this information. If we want to express the date provided in the TrID record, I think last update (P5017) would be more suitable. I recognize that having multiple patterns makes work on your end more complicated. Happy to discuss additional possibilities.
That makes sense @emulatingkat. Both properties seem really useful. I'm not sure how much it affects what we want from Siegfried. A discussion would be good. My feeling is last update is what we really want to report on and retrieved is helpful for other provenance reasons but might not be the date shown in the Siegfried report.
Unfortunately that does mean that most of the dates we have aren't what we need and so making a decision, documenting that, and figuring out how to record that info for the legacy records might be good next steps. For TrID if we have access to that data for last update then it does seem like a good no-regrets option - and retrieved I imagine could be added as part of any script that does that?
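To make the choice concrete, here is one possible way a report could pick its date once both properties exist on a reference. It is a sketch under assumptions: the dict shape is invented for the example, and falling back to retrieved at all is exactly the decision that still needs to be made and documented.

```python
from typing import Dict, Optional, Tuple

LAST_UPDATE = "P5017"  # "last update": the date carried by the source record itself
RETRIEVED = "P813"     # "retrieved": the day the Wikidata editor consulted the source


def report_date(reference: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (property, date) for the report, preferring last update.

    `reference` is a hypothetical mapping of property ID to ISO date string.
    Returns None when neither date is present.
    """
    for prop in (LAST_UPDATE, RETRIEVED):
        value = reference.get(prop)
        if value:
            return prop, value
    return None
```

Returning the property alongside the date lets the report say which kind of date it is showing, which matters for legacy records that only carry retrieved.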
Good points. Sounds good to me.
Description of problem
The new TrID patterns in Wikidata have no provenance date, which results in a large number of linting messages (~9000) in the Siegfried Wikidata identifier. Example below. As this affects the quality of the linting messages for Wikidata, a decision should be reached about what we can do here.
Options seem to be: