denshoproject / ddr-cmdln

Command-line tools for automating the Densho Digital Repository's various processes.
Other
0 stars 2 forks source link

record_created set to wrong value #44

Closed gjost closed 7 years ago

gjost commented 7 years ago

philip @ 7/14/2017 9:54AM

The [Record Created] field for ddr-testing-40003 says it was created Thursday 13 July 2015, 03:15 PM PDT - which I believe to be correct. However the [Record Created] field for object ddr-testing-40003-1 says it was created Thursday 13 July 2015, 01:48 PM PDT - which is not possible - I was either at lunch or had just gotten back and was not working in the testing repos anyway. Not sure how problematic this potentially could be, just something of note.

geoffJ @ 7/14/2017 10:33AM

Second odd thing - and I am not sure that this is related. The [Record Created] field for ddr-testing-40003 says it was created Thursday 13 July 2015, 03:15 PM PDT - which I believe to be correct. However the [Record Created] field for object ddr-testing-40003-1 says it was created Thursday 13 July 2015, 01:48 PM PDT - which is not possible - I was either at lunch or had just gotten back and was not working in the testing repos anyway. Not sure how problematic this potentially could be, just something of note.

Before we launch into conspiracy theories, check the datetime on that VM, it's probably off. Whenever you close a VM, saving the state, and then wake it up again you'll get clock drift. In my dev environment I always restart NTP after waking up a VM ("sudo service ntp restart") or else all my Git commit timestamps are off. In the Densho HQ it's probably best to shutdown VMs rather than saving state. Also, the host machine clock could be off.

cameron @ 7/14/2017 11:01AM

Update on the time error:

Philip and I noticed that in multiple collections, the [Record Creation] time-stamp in editor is subtracting a fixed amount of time from the initial time that the entire collection was created and then applying that resulting time signature globally across all of the collections' objects.

A formulaic depiction of the error is as follows:

(Valid Collection Creation Time) - n = Globally applied object [Record Creation] value.

  -Philip and I have looked at collections within ddr-testing and ddr-densho and have seen this popping up in numerous collections' objects, although the value of n differs across certain date-ranges of the collections' creation. 

 -Time drift has ranged from a few minutes to multiple days

*also, for time error, I should note that we checked the VM clocks and they were valid.

gjost @ 11:20am

Maybe you're looking at collections that were imported into the DDR system from CSV files (I don't have a list but Senor Froh probly does). You'd see a collection created at one time and then large numbers of entities or files all created at the same time.

gjost @ 11:31am

Were ddr-testing-40003 and ddr-testing-40003-1 created on/from the same machine? Your setup there is for everything to be stored on the Qumulo but the JSON files (and the timestamps) come from whatever VM is writing them.

One possiblity is that the background process (celery) that actually writes and commits most files might be lagging but I wouldn't expect it do lag by more than a minute or two.

Also, there's a changelog for every entity in the entity's folder (the file is called "changelog"). Every modification that's done through the app should be recorded there*.

philip @ 11:31am

Maybe you're looking at collections that were imported into the DDR system from CSV files (I don't have a list but Senor Froh probly does). You'd see a collection created at one time and then large numbers of entities or files all created at the same time.

this might possibly exist in older collections, but I am seeing this in collections which I know were not imported from a CSV ddr-densho-332, which was made in May of 2017 and ddr-testing-40003 which was made yesterday. ddr-densho-330 (which was made in Feb and was worked on from March through April) also has timestamp differences, though the differences are not the same for every entity (ranges from two minutes to 50+ hours).

froh @ 11:44am

There are definitely a large number of collections -- especially the earlier ddr-densho repos, most of the ddr-njpa, and a bunch of others -- that were originally created from import. The collection repo itself would have been created first -- i.e., collection.json -- then the entities would have be imported from csv.

Can you give me a sample list of a couple of the collections where you see this behavior?

And just to clarify, this isn't happening when you create new objects through the webui in a new collection, right?

philip @ 11:45am

Were ddr-testing-40003 and ddr-testing-40003-1 created on/from the same machine? Your setup there is for everything to be stored on the Qumulo but the JSON files (and the timestamps) come from whatever VM is writing them.

Yes, they were made on the same machine and at this time all the machines are running the same VM appliance.

One possiblity is that the background process (celery) that actually writes and commits most files might be lagging but I wouldn't expect it do lag by more than a minute or two. Also, there's a changelog for every entity in the entity's folder (the file is called "changelog"). Every modification that's done through the app should be recorded there*.

looking at the entity changelog for ddr-testing-40003-1 it has the same initialization time that the collection changelog has: Thu, 13 July 2017, 03:18 PM PDT. the [Record Created] field in Editor says Thu, 13 July 2017, 01:48 PM PDT

  • Manual modifications (i.e. using a text editor) are usually not recorded unless somebody also changes the changelog. Batch operations may also neglect to update the changelog.

Also, looking at older collection/entity changelogs, the initialization/update time/date format is different. on newer collections it is (example from ddr-testing-40003-1 entity initialization): Thu, 13 July 2017, 03:18 PM PDT on older collections it is (example from ddr-densho-303 collection initialization): Mon, 22 Jun 2017 11:40:50 (note that the older format is on a 24-hour clock)

philip @ 11:52am

Can you give me a sample list of a couple of the collections where you see this behavior?

as in collections imported from a csv exhibiting differences in timestamps? or just collections with differences in timestamps? if you mean the later: ddr-densho-303, ddr-densho-330, ddr-densho-332, ddr-densho-325, and all of the ddr-testing collections so far.

And just to clarify, this isn't happening when you create new objects through the webui in a new collection, right?

this IS happening with new objects in new collections through the webui

gjost commented 7 years ago

After a deep-dive code review we established that ddr-defs sets default values for record_created and record_lastmod to datetime.now(). The effect is that record_created is... whatever time you last started the app and caused the model definitions files to be read. [facepalm]

We need to a) remove the default value from the models definitions. b) find the right place to run datetime.now(), either when the object is instantiated or when it is written for the first time.

We also need to, for each object in each collection, consult git-log and replace record_created with whatever we find there.