UniStuttgart-VISUS / damast

Code for the DH project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval Muslim World" (VolkswagenFoundation)
MIT License

Keeping data synchronous between public version and productive version #91

Closed tutebatti closed 2 years ago

tutebatti commented 2 years ago

@rpbarczok and I asked ourselves what the best procedure will be for changing data after the publication. Certainly, and as discussed before, there should be no major changes or additions to the data that will be published on DaRUS and as part of the live version. However, if minor corrections should become necessary, we need a consistent workflow to keep both the public version and the productive version on the servers of U Stuttgart synchronous. The main issue I see is that the data has different tag sets, right, @mfranke93?

A description of a possible procedure will need to include (1) technical details, (2) a workflow that can also be understood by non-technical persons, and (3) concrete responsibilities for the time being.

mfranke93 commented 2 years ago
  1. To get the "elephant in the room" out of the way: The "minor corrections" part, to me, sounds like there will be no more changes to the production database at VIS either. But that is not what you mean, right? You mean that most database changes will occur only on the production (not productive ;) ) database, but some changes would need to be applied to both.
  2. If you want to reference a dataset at DaRUS, which matches the data in the public version, in the "How to Cite" sections there, then each change to the public database's data would happen in synchrony with an update to the DaRUS data; that is, the workflow would not be to do each minor change (add parentheses to a source name, ...) by itself, but to batch-apply changes at specific intervals when you deem it right. This matches what we discussed in November of 2021. Updating datasets on DaRUS is possible. However, it is not something you should do daily, or weekly. The new version will be linked from the old repository, but, importantly, it will get a new DOI, and the old version will always stay accessible. It is a long-term repository after all. That's a good thing: reports etc. using the old data as a basis will still reference the old version (V1 in that case) in DaRUS, which people can still access as well. The only thing to consider is that, when creating a new version in DaRUS and loading that data into the public version, the URLs and version string in "How to Cite" (Google Doc, 2–5) would need to be updated.
  3. Regarding workflow and documentation thereof, that would need to wait until we have discussed all the details. But in general, a workflow for this is already in place for the "first" version of the DaRUS dataset, which is still just waiting for @rpbarczok to give the go-ahead. I am not sure how usable that will ever be for a non-technical person: there are a lot of technical steps involved, and at the end of the day, someone who at least roughly understands what is happening needs to check whether the result is okay before putting it into DaRUS and the public version. Then, the rough workflow would be (a rough script sketch follows after this list):
    1. Create the DaRUS data dump (there is a script for that).
    2. Upload that to DaRUS as a new version.
    3. Wait until the new version has been approved.
    4. Update the source files regarding the "How to Cite" stuff with the new reference text for DaRUS and the new DOI.
    5. Deploy the updated server software to the public server.
    6. Take the database dump from DaRUS and apply it to the public database.
  4. Please clarify what you mean by "tag set". There is no such entity, neither as a concept nor manifested, in the data model. Regarding how the databases differ, refer to #3.
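For illustration, here is a minimal Python sketch of the recurring release workflow from item 3. Everything in it is an assumption made for the example: the database names (`damast`, `damast_public`), the dump file naming, and the plain `pg_dump`/`psql` calls merely stand in for the project's actual export script, and steps 2–5 (DaRUS upload and approval, "How to Cite" update, deployment) stay manual.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of the recurring release steps (items 3.1-3.6 above).

All names here are assumptions for illustration: the real export script also
performs the data cleaning, and the database/file names will differ. Steps
2-5 are manual and only appear as comments.
"""
import subprocess
from datetime import date

DUMP_FILE = f"damast-public-{date.today().isoformat()}.sql"  # assumed name


def create_public_dump() -> None:
    # Step 1: create the DaRUS data dump. Placeholder for the project's
    # export script; a plain pg_dump is shown here for illustration only.
    subprocess.run(["pg_dump", "--no-owner", "--file", DUMP_FILE, "damast"],
                   check=True)


def restore_public_dump(dump_file: str) -> None:
    # Step 6: apply the dump published on DaRUS to the public database.
    subprocess.run(["psql", "--dbname", "damast_public", "--file", dump_file],
                   check=True)


if __name__ == "__main__":
    create_public_dump()
    # Step 2: upload DUMP_FILE to DaRUS as a new dataset version.
    # Step 3: wait until the new version has been approved.
    # Step 4: update the "How to Cite" texts with the new DOI and version.
    # Step 5: deploy the updated server software to the public server.
    restore_public_dump(DUMP_FILE)
```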
tutebatti commented 2 years ago

production (not productive ;) )

What's the difference?

tutebatti commented 2 years ago
  1. ... You mean that most database changes will occur only on the production (not productive ;) ) database, but some changes would need to be applied to both.

Exactly.

2. that is, the workflow would not be to do each minor change (add parentheses to a source name, ...) by itself, but to batch-apply changes at specific intervals when you deem it right.

Exactly.

3. Regarding workflow and documentation thereof, that would need to wait until we have discussed all the details.

I'm not sure what details you mean.

i. Create the DaRUS data dump (there is a script for that).

This script will always take/dump the DhiMu data only, right?

4. Please clarify what you mean by "tag set". There is no such entity, neither as a concept nor manifested, in the data model.

What I mean is that the current version running on the VIS server includes all data and has the tags DhiMu and eOC (as well as things such as persons ...) that should not be part of the public version. Maybe I do not understand the technical procedure described under point 3. in your post.

mfranke93 commented 2 years ago

production (not productive ;) )

What's the difference?

Two letters ;) "production" is just the common term used for this (see here). And "productive server", to me, sounds as if the server is doing all the work.

mfranke93 commented 2 years ago
  3. Regarding workflow and documentation thereof, that would need to wait until we have discussed all the details.

I'm not sure what details you mean.

Neither am I. The ball is with you right now. I cannot document any precise workflows until #3 is resolved and I know exactly how the export should actually look. Further, I would need to actually do the migration to know if everything works. If I wrote the instructions before that, it would be pure guesswork as to whether the described procedures actually work.

i. Create the DaRUS data dump (there is a script for that).

This script will always take/dump the DhiMu data only, right?

Well, yes, because that is what you asked for. But it could be modified quite easily, even by you, to include or exclude other data. You know how the database is structured.

  4. Please clarify what you mean by "tag set". There is no such entity, neither as a concept nor manifested, in the data model.

What I mean is that the current version running on the VIS server includes all data and has the tags DhiMu and eOC (as well as things such as persons ...) that should not be part of the public version. Maybe I do not understand the technical procedure described under point 3. in your post.

Yes, the whole purpose of the script is to remove data that should not be in the DaRUS dump and the public version. Everything other than the "data cleaning" aspect is just a plain PostgreSQL database backup (at VIS) and restore (at HU). Which data should be removed is something we discussed extensively, for example in #3. Part of that is to (1) remove all evidence without the "DhiMu" tag, and (2) remove five tags, "DhiMu" and "eOC" among them.

In case you are confused about that: there is a difference between removing a tag and removing evidence with that tag. In #3, we are going to remove the "DhiMu" tag because all evidence in the export would have that tag anyway.

If you are still unsure about how the evidence and tags are connected, I encourage you to take a look at the database schema PDF, and the documentation.
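As a purely illustrative sketch of that distinction (and of how such a filter could be parameterized to include or exclude other data), the following uses assumed table and column names (`evidence`, `tag`, `tag_evidence`, `tagname`); the actual damast schema may differ, so the schema PDF remains authoritative.

```python
"""Sketch of the two cleanup operations: removing evidence without the
"DhiMu" tag vs. removing tags themselves. Table/column names are assumed."""
import psycopg2

REMOVED_TAGS = ["DhiMu", "eOC"]  # two of the five tags mentioned above

with psycopg2.connect("dbname=damast_public") as conn, conn.cursor() as cur:
    # (1) Remove evidence that does not carry the "DhiMu" tag:
    #     these rows disappear from the export entirely.
    cur.execute("""
        DELETE FROM evidence e
        WHERE NOT EXISTS (
            SELECT 1
            FROM tag_evidence te
            JOIN tag t ON t.id = te.tag_id
            WHERE te.evidence_id = e.id AND t.tagname = 'DhiMu'
        );
    """)

    # (2) Remove the tags themselves: the remaining evidence stays, it only
    #     loses these labels ("DhiMu" would be on every exported evidence
    #     anyway). Assumes the tag/evidence junction rows cascade on delete.
    cur.execute("DELETE FROM tag WHERE tagname = ANY(%s);", (REMOVED_TAGS,))
```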

tutebatti commented 2 years ago

"production" is just the common term used for this (see here). And "productive server", to me, sounds as if the server is doing all the work.

Interesting. Coming from German, I would say it's the other way around, because "Produktionsserver" sounds odd, while "produktiver Server" sounds ok. In English, anything can work as an attribute...

tutebatti commented 2 years ago

I cannot document any precise workflows until #3 is resolved and I know exactly how the export should actually look.

I see! Thanks for the clarification. We are uncertain how common it would be to use later/revised versions of the initial repository. Looking at the data more closely led us to the conclusion that we will really need to put some final effort into revising the current data before publication.

Further, I would need to actually do the migration to know if everything works.

I wanted to ask you anyway whether we should already start setting up the live version, at least the steps that are possible now.

there is a difference between removing a tag and removing evidence with that tag

Yes, that is what I wanted to get at. Thanks for clarification. So the evidences are removed in a first step and the tags in a second. Right?

tutebatti commented 2 years ago

To sum up what was discussed regarding the general procedure and the original question: in the future, we would make changes on the production server at Uni Stuttgart and go through the procedure described under 3. in https://github.com/UniStuttgart-VISUS/damast/issues/91#issuecomment-1033562076 above every so often, e.g., every half a year. Right? If so, we can put this issue on pause - or close it, if the rest of the discussion will be handled via #3.

tutebatti commented 2 years ago

@mfranke93, your thoughts on https://github.com/UniStuttgart-VISUS/damast/issues/91#issuecomment-1035092158?

mfranke93 commented 2 years ago

I wanted to ask you anyway whether we should already start setting up the live version, at least the steps that are possible now.

The earlier, the better, yes. I would suggest you do not open it up to the internet yet, but it would be good if the VM and infrastructure are already prepared. It won't hurt to set it up completely already; applying an update to the software and data later is a quick operation.

Yes, that is what I wanted to get at. Thanks for clarification. So the evidences are removed in a first step and the tags in a second. Right?

Yes. There are a few more operations happening as well, though, and the workflow is not set in stone yet because I am still missing feedback on some relevant issues (like this).

But the core mechanic is:
  1. remove evidences,
  2. remove tags,
  3. remove more evidences (from sources not included),
  4. clean up instance tables (place*, person, religion_, time_instance) that are now orphaned (not connected to evidence),
  5. remove orphaned annotations,
  6. remove sources and documents.
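A rough sketch of the orphan-cleanup part of that sequence, again with assumed names (`person_instance`, `annotation` and their foreign keys are illustrative, not the real schema): instance rows that no remaining evidence references are deleted first, then annotations that pointed at them.

```python
"""Illustrative orphan cleanup for one instance type; the real workflow does
this for all instance tables (place, person, religion, time). Names assumed."""
import psycopg2

CLEANUP_STATEMENTS = [
    # Remove person instances no longer referenced by any remaining evidence.
    """
    DELETE FROM person_instance pi
    WHERE NOT EXISTS (
        SELECT 1 FROM evidence e WHERE e.person_instance_id = pi.id
    );
    """,
    # Remove annotations whose instance rows are now gone.
    """
    DELETE FROM annotation a
    WHERE NOT EXISTS (
        SELECT 1 FROM person_instance pi WHERE pi.annotation_id = a.id
    );
    """,
]

with psycopg2.connect("dbname=damast_public") as conn, conn.cursor() as cur:
    for statement in CLEANUP_STATEMENTS:
        cur.execute(statement)
```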

To sum up what was discussed regarding the general procedure and the original question: in the future, we would make changes on the production server at Uni Stuttgart and go through the procedure described under 3. in https://github.com/UniStuttgart-VISUS/damast/issues/91#issuecomment-1033562076 above every so often, e.g., every half a year. Right? If so, we can put this issue on pause - or close it, if the rest of the discussion will be handled via https://github.com/UniStuttgart-VISUS/damast/issues/3.

I think that makes sense, yes. As soon as everything is resolved and we can actually do the data export and see that everything works, I will also place the appropriate scripts and (brief) documentation somewhere. I do not think they belong in this repo, but maybe we can use the wiki functionality of GitHub for that and put it there as an article?