EIDA / wfcatalog

EIDA NG WFCatalog implementation

WFCollector: consistency after delete and update operations #4

Open petrrr opened 7 years ago

petrrr commented 7 years ago

We are very interested in keeping our wfcatalog in sync with the files present in the archive. In particular, we were looking into removing and updating documents after files are removed, which occasionally happens due to data curation. So we were positively surprised to see that a delete operation was recently added to the WFCollector.

However, after some code auditing I suspect that the logic of these operations might be flawed and would not ensure consistency between the waveform archive and the wfcatalog. I might be wrong, and this may just be my lack of understanding of the details.

In particular, the delete operation does not seem to update all potentially affected documents:

For the update operation the effect seems to be less severe:

I understand that, especially for high sampling rates, the effect of these "details" might be minor, but for low rates it will have an important impact.

Or am I missing something?

petrrr commented 7 years ago

We just noted that the query we think would be needed to find all documents affected by a file change/update/removal, and which would therefore need an update, has already been implemented. However, it seems not to be used:

https://github.com/EIDA/wfcatalog/blob/master/collector/WFCatalogCollector.py#L1238-L1243
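For illustration, the kind of lookup we have in mind would be something along these lines. This is only a sketch: the collection and field names are guesses at the schema, not the actual implementation.

```python
# Sketch only: "wfrepo", "daily_streams" and "files.name" are guesses
# at the schema used by the Collector, not its actual names.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
daily_streams = client["wfrepo"]["daily_streams"]

def find_affected_documents(filename):
    # Every daily stream document whose metrics were computed using
    # `filename`: the file's own day plus any neighbouring-day documents
    # whose edges bleed into it.
    return list(daily_streams.find({"files.name": filename}))
```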

Jollyfant commented 7 years ago

Hi petrrr, you are right: the Collector assumes that each file in the archive corresponds to one entry in MongoDB. Each day of data is kept in a single file, and its edges may bleed over into the two adjacent day files. This is true for any sampling rate with more than one sample per day.

The update and delete operations go over the input files, identify each document in the database by its file identifier (e.g. NL.HGN..BHZ.D.2017.001), and delete or update that particular document. For an update, all the files used in the calculation are checked for checksum discrepancies. If a change is detected in any of the three potential files, only the document for the input file is updated. The other two documents (relating to the edge files) are NOT updated, regardless of whether they have changed or not. These need to be passed separately to the Collector as input before any change is made to their database entries.
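To illustrate the file/document relationship, here is a rough sketch, not the Collector's actual code, of how one filename maps to the up to three daily documents that may depend on it:

```python
# Illustrative sketch (not the Collector's real code): derive the
# identifiers of the up to three daily documents whose metrics may
# depend on a given day file, i.e. its own day plus the two neighbours.
from datetime import datetime, timedelta

def neighbouring_documents(filename):
    net, sta, loc, cha, qual, year, jday = filename.split(".")
    day = datetime.strptime(f"{year}.{jday}", "%Y.%j")
    ids = []
    for offset in (-1, 0, 1):
        d = day + timedelta(days=offset)
        ids.append(f"{net}.{sta}.{loc}.{cha}.{qual}.{d.strftime('%Y.%j')}")
    return ids

print(neighbouring_documents("NL.HGN..BHZ.D.2017.001"))
# ['NL.HGN..BHZ.D.2016.366', 'NL.HGN..BHZ.D.2017.001', 'NL.HGN..BHZ.D.2017.002']
```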

Initially there was a cascading update design like you expected: it would update every file and its dependents. That is why the query is there. The change was a deliberate choice because it means only the daily stream documents in the database that are given as input (i.e. filenames) to the Collector can be changed, and that made more sense to me from a user perspective. The process is updating database documents (identified by a filename), not files, so to speak. The alternative would be to find all documents that depend on a given file and update all of those, which is quite a trivial change. However, it would not prevent inconsistencies between the archive and the database. A better idea would be to keep track of exactly which files go in and out of the archive using some kind of messaging system, like the one RESIF is using.

If you delete a document, its edge files need to be passed to the update routine to make sure the metrics stay consistent with the archive. It would be an improvement to automatically reprocess edge files after a deletion is requested. I will look into this!
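In sketch form, reusing the neighbour helper from the snippet above, the improvement could look something like this (delete_document and update_document are placeholders, not the Collector's real function names):

```python
def delete_document(filename): ...  # placeholder: drop the file's document
def update_document(filename): ...  # placeholder: recompute a document's metrics

def delete_with_edge_reprocessing(filename):
    # Drop the document for this day file, then re-queue its two edge
    # files so their metrics are recomputed without the deleted data.
    delete_document(filename)
    previous_day, _, next_day = neighbouring_documents(filename)
    for edge_file in (previous_day, next_day):
        update_document(edge_file)
```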

It can be a bit confusing, but I hope that clears up some things about the current state of the Collector.

Best, Mathijs

petrrr commented 7 years ago

Sorry for the late follow-up. Even if I understand that there might be use cases where you want to organize updates around the documents for specified days, I believe that the current behavior and nomenclature are quite misleading and might lead to unexpected results for users.

> The change was a deliberate choice because it means only the daily stream documents in the database that are given as input (i.e. filenames) to the Collector can be changed, and that made more sense to me from a user perspective. The process is updating database documents (identified by a filename), not files, so to speak.

I do not agree that it makes more sense; it is just a different, and perhaps useful, way to operate. But often your starting point for reprocessing will be a set of files which have changed or were removed for whatever reason. In this case you want to update the database to reflect these changes at the file level.

I would expect any operation where you specify files to update/delete all relevant documents in the DB. If instead you want to update specific documents, this should be expressed with a different semantic (document ID, SeedID/date combination, etc.), and the term file should be avoided.

> However, it would not prevent inconsistencies between the archive and the database.

Why is that? I do not see any problem here. Of course you need to track your file changes (add, update, remove).

> A better idea would be to keep track of exactly which files go in and out of the archive using some kind of messaging system, like the one RESIF is using.

I agree that we need to operate "event-driven" (that is what we are after), and maybe use some process queue (less critical for consistency, but useful for operations). Whatever the exact mechanism is, the implementation of the actions to execute on the database could still be based on what was already implemented in the Collector. The logic should be practically the same, independently of how it is triggered.

BTW: Has the RESIF solution been made available somewhere?

Jollyfant commented 7 years ago

> I do not agree that it makes more sense; it is just a different, and perhaps useful, way to operate. But often your starting point for reprocessing will be a set of files which have changed or were removed for whatever reason. In this case you want to update the database to reflect these changes at the file level.

> I would expect any operation where you specify files to update/delete all relevant documents in the DB. If instead you want to update specific documents, this should be expressed with a different semantic (document ID, SeedID/date combination, etc.), and the term file should be avoided.

I think you make a good case in the file/document semantics discussion. I've updated the source code to reflect the following changes:

- --update now collects all the documents that depend on the given file and updates these. However, only the checksum of the given file is checked: 005 may depend on 004, but --update 004 only checks for changes in 004 itself.
- --delete will automatically update neighbouring files if they are not included in the deletion.
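In sketch form the new --update behaviour boils down to something like this; all helper names are placeholders, not the actual functions in the source:

```python
def compute_checksum(filename): ...         # placeholder: hash the file on disk
def stored_checksum(filename): ...          # placeholder: checksum kept in the DB
def find_affected_documents(filename): ...  # placeholder: docs depending on the file
def recompute_metrics(document): ...        # placeholder: rebuild a document's metrics

def update(filename):
    # Only the given file's own checksum is compared; a change in a file
    # it depends on (e.g. 004 for 005) is not detected by --update 005.
    if compute_checksum(filename) == stored_checksum(filename):
        return
    # The file changed: refresh every document that depends on it, so
    # --update 004 also refreshes the 005 document.
    for document in find_affected_documents(filename):
        recompute_metrics(document)
```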

> Why is that? I do not see any problem here. Of course you need to track your file changes (add, update, remove).

Because, like you said, you need to track changes manually. Detecting them is not the job of the Collector; it could not do so without a full file system scan.

> I agree that we need to operate "event-driven" (that is what we are after), and maybe use some process queue (less critical for consistency, but useful for operations). Whatever the exact mechanism is, the implementation of the actions to execute on the database could still be based on what was already implemented in the Collector. The logic should be practically the same, independently of how it is triggered.

> BTW: Has the RESIF solution been made available somewhere?

I think it's integrated into their system and not available as a loose component.