Open motey opened 4 years ago
We should also consider that not all data loaders have simple update logic, i.e. some have to perform complex operations to determine what needs updating.
Example: The loading script that generates :Fragment nodes with sentences from full text nodes (:BodyText, :PatentAbstract etc.). This has to rerun whenever we have new text, but the text fragments have no primary key except the sentence itself. The loader would either have to check every existing full text and verify that all of its sentences exist (costly), or create :Fragment nodes only for full text nodes that have no :Fragment nodes yet (error prone if the content of the full text node changes).
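One way out of the two bad options above is to derive a synthetic key for each fragment from its parent node and the sentence text. A minimal Python sketch of that idea follows; `fragment_key` and `plan_fragment_updates` are hypothetical names invented for illustration, not part of the actual loading script:

```python
import hashlib


def fragment_key(fulltext_id: str, sentence: str) -> str:
    """Derive a deterministic key for a :Fragment node from its parent
    node's ID and the sentence text, since fragments have no natural
    primary key of their own."""
    return hashlib.sha256(f"{fulltext_id}\x00{sentence}".encode("utf-8")).hexdigest()


def plan_fragment_updates(fulltext_id, sentences, existing_keys):
    """Compare the keys of the fragments a full text *should* have with
    the keys already stored in the graph, and return what to create and
    what to delete. Rerunning on unchanged text is then a no-op."""
    desired = {fragment_key(fulltext_id, s): s for s in sentences}
    to_create = {k: s for k, s in desired.items() if k not in existing_keys}
    to_delete = existing_keys - desired.keys()
    return to_create, to_delete
```

With this scheme a rerun only touches full texts whose sentence set actually changed, instead of rechecking every sentence string or skipping changed nodes entirely.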
Btw gene databases are not static 😄
I was already afraid that's the case, but my brain just wouldn't come up with a good example at 1am :D
Example: The loading script that generates :Fragment nodes with sentences from full text nodes (:BodyText, :PatentAbstract etc.) [...]
imho the data loader is the problem in this case :) What about a flag on already-fragmented text, or simple logic like "when text fragments are attached to the node, no fragmenting is needed anymore"? Changing full text nodes should be rather rare (and if they do change, the changes should be rather subtle).
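The suggested shortcut could be sketched like this in Python. It assumes the loader stores a content hash on the full text node when it fragments it (the `fragmented_hash` property name is a made-up example); that also covers the rare case where a full text changes after fragmenting:

```python
import hashlib


def needs_fragmenting(fulltext: str, fragmented_hash=None) -> bool:
    """Skip a full text node that was already fragmented, unless its
    content changed since then. `fragmented_hash` is a property we
    assume the loader stores on the node after fragmenting it
    (None means the node was never fragmented)."""
    current = hashlib.sha256(fulltext.encode("utf-8")).hexdigest()
    return fragmented_hash != current
```

A plain boolean flag would be even simpler, but the hash variant avoids the "error prone if the content changes" problem mentioned above at the cost of one extra property per node.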
To prevent messed-up data and to enable possible new features, we need to categorize dataloaders:
non-idempotent dataloaders
Dataloaders that only run once, initially. These are for static data like gene databases.
idempotent dataloaders
Dataloaders whose data will evolve and probably change, like publication data in the CORD19 dataset, which iterates from time to time.
Whether a rerun is necessary could be decided by a changed Docker Hub hash (i.e. a changed dataloader image).
service dataloaders
Data that will change regularly in any case, like covid case statistics. These dataloaders should run periodically.
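The three categories above boil down to three different rerun triggers. A minimal sketch of that decision logic, with category names and parameters that are assumptions for illustration only:

```python
def should_run(category: str,
               has_run_before: bool,
               image_hash_changed: bool,
               seconds_since_last_run: float,
               period_s: float = 24 * 3600) -> bool:
    """Decide whether a dataloader should (re)run, per the proposed
    categories (names are assumptions, not fixed terminology):
    - non_idempotent: run exactly once, on initial load
    - idempotent:     rerun only when the dataloader image hash changed
    - service:        rerun on a fixed schedule, regardless of the image
    """
    if category == "non_idempotent":
        return not has_run_before
    if category == "idempotent":
        return image_hash_changed
    if category == "service":
        return seconds_since_last_run >= period_s
    raise ValueError(f"unknown dataloader category: {category}")
```

A scheduler would then store, per loader, the last-run timestamp and the image hash it last ran with, and evaluate this function on each tick.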