covidgraph / motherlode

Pipeline for running all dataloader scripts for covidgraph in a controlled manner.
https://covidgraph.org
MIT License

introduce different dataloader categories #8

Open motey opened 4 years ago

motey commented 4 years ago

To prevent messed-up data and enable possible new features, we need to categorize dataloaders:

non-idempotent dataloader

Dataloaders that run only once, initially. These are for static data like gene databases.

idempotent dataloader

Dataloaders whose data evolves and will probably change, like publication data in the CORD19 dataset, which is updated from time to time.

Whether a rerun is necessary could be decided by a changed Docker Hub hash (i.e. a changed dataloader image).

service dataloaders

Data that will change regularly in any case, like covid case statistics. These dataloaders should run periodically.
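A minimal sketch of how a pipeline could turn these three categories into a run decision. Everything here is an assumption for illustration, not motherlode's actual code: the names `Category`, `LoaderState` and `should_run` are hypothetical, the category names are paraphrased from the descriptions above, and the image digest is assumed to have been fetched from the registry elsewhere.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Category(Enum):
    # Hypothetical names for the three categories proposed in this issue.
    RUN_ONCE = "run once"                # first category: initial load only
    RERUN_ON_CHANGE = "rerun on change"  # second category: rerun when the image changes
    SERVICE = "service"                  # third category: run periodically


@dataclass
class LoaderState:
    category: Category
    image_digest: str                        # digest of the image currently published
    last_run_digest: Optional[str] = None    # digest used by the last successful run
    last_run_at: Optional[datetime] = None   # None means the loader never ran
    interval: timedelta = timedelta(days=1)  # only relevant for SERVICE loaders


def should_run(state: LoaderState, now: datetime) -> bool:
    """Decide whether a dataloader is due, based on its category."""
    if state.last_run_at is None:
        return True  # every loader runs at least once
    if state.category is Category.RUN_ONCE:
        return False
    if state.category is Category.RERUN_ON_CHANGE:
        # A changed image digest is taken as the signal that a rerun is needed.
        return state.image_digest != state.last_run_digest
    # SERVICE loaders run again once their interval has elapsed.
    return now - state.last_run_at >= state.interval
```

A loader that has never run is always due, so the category only matters from the second pipeline run onwards.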

mpreusse commented 4 years ago

We should also consider that not all data loaders have a simple update logic, i.e. they have to perform complex operations to determine the updates.

Example: The loading script that generates :Fragment nodes with sentences from full text nodes (:BodyText, :PatentAbstract etc.). This has to rerun whenever we have new text. But the text fragments have no primary key except for the sentence itself. It would need to check every existing full text and check if all sentences exist (costly) or create the :Fragment nodes only for full text nodes that have no :Fragment nodes yet (error-prone if the content of the full text node changes).
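The trade-off between the two update strategies can be sketched with plain dictionaries standing in for the graph. This is an illustration under assumptions, not the real loading script: `texts` maps a hypothetical node id to its full text, `fragments` maps the same id to the set of sentences already stored, and the sentence splitter is deliberately naive.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def update_full_check(texts, fragments):
    """Strategy A: look at every sentence of every full text.

    Finds sentences added by a content change, but the work grows with the
    whole corpus. Stale sentences are not removed in this sketch.
    Returns the number of sentences that had to be looked at.
    """
    checked = 0
    for node_id, text in texts.items():
        existing = fragments.setdefault(node_id, set())
        for sentence in split_sentences(text):
            checked += 1
            existing.add(sentence)
    return checked


def update_missing_only(texts, fragments):
    """Strategy B: only fragment full texts that have no fragments yet.

    Cheap, but a full text whose content changed later is skipped.
    Returns the number of sentences that had to be looked at.
    """
    checked = 0
    for node_id, text in texts.items():
        if node_id in fragments:
            continue
        sentences = split_sentences(text)
        checked += len(sentences)
        fragments[node_id] = set(sentences)
    return checked
```

Running strategy B again after a full text has gained a sentence looks at nothing and leaves the new sentence without a fragment, while strategy A picks it up at the cost of revisiting every sentence.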

Btw gene databases are not static 😄

motey commented 4 years ago

Btw gene databases are not static 😄

I was already afraid that's the case, but my brain just wouldn't come up with a good example at 1am :D

Example: The loading script that generates :Fragment nodes with sentence from full text nodes(:BodyText, :PatentAbstract etc ) [...]

imho the dataloader is the problem in this case :) What about a flag on already-fragmented text, or a simple rule like "when text fragments are on the node, no fragmenting is needed anymore"? Changes to full text nodes should be rather rare (and if they happen, rather subtle).
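A rough sketch of the flag idea, with one variation that is an assumption and not part of the suggestion above: instead of a boolean, the flag stores a hash of the text at fragmenting time, so the rare content changes would also trigger a rerun. `FullText`, `fragmented_hash` and the naive split on "." are hypothetical stand-ins for the real nodes and splitter.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FullText:
    text: str
    fragments: List[str] = field(default_factory=list)
    # The "flag": hash of the text when it was last fragmented, None if never.
    fragmented_hash: Optional[str] = None


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_fragmenting(node: FullText) -> bool:
    """True if the node was never fragmented or its text changed since."""
    return node.fragmented_hash != content_hash(node.text)


def fragment(node: FullText) -> bool:
    """Rebuild the fragments if needed; return whether any work was done."""
    if not needs_fragmenting(node):
        return False
    # Naive sentence split on '.', for illustration only.
    node.fragments = [s.strip() + "." for s in node.text.split(".") if s.strip()]
    node.fragmented_hash = content_hash(node.text)
    return True
```

With a plain boolean flag the check would be even cheaper, but a changed full text would then never be fragmented again.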