covidgraph / motherlode

Pipeline for running all dataloader scripts for covidgraph in a controlled manner.
https://covidgraph.org
MIT License

Refactor/Rewrite motherlode #5

Open motey opened 4 years ago

motey commented 4 years ago

Current Status

Motherlode is a proof-of-concept script at the moment. It works, but its structure is not suited to large-scale extension in the future.

Desired Status

Motherlode should be broken down into separate classes and offer easy extensibility and a more pleasant onboarding for new devs.

Tasks

[ ] Discuss possible structure/technologies with focus on future features
[ ] Declare/define and document structure
[ ] Implement changes

Issues to take into account:
* https://github.com/covidgraph/motherlode/issues/8
* https://github.com/covidgraph/motherlode/issues/7

Hint: no holds barred. A change of platform/language is possible if it serves the goal. The discussion is open.

frankschmitt commented 4 years ago

Some ideas off the top of my head:

motey commented 4 years ago
* use Neo4J to keep track of dependencies / order the source systems (currently, motherlode determines this itself, but since we require a running Neo4J instance anyway, we can just as well use its graph algorithms for tracking this info)

Had the same idea, but this would make bootstrapping motherlode harder. Also, wiping the database and refilling it via motherlode would no longer be possible.

On the other hand, having the information about which datasources are loaded (and even the possibility to connect data to its datasource) is pretty compelling. Maybe a hybrid approach would be a good solution. This could be achieved by extending the :LoadingLog functionality ( https://github.com/covidgraph/motherlode/blob/a2560ef1ffde48efba6bffce106e146fb0ec0e86/motherlode/main.py#L38 )
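To make the hybrid idea a bit more concrete, here is a minimal sketch, not the existing motherlode code. It assumes the dependency order is still computed inside motherlode (so bootstrapping and wiping the database keep working), while the set of already loaded datasources could come from the database, for example from the :LoadingLog nodes. The loader names, the `DEPENDENCIES` mapping and the `loading_order` function are hypothetical, and `graphlib` requires Python 3.9+.

```python
from graphlib import TopologicalSorter

# Hypothetical loaders: each name maps to the set of loaders it depends on.
DEPENDENCIES = {
    "loader_a": set(),
    "loader_b": {"loader_a"},
    "loader_c": {"loader_a"},
    "loader_d": {"loader_b", "loader_c"},
}


def loading_order(dependencies, already_loaded=frozenset()):
    """Return loader names in dependency order, skipping loaded ones.

    `already_loaded` is assumed to be filled from the database
    (e.g. the :LoadingLog nodes); here it is just a plain set.
    """
    # static_order() raises graphlib.CycleError on circular dependencies.
    order = TopologicalSorter(dependencies).static_order()
    return [name for name in order if name not in already_loaded]


if __name__ == "__main__":
    print(loading_order(DEPENDENCIES))
    print(loading_order(DEPENDENCIES, already_loaded={"loader_a"}))
```

With an empty database `already_loaded` is simply empty, so a wipe and full reload would still work without asking Neo4J for the order.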

* parallelize loading. Currently, all loaders run strictly sequentially; loaders that don't depend on each other can be run in parallel (if the Neo4J instance and Docker host can handle the load)

YES! I will create an issue for that.
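As a starting point for that issue, a rough sketch of how dependency-aware parallel loading could look. This is an assumption about a possible design, not how motherlode works today: `run_loaders_parallel` and the `run_loader` callback are hypothetical names, and in practice `run_loader` would start the Docker container of the given dataloader. `max_workers` is the knob for limiting the load on the Neo4J instance and the Docker host.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_loaders_parallel(dependencies, run_loader, max_workers=2):
    """Run each loader as soon as all loaders it depends on have finished.

    `dependencies` maps a loader name to the set of names it depends on.
    `run_loader` is a hypothetical callable that runs one loader by name.
    Returns the loader names in the order their completion was processed.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        running = {}  # future -> loader name
        while sorter.is_active():
            # Submit every loader whose dependencies are all done.
            for name in sorter.get_ready():
                running[pool.submit(run_loader, name)] = name
            # Block until at least one running loader completes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                future.result()  # re-raise any exception from the loader
                sorter.done(name)  # unblocks loaders depending on `name`
                finished.append(name)
    return finished
```

Loaders without a dependency between them can overlap, while a loader only starts after everything it depends on has completed. A failing loader stops the run, because its exception is re-raised in the main thread.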