Restructuring the repo - Githubissues

batukav commented 4 weeks ago

Hi all,

This issue was brought up by @comcon1 in Issue #195, and it has also been on my mind for a while.

Currently we are tracking and versioning both the code (in the Scripts directory) and the output (in the Data directory) together. This could make tracing the history rather difficult in the future and could cause headaches with merge requests. Hence, I'd like to discuss how we can restructure the Databank so that we can version and track the code and data separately.

One potential solution is to use Git submodules, as @comcon1 suggested. I don't know enough about what submodules can do.

Another potential solution is to move the data to cloud storage and version it using a data versioning tool like DVC. This would mean that the data and code are hosted in different places, which might be a bit more difficult to maintain but is a neater solution.

Since we also have the GUI, we should be careful not to break its integration. Unfortunately, I don't know much about how the GUI integration works, so I would also like to gather some info about it here.

@markussmiettinen, @ohsOllila any suggestions?

comcon1 commented 4 weeks ago

The DB on the GUI server is updated by a dedicated script, which is now run by cron. With Data as a submodule, you can update the Data part independently with git submodule update, and the folder structure will stay the same. Actually, on the server side we will only need cron to update the Data part, whereas the Scripts part can be updated manually (which is even better). You can make a toy repository to check how it works.
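A minimal sketch of the cron-driven update described above, assuming the data lives in a submodule at the path Data inside a server-side checkout. The function names and the checkout location are hypothetical; only the `git submodule update` subcommand and its `--init` and `--remote` options are real Git.

```python
import subprocess


def data_update_command(repo_dir, submodule_path="Data"):
    """Build the git command that refreshes only one submodule.

    `--remote` moves the submodule to the tip of its tracked remote branch;
    `--init` registers it first if this checkout has never initialized it.
    """
    return [
        "git", "-C", str(repo_dir),
        "submodule", "update", "--init", "--remote", "--", submodule_path,
    ]


def update_data(repo_dir, submodule_path="Data", dry_run=False):
    """Run the update, or only return the command when dry_run is True."""
    cmd = data_update_command(repo_dir, submodule_path)
    if not dry_run:
        # check=True raises CalledProcessError if git exits with an error.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical server-side checkout; print the command instead of running it.
    print(" ".join(update_data("/srv/databank", dry_run=True)))
```

A cron entry could call such a script on a schedule, while the Scripts part stays at whatever commit was checked out manually.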

If we move Data to another repository, we will also have to make some changes in the GUI part, where the paths of Git-stored files are computed on the client side (when plotting graphs, for example). But we can schedule that with @fsuarezleston. We can first copy Data to the other repo, then wait for the fixes, and only then remove Data from the main repo.

Having Data in the cloud is also a temporary solution. I don't see much difference between GitHub or GitLab versioning and DVC. But I have never used DVC, so I don't have a very strong opinion. But I suppose that the solid solution is to store the data in a database and access it via some service. I suppose repositories like PDB are organized like that. No one stores this kind of data just as versioned files. But that is too hard for a pet project; it should be outsourced to engineers after getting some funding for the entire project.

But removing data from the code repository is in any case an inevitable step on the way to the solid solution.

One thing we should agree on: where do we store auxiliary files like mappings and UA-jsons? I suppose they should be in the code part. As for the info files, I would just remove them. I don't see any benefit in storing them.

batukav commented 3 weeks ago

Just out of curiosity: since the DB for the GUI is hosted somewhere, can't we also store the data on the same server?

But I suppose that the solid solution is to store the data in a database and access it via some service. I suppose repositories like PDB are organized like that. No one stores this kind of data just as versioned files. But that is too hard for a pet project; it should be outsourced to engineers after getting some funding for the entire project.

Same question as above: since we already have the DB that serves the GUI, can't we simply build a CLI/API around it?

comcon1 commented 3 weeks ago

It's not a problem to host a database somewhere, including on this server. The problem is that this database was built just for GUI presentation. It doesn't contain the complete data from the Data/simulations subfolder. It contains some links, and when graphs are plotted, the data are downloaded on the JS side directly from GitHub. You can see everything that connects the Databank and the GUI part here: https://github.com/comcon1/Databank/tree/modularize-r2/Scripts/updateGUI It's a branch I'm working on right now, but the GUI part is almost unchanged.

I also put the DB schema there in an SQL file, so you can see how the tables are organized.

The current database also doesn't support versioning. So it's quite a lot of work to:

make this DB
build a CLI around it
make the NMRlipids scripts work with it
make the GUI work with it
set up backup copies of the DB on a third-party side.

Everything is possible, but it's quite a lot of work. In any case, the Data should be separated from the scripts. If we agree on that, let's at least do that. Then we can plan the next steps.

comcon1 commented 3 weeks ago

I rechecked the schema. In principle, it already contains all the metadata on the computed JSON data. What it doesn't store is the JSON datafiles themselves. They could be stored in serialized form in a separate table (or even a separate database) in a BLOB field. There is also MongoDB, which stores JSON documents as BSON and returns them very quickly because it's specifically designed for that.
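To illustrate the BLOB idea, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whatever engine the GUI database actually uses. The table name, column names, and sample payload are all hypothetical and are not taken from the real schema.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) datafile table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS json_datafile ("
        " simulation_id INTEGER NOT NULL,"
        " name TEXT NOT NULL,"
        " payload BLOB NOT NULL,"
        " PRIMARY KEY (simulation_id, name))"
    )
    return conn


def put_json(conn, simulation_id, name, data):
    """Serialize `data` to UTF-8 JSON and store it as a BLOB."""
    blob = json.dumps(data).encode("utf-8")
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO json_datafile VALUES (?, ?, ?)",
            (simulation_id, name, blob),
        )


def get_json(conn, simulation_id, name):
    """Return the decoded JSON object, or None if it is not stored."""
    row = conn.execute(
        "SELECT payload FROM json_datafile"
        " WHERE simulation_id = ? AND name = ?",
        (simulation_id, name),
    ).fetchone()
    return None if row is None else json.loads(row[0].decode("utf-8"))


if __name__ == "__main__":
    conn = open_store()
    # Made-up payload, only to show the round trip.
    put_json(conn, 1, "example.json", {"x": [1, 2], "y": [0.5, 0.25]})
    print(get_json(conn, 1, "example.json"))
```

Keeping the payloads in their own table leaves the existing metadata tables untouched, which fits the gradual migration being discussed.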

We can use the same strategy to mine the analysis data and then add a layer that accepts JSONs into the database (in GitHub Actions or in a server-side cron script). We can occasionally synchronize this database with the GitHub repository in either direction. In the end, we can get rid of the GitHub data repository altogether. This strategy allows us to move to it gradually without refactoring the whole project. In any case, we can separate the Data as the first step.

batukav commented 3 weeks ago

It makes sense to initially create a submodule for the Data. If nobody objects, I'm okay with doing this and can handle it in the coming days.

Just for clarity, whenever I mention GUI I have this in mind.

Let's plan the next steps after separating the Data and updating everything with the results from Issue #195.

ohsOllila commented 3 weeks ago

If we move Data to another repository, we will also have to make some changes in GUI part, when they recalculate paths of Git-stored files at client-level (when plotting graphs, for example).

What if we move the code to another repo and keep the data in the current repo? My feeling is that the GUI would then not need updates because it is just plotting the data. Or does it also use some of the code?

Regarding alternative storage for the data, I think it should be a solution that remains stable independently of any of us or any other individual (such as a git+Zenodo version). For example, the current GUI is available only as long as someone pays for the server and the company running it remains active. For me, the simplest next step would be separation into two git repositories. However, I understand that git may not be the best format for data in the long run.

comcon1 commented 3 weeks ago

If we create a submodule, then the Data repository should contain everything inside the current Data folder. So the paths will be broken anyway and will have to be fixed in the configs of the GUI code. If the path is hardcoded in the GUI, that is bad in any case and should be fixed.
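As an illustration of keeping the data location in configuration rather than hardcoding it, here is a minimal Python sketch. The environment variable DATABANK_DATA_DIR, the data_dir config key, and the example file path are hypothetical; the same idea applies in whichever language the GUI code uses.

```python
import os
from pathlib import Path


def data_root(config=None, env=None):
    """Resolve the Data directory from the environment, config, or a default."""
    env = os.environ if env is None else env
    config = {} if config is None else config
    # Precedence: environment variable, then config entry, then the old layout.
    root = env.get("DATABANK_DATA_DIR") or config.get("data_dir") or "Data"
    return Path(root)


def data_file(relative_path, config=None, env=None):
    """Join a path relative to the Data directory onto the resolved root."""
    return data_root(config, env) / relative_path


if __name__ == "__main__":
    # With no override, paths resolve against the current "Data" folder.
    print(data_file("Simulations/example.json", env={}))
    # After the split, only the configured root has to change.
    print(data_file("Simulations/example.json", config={"data_dir": "/mnt/data"}, env={}))
```

With this kind of indirection, moving Data into a submodule or a separate checkout means changing one setting instead of every place a path is built.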

I will split off the submodule in my local version to check how it works. You will be able to clone my version and play with it.

comcon1 commented 3 weeks ago

I have done it cleanly within our local GitLab. Please have a look: https://git.app.uib.no/Aleksei.Nesterenko/Databank

The main repo is now 55 MB. Data is a submodule of 370 MB. The histories are fully separated. I did that using the git-filter-repo utility, which rewrites history.
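For readers unfamiliar with git-filter-repo, the sketch below shows one way such a split can be expressed. It is not necessarily the invocation used here; the helper name is hypothetical, while `--subdirectory-filter`, `--path`, and `--invert-paths` are documented git-filter-repo options.

```python
def split_history_commands(data_dir="Data"):
    """Return two git-filter-repo invocations for a code/data split.

    Each command is meant to be run in its own fresh clone, because
    git-filter-repo rewrites history in place.
    """
    # Clone 1: keep only data_dir and make its contents the repository root.
    data_only = ["git", "filter-repo", "--subdirectory-filter", data_dir]
    # Clone 2: keep everything except data_dir.
    code_only = ["git", "filter-repo", "--path", data_dir, "--invert-paths"]
    return data_only, code_only


if __name__ == "__main__":
    for cmd in split_history_commands():
        print(" ".join(cmd))
```

The resulting data repository could then be attached to the code repository with git submodule add, pointing at wherever the data repository ends up being hosted.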

You can clone and then do

git submodule update --init --recursive

Or you can clone with the flag:

git clone --recurse-submodules https://path-to-repo

This is based on my current development branch, not the main branch. But it doesn't matter for viewing.

NMRLipids / Databank

Restructuring the repo #200