NMRLipids / Databank

NMRlipids databank
GNU General Public License v3.0

Organizing and documenting the databank #128

Closed ohsOllila closed 4 months ago

ohsOllila commented 1 year ago
    @ohsOllila I have a couple of general comments and questions actually:

I guess it is worth opening new independent issues to discuss those questions, so if you think those discussions are important, let me know and I will open separate issues.

Originally posted by @pbuslaev in https://github.com/NMRLipids/Databank/issues/2#issuecomment-1337308669

ohsOllila commented 1 year ago

@pbuslaev brought up these issues in another issue about PCA analyses https://github.com/NMRLipids/Databank/issues/2. As these issues are becoming timely now, I opened this issue about the general organization and documentation of the databank. This is becoming very important now and we may have to open also other more specific issues, but here are replies to Pavel's questions.

While working on the code, I noted some inconsistencies in data organization. One example is different entries in the mapping files for various force fields (Lipid17 FF has a field RESIDUE, while other force fields do not). I understand the reason for that, but it doesn't seem general and such decisions might lead to limitations in the future.

This is now handled in the analysis code: it first tries to find the residue names in the mapping files and, if they are not found, falls back to the names in the README.yaml files. In principle we could put residue names in the mapping files for all systems, but because this is not needed in most cases, we do it this way to simplify the generation of mapping files, which has to be done manually. I do not think this compromises generality, because it can always be done such that the residues in the mapping file are used when available. It slightly complicates the analysis code, but I think we could provide a readily available tool that returns the residue name.
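The lookup order described above can be sketched in a few lines. This is a hypothetical illustration, not the Databank's actual code: the function name, the dictionary layouts, and the `residue_name` key are invented for the example; only the `RESIDUE` field name comes from the discussion.

```python
# Hypothetical sketch of the lookup order described above: residue names are
# taken from the mapping file when present, otherwise from README.yaml.
# The dictionary layouts below are illustrative, not the exact Databank schema.

def get_residue_name(atom_key, mapping, readme):
    """Return the residue name for an atom, preferring the mapping file."""
    entry = mapping.get(atom_key, {})
    if "RESIDUE" in entry:          # e.g. Lipid17 mapping files carry this field
        return entry["RESIDUE"]
    return readme["residue_name"]   # fall back to the system-level name

# Example with toy data:
mapping = {"M_G3P2_M": {"ATOMNAME": "P31", "RESIDUE": "PC"}}
readme = {"residue_name": "POPC"}

print(get_residue_name("M_G3P2_M", mapping, readme))  # -> PC
print(get_residue_name("M_C1_M", mapping, readme))    # -> POPC
```

Wrapping this in a single shared helper is what would keep the fallback logic out of the individual analysis scripts.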

Therefore, a question is what is the long-term plan for the DataBank? How do you expect others to contribute to the code base in the future? Will things largely depend on you as it is now (at least, as far as I understand the current mechanism, you are the key person to contact for potential contributors)? Probably making some clear instruction on how one can contribute to the code base is worth?

The second question that I have is relating to the test coverage of the code used for analysis. Have you thought about this? In my opinion, this might be quite important in the future, especially if at some point you decide to make some changes in the data structure, it can be quite easy to forget to adjust all the analysis scripts accordingly and unit tests can help to track potential issues.

The idea is definitely to make it such that it does not depend on me. However, there are two different aspects to this.

There is the core of the databank, where the README.yaml files, the databank generation code, and the standard analyses are stored (the https://github.com/NMRLipids/Databank repository). The idea is that these core components will be changed less frequently and curated more carefully. How this will be done in detail is not yet exactly clear, though. The current idea is to keep the current format for some time (unless something critical appears) and then update to the next version once we have more experience in using it and have clear versioning and other plans.

Another aspect is the analyses made using the data in the databank (for example in the repository for the manuscript: https://github.com/NMRLipids/DataBankManuscript). The idea is that such analyses can be done completely independently by anyone. Of course, if we change the core components of the databank, analysis code may no longer work with new versions. In those cases users can either keep using the old version or update their code. For this we need clear versioning, documentation, etc.

So we need two types of instructions: 1) how to contribute to and version the core databank (https://github.com/NMRLipids/Databank), and 2) how to write your own analysis code that uses the databank. In addition, we also need instructions on how to add data to the databank. There are some instructions available starting from the README file in the core databank (https://github.com/NMRLipids/Databank), but these are not fully up to date. There is also some information and exercises available in the material from last year's summer school: https://github.com/NMRLipids/DatabankExercises. However, I have been focused on writing the manuscript and grant applications recently and have not had time to keep these updated. Anyway, this is becoming very timely now. I would be very happy to hear if someone is willing to help write these instructions.

batukav commented 1 year ago

The current idea is to keep the current format for some time (unless something critical appears) and then update to the next version once we have more experience in using it and have clear versioning and other plans.

I think this is reasonable.

So we need two types of instructions: 1) how to contribute to and version the core databank (https://github.com/NMRLipids/Databank), and 2) how to write your own analysis code that uses the databank. In addition, we also need instructions on how to add data to the databank. There are some instructions available starting from the README file in the core databank (https://github.com/NMRLipids/Databank), but these are not fully up to date. There is also some information and exercises available in the material from last year's summer school: https://github.com/NMRLipids/DatabankExercises. However, I have been focused on writing the manuscript and grant applications recently and have not had time to keep these updated. Anyway, this is becoming very timely now. I would be very happy to hear if someone is willing to help write these instructions.

Do we have a deadline for this? I can try helping out with step 1).

ohsOllila commented 1 year ago

Do we have a deadline for this? I can try helping out with step 1).

I do not have a deadline for this, but the sooner it is done the better.

pbuslaev commented 1 year ago

Hi,

This is now handled in the analysis code: it first tries to find the residue names in the mapping files and, if they are not found, falls back to the names in the README.yaml files. In principle we could put residue names in the mapping files for all systems, but because this is not needed in most cases, we do it this way to simplify the generation of mapping files, which has to be done manually. I do not think this compromises generality, because it can always be done such that the residues in the mapping file are used when available. It slightly complicates the analysis code, but I think we could provide a readily available tool that returns the residue name.

In my opinion, mapping files by themselves are extremely useful and can be valuable beyond the Databank; for instance, they allow researchers to directly compare two different lipid force fields. Thus, in my opinion, they should have a general form.

The other question that I actually have here is why mapping files should be generated manually. Wouldn't some graph matching approach help to automate this process?

Another aspect is the analyses made using the data in the databank (for example in the repository for the manuscript: https://github.com/NMRLipids/DataBankManuscript). The idea is that such analyses can be done completely independently by anyone.

But do you then want these external analysis tools to also be added to the main branch, or do you want them to exist only in users' local forks?

In my opinion, it can be good to allow external users to add tools to the main branch, but the code certainly has to be reviewed by someone. One potential solution is to have a pool of core developers, two of whom must approve any pull request. This would help keep the code quality at a certain level and offload some of your current tasks onto others. Of course, some development guides would have to be created, but that is probably something that can be discussed with a broader audience, e.g. at an NMRLipids workshop/school.

Of course, if we change the core components of the databank, analysis code may no longer work with new versions.

In my opinion, separating things in such a way is a very bad idea. If some analysis code breaks after changes in the core, I bet it will never be returned to a working state. This is why I am suggesting test coverage, which is easy to add in my opinion (just run all the code on a representative subset of trajectories via CI/CD pipelines for every pull request).
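The regression-test idea above can be sketched as follows. Everything here is a hypothetical stand-in: `area_per_lipid`, the benchmark data, and the reference values are invented for illustration; the real setup would run the actual analysis scripts on the agreed benchmark trajectories in CI.

```python
# Sketch of the regression-test idea: rerun each core analysis on a small,
# fixed benchmark set and compare against stored reference values. Function
# names and data here are hypothetical placeholders, not the Databank API.

def area_per_lipid(trajectory):
    """Stand-in for a core analysis; returns one number per benchmark system."""
    return sum(trajectory) / len(trajectory)  # toy computation

BENCHMARKS = {
    "popc_charmm36": [0.62, 0.64, 0.63],
    "pope_lipid17":  [0.57, 0.58, 0.59],
}
REFERENCE = {"popc_charmm36": 0.63, "pope_lipid17": 0.58}

def test_core_analyses(tolerance=1e-6):
    for system, traj in BENCHMARKS.items():
        result = area_per_lipid(traj)
        assert abs(result - REFERENCE[system]) < tolerance, system

test_core_analyses()
print("all benchmark analyses match reference values")
```

Run for every pull request, such a check would catch a core change that silently breaks an analysis before it is merged.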

So we need two types of instructions

Again, having a developer's guide and setting up a code review process can solve this in the future. It has to be a rule that every pull request must include documentation and is not merged until the documentation is satisfactory. Then all changes are easy to track.

ohsOllila commented 1 year ago

In my opinion, mapping files by themselves are extremely useful and can be valuable beyond the Databank; for instance, they allow researchers to directly compare two different lipid force fields. Thus, in my opinion, they should have a general form.

I see the point. This argues for adding residue information to all mapping files. The only drawback I can see is that making them requires more work, but this is probably incremental.

The other question that I actually have here is why mapping files should be generated manually. Wouldn't some graph matching approach help to automate this process?

There is a code by @patrickfuchs that does this: https://github.com/patrickfuchs/autoLipMap. However, I think its results need to be checked manually, and we have mostly been making mapping files by hand by modifying an existing file for a closely related structure. My feeling is that it is not trivial to write code that would work universally and robustly for all kinds of lipids.

Another aspect is the analyses made using the data in the databank (for example in the repository for the manuscript: https://github.com/NMRLipids/DataBankManuscript). The idea is that such analyses can be done completely independently by anyone.

But do you then want these external analysis tools to also be added to the main branch, or do you want them to exist only in users' local forks?

I think they should be in completely independent repositories (not in any branch of https://github.com/NMRLipids/Databank). The idea is that the user clones https://github.com/NMRLipids/Databank locally and refers to its local files from code in a separate analysis repository. In such usage, the user only needs a local copy of https://github.com/NMRLipids/Databank and does not need to touch it. Results in the analysis repository can then be organized in the same way as in the main databank, but the idea is that they remain in an independent repository. This is how the analysis for the manuscript is done here: https://github.com/NMRLipids/DataBankManuscript.
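The read-only usage pattern described above might look something like this from the analysis repository's side. The helper function is hypothetical; the `Data/Simulations` path follows the repository layout mentioned later in this thread.

```python
# Minimal sketch of the intended usage pattern: an analysis repository that
# only *reads* a local clone of the Databank and never modifies it. The helper
# name is invented; the Data/Simulations layout follows the core repository.
from pathlib import Path

def iter_simulation_readmes(databank_root):
    """Yield every README.yaml under Data/Simulations in a local clone."""
    sims = Path(databank_root) / "Data" / "Simulations"
    yield from sorted(sims.rglob("README.yaml"))

# Usage (assuming ~/Databank is a local clone of NMRLipids/Databank):
# for readme in iter_simulation_readmes(Path.home() / "Databank"):
#     process(readme)  # the analysis itself lives in the separate repository
```

Because the analysis code only reads files, any number of independent analysis repositories can share one clone of the core databank.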

In my opinion, it can be good to allow external users to add tools to the main branch, but the code certainly has to be reviewed by someone. One potential solution is to have a pool of core developers, two of whom must approve any pull request. This would help keep the code quality at a certain level and offload some of your current tasks onto others. Of course, some development guides would have to be created, but that is probably something that can be discussed with a broader audience, e.g. at an NMRLipids workshop/school.

I hope that the databank will be used for a wide range of applications. I do not want to include all applications in the main branch, in order to keep it concise and easy to maintain. Therefore, the idea is to keep just a few core analyses in the main repository. Currently these are C-H bond order parameters, area per lipid, X-ray scattering form factors, and membrane thickness, and the PCA equilibration analysis is now being added. All other analyses should be done in separate repositories that do not need to be curated by the core developers.

Of course, if we change the core components of the databank, analysis code may no longer work with new versions.

In my opinion, separating things in such a way is a very bad idea. If some analysis code breaks after changes in the core, I bet it will never be returned to a working state. This is why I am suggesting test coverage, which is easy to add in my opinion (just run all the code on a representative subset of trajectories via CI/CD pipelines for every pull request).

My idea is that the above-mentioned core analyses should be separated from the rest, to keep the core databank concise and light to maintain. For the core analysis tools and the functions in the "API", I agree that we should do something like what you suggested. However, we should not do this for all analysis code that uses the databank, and we should enable people to do their own analyses independently without touching the core databank.

So we need two types of instructions

Again, having a developer's guide and setting up a code review process can solve this in the future. It has to be a rule that every pull request must include documentation and is not merged until the documentation is satisfactory. Then all changes are easy to track.

I think we still need two types of instructions: one for people who analyze the databank without contributing to the core, and one for people who contribute to the core databank.

patrickfuchs commented 1 year ago

The other question that I actually have here is why mapping files should be generated manually. Wouldn't some graph matching approach help to automate this process?

There is a code by @patrickfuchs that does this: https://github.com/patrickfuchs/autoLipMap. However, I think its results need to be checked manually, and we have mostly been making mapping files by hand by modifying an existing file for a closely related structure. My feeling is that it is not trivial to write code that would work universally and robustly for all kinds of lipids.

Regarding the automatic generation of mapping files: indeed, autoLipMap uses graph matching to automatically generate dictionaries from universal names to force field names. The only requirement is to first build (usually by hand) a graph of the molecule with universal names. Once that is done for a given molecule, autoLipMap can automatically build a mapping file for any force field using just a PDB file of the molecule. So the only "hard work" is building the initial graph with universal names. And it is quite easy to do for two related molecules; for example, if we have this graph for DPPC, we can build the ones for DOPC or DPPE quite easily.
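To make the graph-matching idea concrete, here is a toy, pure-Python sketch. This is NOT autoLipMap's code: the matcher, the atom names, and the three-atom fragment are all invented for illustration, and real lipids would need element checks beyond the simple one here plus careful handling of symmetry.

```python
# Toy illustration of the graph-matching idea behind autoLipMap (this is NOT
# autoLipMap's actual code): given a hand-built graph with universal atom
# names and a force-field graph with the same connectivity, a backtracking
# search finds a name-to-name correspondence. Chemical elements are used as a
# simple compatibility check; real lipids need more careful symmetry handling.

def match(universal, ff, element):
    """Map universal atom names to force-field names, preserving bonds."""
    # Place high-degree atoms first to prune the search early.
    u_atoms = sorted(universal, key=lambda a: -len(universal[a]))

    def extend(mapping):
        if len(mapping) == len(u_atoms):
            return dict(mapping)
        u = u_atoms[len(mapping)]
        for f in ff:
            if f in mapping.values() or element[u] != element[f]:
                continue
            # every already-mapped neighbour of u must be bonded to f
            if all(mapping[n] in ff[f] for n in universal[u] if n in mapping):
                result = extend({**mapping, u: f})
                if result:
                    return result
        return None

    return extend({})

# Tiny three-carbon fragment (adjacency lists; names are illustrative):
universal = {"M_G1_M": ["M_G2_M"], "M_G2_M": ["M_G1_M", "M_G3_M"], "M_G3_M": ["M_G2_M"]}
ff        = {"C1": ["C2"], "C2": ["C1", "C3"], "C3": ["C2"]}
element   = {"M_G1_M": "C", "M_G2_M": "C", "M_G3_M": "C",
             "C1": "C", "C2": "C", "C3": "C"}
print(match(universal, ff, element))
```

For terminal atoms the match can be ambiguous by symmetry, which is exactly why the output of any such tool still needs a manual (or visual) check, as discussed above.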

Right now, many mapping files already exist, so it is also possible to use autoLipMap the other way around. For example, for cholesterol I was able to reuse a CHARMM36 mapping file written by @ohsOllila (as well as a PDB of cholesterol) to automatically build the graph of universal names. I have not documented this yet, but I can in the coming months.

I agree that the only issue is verifying the results manually. I largely agree with @ohsOllila that it is not easy at all to fully automate this step. But the check can be done visually using nx.draw*() from the networkx module (see for example here), so it is less of a pain than checking a text file.

Don't hesitate to ask if you want more info on autoLipMap or need to use it. Lastly, I saw in a recent post that a new YAML format is used for mapping files. This is something I can implement in the first semester of 2023 if needed.

pbuslaev commented 1 year ago

One of the things we discussed with @batukav and @ohsOllila in the #2 thread is that developing a preprocessor class could be beneficial. I guess it could largely be based on the Parser class from NMRPCA_timerelax.py.
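As a starting point for that discussion, a preprocessor class might look roughly like the following. All names here are invented for illustration; this is not based on the actual Parser class in NMRPCA_timerelax.py, whose design was still to be agreed on.

```python
# Hypothetical sketch of a preprocessor interface (all names invented for
# illustration; the real design was to be discussed, possibly based on the
# Parser class in NMRPCA_timerelax.py).

class Preprocessor:
    """Collects per-system inputs and hands analyses a uniform view of them."""

    def __init__(self, readme, mapping):
        self.readme = readme      # parsed README.yaml contents
        self.mapping = mapping    # parsed mapping file contents

    def residue_names(self):
        """Residue names, preferring per-atom entries in the mapping file."""
        names = {m["RESIDUE"] for m in self.mapping.values() if "RESIDUE" in m}
        return sorted(names) or [self.readme["residue_name"]]

# Toy usage:
pre = Preprocessor({"residue_name": "POPC"}, {"M_G1_M": {"ATOMNAME": "C1"}})
print(pre.residue_names())  # -> ['POPC'] when the mapping has no RESIDUE fields
```

The point of such a class would be that analysis scripts talk only to this one interface, so a future change in the data layout touches a single place.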

To work on this, we probably need to create a new branch and have the rights to push commits to the repository. Currently I do not have such permissions, so I always have to clone the repository to my personal GitHub account and only then create a pull request to the Databank project. Such an approach can be a limitation for collaborative development. Thus, I see two options now:

ohsOllila commented 1 year ago

I have now given write permissions to the repository to @pbuslaev and @batukav. However, I protected the main branch so that merging into it requires a pull request. Let me know if this is OK and whether it works.

pbuslaev commented 1 year ago

I think that should do the job. In general, it might be good to have a short meeting to discuss the implementation plan and decide on some milestones, to avoid working in the wrong direction. What do you think?

batukav commented 1 year ago

Sure, I'd be open to having a discussion about the development.

ohsOllila commented 1 year ago

I am now busy with the grant proposal, but I could have a meeting sometime between 3.2. and 8.2.

ohsOllila commented 1 year ago

I am moving this discussion from issue https://github.com/NMRLipids/Databank/issues/2 to here.

Just to make sure: is there any information that the overlay databank http://databank.nmrlipids.fi/ reads from this databank repo?

I know that it is reading it because I am in contact with the developers. If you want to know more, we can open a new issue for this discussion.

I think it would be nice to know what information it reads and how it reads it, so that we make sure not to break anything while adding new code.

My understanding is that http://databank.nmrlipids.fi/ reads information only from the https://github.com/NMRLipids/Databank/tree/main/Data folder and does not use any scripts from https://github.com/NMRLipids/Databank/tree/main/Scripts. Therefore, if you work with the analysis scripts and tools in https://github.com/NMRLipids/Databank/tree/main/Scripts, you should not break anything.

Also, in general, the structures in the Simulations and experiments folders (i.e., the README.yaml files) under https://github.com/NMRLipids/Databank/tree/main/Data should be changed only when absolutely necessary, and any changes should be considered very carefully. However, as long as you do not touch these, you can write new classes and other things without affecting the functionality of the old code.

pbuslaev commented 1 year ago

I am now busy with the grant proposal, but I could have a meeting sometime between 3.2. and 8.2.

Can we maybe plan a meeting for 06.02? By that time I will try to prepare a plan for how to implement the preprocessing class.

batukav commented 1 year ago

Can we maybe plan a meeting for 06.02? By that time I will try to prepare a plan for how to implement the preprocessing class.

Anything after 16:30 CET should be fine by me.

ohsOllila commented 1 year ago

How about 16:30 CET on 6.2. then?

pbuslaev commented 1 year ago

How about 16:30 CET on 6.2. then?

I am fine with it.

batukav commented 1 year ago

Also fits me


pbuslaev commented 1 year ago

Is there a link for the discussion today?

ohsOllila commented 1 year ago

Here is the meeting information:

Samuli Ollila is inviting you to a scheduled Zoom meeting.

Topic: NMRlipids databank organization
Time: Feb 6, 2023 05:30 PM Helsinki

Join Zoom Meeting https://helsinki.zoom.us/j/66377367308?pwd=WStKTitkaktHSTBwY3ZQUm9tc3Zidz09

Meeting ID: 663 7736 7308

ohsOllila commented 1 year ago

Here are suggestions for the benchmark systems that we discussed yesterday:

f32/81d/f3281d1071b0644075df0156d7233bd1109e99dd/8788d2e0768fe7e7e9784d3990494716071c627c/
051/412/051412bed812cc77b3a749ee330cd740ab78eeef/051412bed812cc77b3a749ee330cd740ab78eeef/
afa/b24/afab24a3270f4289a86544d8f6e9b3bd72395d3e/5b4aadb79ec3804f4ca138b684e1cc23d7e3833b/
03a/98a/03a98aeace25f568384fe703e1f0d3b4ac01af25/5004da904d772ea193d59c0a0c6a3b830b713816/
372/817/3728174b1be9613822cc98a979dcdac176e057c4/9559e5c7df710415a2df60fc78de76d3a78d7699/

We can then change or add something later if needed.

pbuslaev commented 1 year ago

Hi, great! I will then prepare the test benchmark and a test script soon on a separate branch.

ohsOllila commented 4 months ago

I am not sure whether the work discussed here was finished, but many things have changed since then, so I will close this.