Overview issue [Nicolo` Magro]

Nico7575 commented 6 months ago

Introduction

Hi all! My name is Nicolo` Magro and I am a PhD Candidate in the Department of Radiation Science and Technology.

Describe your research in 2-3 sentences to someone that is not from your field (please avoid abbreviations)

My research is twofold: on the one hand, I want to design a high-temperature gas nuclear reactor for maritime applications. This will be done using a high-fidelity multi-physics code already in use in the Department. On the other hand, I plan to develop a reduced order model for the above-mentioned high-fidelity code, in order to make it faster and easier to use. The two lanes will then link together when I use the new fast code to simulate my maritime reactor.

My research entails the following aspects:

Research Aspect	Answer
Use/collect personal data (health data, interviews, surveys)	No
Use/collect experimental data (lab experiments, measurements with instruments)	No
Collaborate with industry	No
Write/develop software as the main output of the project	Yes
Use code (as in programming) for data analysis	Yes
Work with large data (images, simulation models)	Yes
Other:	N/A

Reflections on the importance of RDM videos

I think reproducibility is fundamental for one major reason: PhD research never ends with just one PhD student; often multiple PhD students work along the same main lane, one after the other. Therefore, it is necessary to keep track of how data are produced during research, to make the work faster and more efficient and to avoid the same errors being made multiple times. This should be a responsibility of the researcher towards those who will follow. A "horror story" I know comes from a friend who is doing his PhD at another university. His laptop got stolen and he had to retrieve the data he had on it through emails and presentations. He was not able to retrieve everything, and this cost him stress and additional time.

What would you like to learn during this course?

I would like to know which tools we have to efficiently keep track of the data from our research and, if possible, how to use the most important ones, like GitHub.

Checklist assignments

[x] Assignment 1: creating a GitHub issue (the overview issue) (before 30 April, 10:00)
[x] Respond to the GitHub issue #125
[x] Assignment 2: Data Flow Map 1 (share a link in this issue before 7 May 13:00). Link: https://surfdrive.surf.nl/files/index.php/s/nOZyktkPoKPOBtK
[x] Provide feedback to at least one Assignment 2 from another participant
[x] Respond to the GitHub discussion #126
[x] Respond to the GitHub discussion #127
[x] Assignment 3: Data Flow Map 2 (share a link in this issue before 21 May 13:00). Link: https://surfdrive.surf.nl/files/index.php/s/9bhOVtQdwjh08IV
[x] Provide feedback to at least one Assignment 3 from another participant
[x] Assignment 4: Data Management Plan (before 29 May 13:00)
[ ] Respond to the GitHub discussion #128 if you have any questions (optional)
[ ] Assignment 5: Data Flow Map 3: submit your slide (before 30 May, 17:00)

anish8194 commented 6 months ago

The DMP is very thorough and covers the simulation codes, the data produced by the codes, and the further analysis of these data. I just have one question. You mention that everything will be stored on the HPC cluster. If you save every small change to the code on the HPC, would the HPC eventually run out of space, or do you not expect that to happen?

EstherPlomp commented 6 months ago

Thanks for sharing assignment 2 @Nico7575! It looks very comprehensive, well done!

Like @anish8194 I have some concerns about the storage on the HPC clusters: I see that this storage solution is not automatically backed up? Having the same files spread over multiple HPCs might also not be ideal. In general, using HPC storage as your main storage solution can be very expensive. What if you run out of storage space indeed (great comment @anish8194!)?
With storage on your laptop, do you mean your local drive only? Note that this location is not automatically backed up!
You may also consider your file naming practices, see the FAIR organisation part of the course!

anish8194 commented 6 months ago

Hi Nicolo,

That is very thorough. You mainly produce code for simulations, and I understand that your codes are limited access, but would it be possible to give some form of access to others so that they can cite your work if they use your code or part of it? Also, if the codes are limited access, how would it work during publication, since most major journals ask for the code when reviewing?

EstherPlomp commented 6 months ago

Thanks for sharing assignment 3 @Nico7575!

Especially your data/folder organisation looks very clear, well done!
Also another note on storage locations: GitHub is not a good solution for confidential code, so if you need to restrict access to any code, GitLab is more suitable: https://gitlab.tudelft.nl/.
The lessons by the Code Refinery on documentation and Jupyter Notebooks may be relevant to you.
Why is your data restricted access? I don't see any collaborations with companies and you also use GitHub to manage the data/code? You will need to have a good reason for data to remain closed, as it is a requirement for your PhD defense to share the data/code underlying the publications!

Great questions by @anish8194 as well!

Nico7575 commented 5 months ago

Thank you @anish8194 for reviewing my assignment 3! Regarding your first question, only members of the research group will have the possibility to use the code (or parts of it). And in that case, they will have full access to it, so no problem with the citation. For the second question, to be honest I do not know how it would work with the review process for publishing. I guess maybe it is possible to send them the code for the review but they have to keep it confidential (?). I will check this more thoroughly.

Nico7575 commented 5 months ago

Thank you @EstherPlomp for the answer and the feedback! In the assignment I wrote GitHub, but apparently the research group uses BitBucket for keeping track of modifications to the code. Would it be a better alternative for confidentiality? Regarding the data being restricted, I had to ask my supervisor about it and he confirmed that access to the in-house code is allowed only to the research group working on this topic (probably because the group is working on the design of a reactor and there could be competition-related issues with making the code public).

EstherPlomp commented 5 months ago

Thanks for your answers @Nico7575!

During the review process it is probably possible to share the code with the reviewers. Work of reviewers is confidential so they should not reuse the materials without your explicit permission.

I'm not sure about BitBucket - I'll check with our Faculty Information Security Officer to see what their opinion is!

No worries about keeping the in-house code behind closed doors, that is understandable! It might be difficult for you to still share other code in that case, but it may still be possible to share the resulting data?

RiteshDas2000 commented 5 months ago

Hi @Nico7575 , your plans look solid. I have a few comments on assignment 2.

I see all of your code is stored on a department PC and GitHub. Perhaps it would also make sense to store some of the code (ones that are 3 GB) on a personal PC as well? That way you have two offline copies, one of which is accessible from everywhere.
The simulations are huge when it comes to data size. Is there a way to access the information online instead of having to download 2 TB? Since these are simple csv or txt files, it should be possible to view them online. Some people may be discouraged from accessing the data because of the huge size if a download is necessary.
I see you use .png files. .png files, especially ones this large, are notoriously slow to include in .tex files, and the compilation time can become very long. Perhaps something like .eps files would make things easier.
I see you have included .tex files. Perhaps having pdf files would also help? You can also use Overleaf to simplify collaborating with co-authors and have the manuscript stored securely on Overleaf servers.

RiteshDas2000 commented 5 months ago

Some similar comments on assignment 3.

Data organization: Everything looks perfect and you seem to have a solid plan for storing the code on GitHub. Since you use csv files, I am assuming you often load these files in the code. Does it cause problems in the code if you have special characters like "_" in the file name?
Documentation: Jupyter notebooks are of course popular and used by many people. But perhaps it's also better to have .py files wherever possible?
Metadata: Perhaps you can include author names even if you are the one who runs the simulations.
File formats: Like earlier, if possible, maybe .py files can also be helpful.

DaphneCette commented 5 months ago

Hi, your assignments look really nice! Here are a few comments:

Assignment 2: Really concise overall. Have you thought about a backup plan for the files related to the publications?

Assignment 3: Great naming convention! You mentioned that your data will only be accessible to the people who are working on similar topics within your department. Does that also include BEP/MEP students?

EstherPlomp / TNW-RDM-101