Closed SDev5 closed 1 year ago
Hi Sercan (@SDev5), thank you very much for this complete assignment! It is very good, so I do not have much to comment on :-) The list of datasets is very comprehensive and it is really great that you dedicated the time to be very thorough. It is also very good that you separated data from code, that you identified the different types of data, and that you separated the code according to its specific use. Well done! It seems that your group has a very well-thought-out storage and backup strategy! That's also great! I find the use of tools like JupyterHub and GitLab in synchronization with the group server great. Also very important is that the computers attached to the instruments are also synced with the server. And I think using the Project Data (U:) drive for the group is a very good practice. I am curious whether you have a predefined folder structure for that? Data organization will be part of the assignment of week 2, so probably I will see it then ;-) Very well done! Did you ever reflect before on the types of data you work with and on listing them?
I am Paula by the way! I am the instructor supporting Esther in this run of the course :-D
Hi Paula, thank you very much for your feedback. Before listing all the data types, I sat down and went through our typical research workflow. From there I identified all the data types I will be working with. So in that sense I did reflect on the data types, but not prior to this course.
Maybe to clear something up, in our group we do not use the U: drive for data storage. The only data that is stored on the U: drive is the cleanroom inspection data which is collected on the computers of the cleanroom, the Van Leeuwenhoek Laboratory. The data collected there is not stored on the U: drive of our group, but on the U: drive of the Kavli Nanolab. This data is then again synchronized to our NAS such that we can access it from outside the cleanroom on all of our computers.
I have also asked my supervisor for some clarification regarding the JupyterHub server and the backups of our NAS. The JupyterHub server is apparently hosted by the faculty, but our directories are mounted on the NAS. The NAS itself is backed up nightly to university servers, and these university servers are in turn backed up in a separate location, providing us with a very robust and easily accessible backup and storage system.
Hi Sercan, thanks for the clarification! I didn't know that the Kavli Nanolab was another lab, independent of your group :-)
It is great that you now know the backup strategy behind the infrastructure you use! And it is also great to know that the NAS server complies with the 3-2-1 rule :-)
Readme file for a dataset that I measured at the end of 2022/start of 2023: https://tud365-my.sharepoint.com/:t:/g/personal/sercandeve_tudelft_nl/ES6FuCwcAgFOkWndYkab1vkBmKIJ2TxoQRbyoeDpXlH0_g?e=tdI8eD
Hi Sercan, I have looked at your dataflow map. You have split everything out very nicely. I was wondering how you store your metadata (the parameters you set for measurements, but also in your analysis code)? (I struggle with finding a good method for this myself :) )
Hi Annick, thanks for your feedback. For our measurements, our measurement scripts contain code that automatically generates a .txt file storing the measurement parameters (the axes and set parameter ranges) as well as the names of all collected variables. This metadata .txt file is stored in the same folder as the measurement data, and the information it contains allows us to automatically plot the measured data with the correct axis names and ranges. I have added an example of such a .txt file below: https://tud365-my.sharepoint.com/:t:/g/personal/sercandeve_tudelft_nl/EYl6p1chAyxBp0CyNVmq9P4BVjbyhaNoU6mO0blqAEK3Tg?e=QClhPQ
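As a rough illustration of the idea (not our group's actual script; the function name, axis names, and file layout below are all hypothetical), generating such a metadata file could look something like this:

```python
import os

def write_measurement_metadata(folder, sweep_axes, variables):
    """Write a plain-text metadata file next to the measurement data.

    sweep_axes: dict mapping axis name -> (start, stop, n_points)
    variables:  names of the quantities recorded during the sweep
    """
    lines = ["# Measurement metadata"]
    for name, (start, stop, n_points) in sweep_axes.items():
        lines.append(f"axis: {name} = {start} .. {stop} ({n_points} points)")
    for var in variables:
        lines.append(f"variable: {var}")
    path = os.path.join(folder, "meta.txt")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return path

# Hypothetical example: a gate-voltage sweep recording two quantities
path = write_measurement_metadata(
    ".",
    sweep_axes={"V_gate (V)": (-1.0, 1.0, 201)},
    variables=["I (A)", "dI/dV (S)"],
)
```

A plotting routine can then parse the `axis:` lines back to label and scale the figure automatically, which is the main payoff of keeping this file next to the data.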
Regarding analysis code: personally, I work in Jupyter notebooks, so I can write documentation directly between code cells. I think this works well for now and allows me to easily reproduce results. Currently, I don't store separate metadata for processed or analysed data outside of the Jupyter notebooks, and I don't think we have a standard way of doing this in our group. But it might be good to develop one, and if I did, I think I would use a similar format to the measurement metadata.
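If we did develop such a standard, a minimal sketch of what it could look like, called from the last cell of a notebook, might be (every name, field, and file path here is hypothetical, just to show the shape of the idea):

```python
import datetime
import json
import os

def save_analysis_metadata(out_folder, params, source_files):
    """Store the parameters of an analysis next to its outputs,
    mirroring the per-folder style of our measurement metadata."""
    meta = {
        "created": datetime.datetime.now().isoformat(timespec="seconds"),
        "source_files": source_files,   # which raw data went in
        "parameters": params,           # settings used in the analysis
    }
    path = os.path.join(out_folder, "analysis_meta.txt")
    with open(path, "w") as f:
        json.dump(meta, f, indent=2)
    return path

# Hypothetical example: record what went into a fit
meta_path = save_analysis_metadata(
    ".",
    params={"fit_model": "lorentzian", "background_order": 1},
    source_files=["run42/data.dat"],
)
```

Writing the file as JSON keeps it human-readable in any text editor while still being trivially machine-readable later.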
Hi @SDev5 Sercan! Very well done again! Thank you very much for taking the time to reflect on each theme for each data type and code! :-) I actually have very few comments. I do write a lot, but they are only a few comments :-P
I think it is excellent that you could adopt Jupyter notebooks for documenting your data, code, and data analysis. Jupyter notebooks accompanied by complete information, in the form of Markdown comments, docstrings for the functions, and maybe some extra metadata, are a great way for others to reproduce your work. I know some researchers who always publish their Jupyter notebooks with the data analysis/workflow used to create the figures presented in their articles. Maybe that is something you could think about ;-)

I especially like that you thought of possible solutions for implementing your folder structure, and also about ways to improve code versioning for your group to adopt! I really hope they adopt your idea of transferring the data-collection scripts to GitLab. I think it should be possible to add shortcuts to the folders where data is automatically stored. If not, you can keep the folder in your structure and create a ReadMe file (.txt) to provide a bit of context on the measurement and to indicate the file path for finding the right data for that project.

About file formats: if licensed software needs to be called when creating your simulations, it will be important to add that information to the documentation of the code, i.e. which software it is, maybe the version, and the distributor. For the measurement data, you say that it is collected in .dat, which is open, but that it could be converted to .csv. Do you think this conversion provides an advantage for a potential re-user? Because if .csv makes no difference compared to .dat, then maybe you can save yourself that step?

A general question about access to the design-related data: is it normally not publicly available because you or your group expect to exploit it commercially or patent it? Within open science, some researchers have engaged in what is called Open Hardware, where design documentation and code are made available to anybody who would like to fabricate the devices.
I am not saying your group should go that way; I guess there are reasons to keep that data/code under restricted access. But if you want to know more about Open Hardware, there is a very nice community at TU Delft :-) https://www.tudelft.nl/open-hardware

About data publication: it is great news that your group uses Zenodo, it is a very good choice for data publication. About the license, I would like to make three remarks:
Very well done! I hope the exercise was useful!
Thanks for sharing your README file @SDev5! It looks great, well done!
A couple of things:
I hope this helps!
Introduction
Hi all, I'm Sercan Deve and I'm doing my PhD in the department of Quantum Nanoscience on superconducting quantum circuits.
Reflections on the importance of RDM videos
I think that the facilities in our lab with regard to data management limit the risk of losing large amounts of data, since all of our data is stored not only locally on the lab computers but also on a NAS (network-attached storage) device, which is itself backed up. Additionally, we are encouraged to push all relevant files and code related to our projects to GitLab. Personally, I also sync all my files to OneDrive. Nonetheless, these videos reminded me again of the importance of good data management and proper documentation, something that I should sometimes pay more attention to. The five reasons for reproducible data by Markowetz were really informative, and he definitely convinced me that working reproducibly is the way to go. My data horror story is related to a library in our lab that we use to write our measurement code. This library is on git, but every measurement PC has its own version with local changes. Another story is from a recent measurement run I did, in which I made a mess of all the different measurement file names...
What would you like to learn during this course?
I am interested in learning more about the FAIR data principles and how to apply them to my research data. This is a term that I have heard several times, but I never found the time to look more into it. Additionally, I would like to learn the best practices for data management and see if these practices are implemented in our lab. If not, I would like to implement them.
Checklist assignments