Assignment 1 Sander Senhorst

Hihaatje commented 1 year ago

Introduction

Hey everyone, my name is Sander Senhorst. I'm a PhD student at the Optics group in the faculty of applied sciences. In my free time I like to play the piano or automate various things in my home (my household appliances are getting smarter by the day :) )

Describe your research in 2-3 sentences to someone that is not from your field (please avoid abbreviations)

I am making a nanoscope (think microscope but then a bit more nano) for wavelengths in the extreme ultraviolet range (10 to 40 nanometers). We have a setup where we first generate this light (already pretty complicated) and then illuminate a sample with it, with the goal of making an image of (reconstructing) the sample. My specific project is oriented towards extracting everything from the data we gather, as there is likely more information present in the data (like three-dimensional data or chemical specificity) than we are currently able to reconstruct.

My research entails the following aspects:

Research Aspect	Answer
Use/collect personal data (health data, interviews, surveys)	No
Use/collect experimental data (lab experiments, measurements with instruments)	Yes
Collaborate with industry	Yes
Write/develop software as the main output of the project	Yes
Use code (as in programming) for data analysis	Yes
Work with large data (images, simulation models)	Yes
Other:	N/A

Reflections on the importance of RDM videos

The video (and the ReproGame) were familiar, as I think is the case for any of us. I've recently been asked about my own work from a year ago (did you have this driver somewhere? What was this method again? etc.) and thought I had put it all in my thesis. Well, as it was difficult enough for me to extract it, evidently not. I already used git in this project, so I could find all my code without any issue, but the packages used were proprietary from our instrument manufacturer, so in the end it was still not reproducible.

What would you like to learn during this course?

One thing I missed so far, but I see it is noted in the checklist, is collaboration with industry. My project is heavily sponsored by industry, which means I'm subject to all sorts of NDAs and conditions for releasing my work. All of my research output should be licensable by the users of the project, which means that sharing any meaningful output first requires jumping through a series of hoops which could easily take several months. Additionally, there are also the TUD valorisation interests at play. From what I've seen so far, this makes reproducibility a nightmare, since publishing on what we do and fully sharing our methods are often mutually incompatible. I would like to find a way to be able to still do this, as it sometimes feels like I'm not a scientist at all, but rather an underpaid member of some R&D department.

Checklist assignments

[x] Assignment 1: creating a GitHub issue (before Class 1)
[x] Respond to the GitHub issue on 'data challenge'
[x] Assignment 2: Data Flow Map 1 (share a link in this issue before 17 May 13:00). Link
[x] Provide feedback to at least one Assignment 2 from another participant
[x] Respond to the GitHub discussion on 'licenses'
[x] Respond to the GitHub discussion on 'folder structure'
[x] Respond to this GitHub issue with your readme file
[x] Provide feedback to at least one readme file from another participant
[x] Assignment 3: Data Flow Map 2 (share a link in this issue before 31 May 13:00). Link
[x] Provide feedback to at least one Assignment 3 from another participant
[x] Assignment 4: Data Management Plan (before Class 2)
[x] Respond to the GitHub discussion on 'Data Management Plans' if you have any questions (optional)
[x] Assignment 5: Data Flow Map 3: submit your slide (before Class 2) link

Hihaatje commented 1 year ago

My data flow map can be found here: https://surfdrive.surf.nl/files/index.php/s/mTtHnecfL9vF0hI

EstherPlomp commented 1 year ago

Hi @Hihaatje! Thanks for handing in your assignment 2! I think you did a great and extensive job, so I only have a couple of questions/suggestions:

Do I interpret the flow correctly that you'll be sharing the 'raw/unprocessed' data as well? I like how you have the multiple publish arrows lined up, so it is easy to see what you plan to share at the end of the project!
Do you have an estimate of your data size? You don't need to know exactly how much data you'll generate, but there is a difference between managing 1 TB of data and hundreds of TB of data. The project drive is indeed the best solution for large data, as well as a safe storage solution for anything IP sensitive!
If ROI = region of interest: how do you define/analyse these? What documentation needs to accompany this? The follow up assignment 3 will go a bit deeper into that!

Well done!

Hihaatje commented 1 year ago

Hey Esther,

To answer your questions:

Yes, occasionally (so far mostly on request) we share the raw data with other groups to help with algorithm development. In turn we also use their data for the same purposes.
A single 'ptychogram' is currently about 20 GB in size. I expect we will take at most about 100 datasets, so this will be on the order of single terabytes of data. That's only the ptychograms though. A single reconstruction is ~1 GB, so if we average maybe 5 reconstructions per dataset it doesn't affect the total size too much.
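
As a rough sketch of that estimate, assuming decimal units (1 TB = 1000 GB) and taking the numbers above at face value (they are guesses, not measurements):

```python
# Rough storage estimate using the figures quoted in this comment.
PTYCHOGRAM_GB = 20        # approximate size of one ptychogram (dataset)
N_DATASETS = 100          # expected maximum number of datasets
RECONSTRUCTION_GB = 1     # approximate size of one reconstruction
RECONS_PER_DATASET = 5    # assumed average number of reconstructions per dataset


def estimate_storage_gb(n_datasets, ptychogram_gb, recons_per_dataset, reconstruction_gb):
    """Return (raw_gb, reconstructions_gb, total_gb) for the given assumptions."""
    raw = n_datasets * ptychogram_gb
    recon = n_datasets * recons_per_dataset * reconstruction_gb
    return raw, recon, raw + recon


raw_gb, recon_gb, total_gb = estimate_storage_gb(
    N_DATASETS, PTYCHOGRAM_GB, RECONS_PER_DATASET, RECONSTRUCTION_GB
)
print(f"ptychograms:     {raw_gb / 1000:.1f} TB")
print(f"reconstructions: {recon_gb / 1000:.1f} TB")
print(f"total:           {total_gb / 1000:.1f} TB")
```

With these numbers the ptychograms come to about 2 TB and the reconstructions add about 0.5 TB on top of that.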

By the way, in looking up the reconstruction size I learned that currently the datasets are all in a shared drive, but the reconstructions are only stored on our high-performance computer and are only accessible by the users who created them. So I'll definitely be looking for some better solution for this. The question is maybe how to manage this well; when we do reconstructions, we don't want to be limited by write speed to a network drive. I'd at least like a shared folder between several users where the reconstructions are stored, but then how to link this to a project drive is still an open question.
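
One possible pattern, sketched here only as an idea: write the reconstruction to fast local storage first, and copy the finished folder to the shared location afterwards. The function name and folder names below are hypothetical, and the temporary directories just stand in for the local scratch space and the shared folder.

```python
import shutil
import tempfile
from pathlib import Path


def publish_reconstruction(local_dir: Path, shared_root: Path) -> Path:
    """Copy a finished reconstruction folder from local storage to a shared folder.

    Raises FileExistsError if a folder with the same name is already there,
    so an existing reconstruction is never silently overwritten.
    """
    target = shared_root / local_dir.name
    shutil.copytree(local_dir, target)
    return target


# Toy demonstration with temporary directories instead of real drives.
with tempfile.TemporaryDirectory() as tmp:
    scratch = Path(tmp) / "scratch" / "recon_example"   # hypothetical local output
    scratch.mkdir(parents=True)
    (scratch / "result.txt").write_text("reconstruction output placeholder")

    shared = Path(tmp) / "shared"                       # hypothetical shared folder
    shared.mkdir()

    copied = publish_reconstruction(scratch, shared)
    print(sorted(p.name for p in copied.iterdir()))
```

This way the reconstruction itself never waits on the network drive; only the final copy does.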

Also I'm thinking of a good way to organise this. Currently the experiments all have their own folders, IDs and metadata, but the reconstructions are stored in a completely different folder with only a timestamp as a name. I consider the reconstruction to be a logical result of the dataset, so I believe it should be stored alongside the ptychograms, and I would opt for this. But then there is still a problem: how do we make that one reconstruction which was particularly successful stand out? We could just delete all other reconstructions, but then the metadata on which hyperparameters fail to reconstruct is not stored. This is in effect our current approach, as we only move a reconstruction to the shared drive when someone else asks for it (and it is thus successful enough).
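
A minimal sketch of one way this could work, assuming each reconstruction gets its own timestamped subfolder under the dataset and a small JSON file describing it. The folder layout, the status labels and the hyperparameter names are all made up for illustration:

```python
import json
import tempfile
from pathlib import Path


def record_reconstruction(dataset_dir, timestamp, hyperparameters, status, note=""):
    """Write a JSON record for one reconstruction attempt next to its dataset."""
    recon_dir = Path(dataset_dir) / "reconstructions" / timestamp
    recon_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "timestamp": timestamp,
        "hyperparameters": hyperparameters,
        "status": status,  # hypothetical labels: "failed", "ok", "best"
        "note": note,
    }
    (recon_dir / "metadata.json").write_text(json.dumps(record, indent=2))
    return recon_dir


def find_by_status(dataset_dir, status):
    """Return the records of all reconstruction attempts with the given status."""
    records = []
    for path in sorted(Path(dataset_dir).glob("reconstructions/*/metadata.json")):
        record = json.loads(path.read_text())
        if record["status"] == status:
            records.append(record)
    return records


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "experiment_0001"  # hypothetical experiment ID
    record_reconstruction(
        dataset, "20230510T101500",
        {"iterations": 200, "step_size": 0.5}, "failed", "did not converge",
    )
    record_reconstruction(
        dataset, "20230510T143000",
        {"iterations": 500, "step_size": 0.1}, "best",
    )
    print([r["timestamp"] for r in find_by_status(dataset, "best")])
    print([r["hyperparameters"] for r in find_by_status(dataset, "failed")])
```

The point of the small record is that the bulky output of a failed attempt could be deleted while the hyperparameters that failed stay on file, and the successful reconstruction can be found by its status rather than by remembering a timestamp.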

The ROIs are very much determined on a reconstruction-to-reconstruction basis. They usually form the basis for some plots in an eventual publication. I think the documentation for this should be in the script which creates the plot, which then should always accompany the reconstruction (if such analysis was performed on the reconstruction). This is also the case now, but it should be noted that by far most reconstructions do not undergo such a process, so it can be difficult to find that one single reconstruction which was deemed good enough to publish about.

Edit: It seems like data repositories provide a nice way to structure data without having to worry too much about storage folders and make an overview of different datasets / reconstructions to make them findable, but I believe publishing to a repository usually also means the data is made public. Are there any private repository options (such as gitlab for code) which also allow for keeping track of the different datasets generated?

Hihaatje commented 1 year ago

My readme file:

https://surfdrive.surf.nl/files/index.php/s/r4IFIQab2lAEiem

Hihaatje commented 1 year ago

My Week 2 data flow map: https://surfdrive.surf.nl/files/index.php/s/LMpGngnlDBYQ66q

YiZhang025 commented 1 year ago

Hi @Hihaatje,

First of all, thank you very much for your response to my data flow. I have to say your flow map is more than clear even for a person who is not involved in your project! I found most of my questions have been already addressed by Esther - with only one possible question for your post-process: Will they be automatic and parameterized? As there is always a need to re-generate the results for many test samples (in my case), which will be quite tedious working in jupyter notebook/MATLAB command lines. Anyway, it was a good and inspiring flowmap.

EstherPlomp commented 1 year ago

Thanks for submitting Assignment 3 @Hihaatje! It looks very clear and extensive: well done!

Some thoughts on some of your comments:

File formats: I don't know the specific consequences of SPE vs TIFF, for example. As you indicate that there are open source solutions that can be used to read SPE files, it may be less urgent to convert your images to an open format. There is a fine balance between making things available in open formats, which ensures better preservation for the long term (because the file formats can be opened and maintained), and reusability where all the functionalities are preserved. It could be possible to store two copies of files, but only when the data is not massive; otherwise sustainability questions of your data storage become more important.

.svg is indeed a good approach to illustrator files!

Data Publication: It's great you found a discipline-specific repository! CXIDB makes use of a standardised file format I see, which is fantastic. I do think that they indeed only use CC0 for the deposits. This is not necessarily bad, as they also highlight the norms of people citing the data that they reuse. But it is less enforceable compared to CC-BY. It can also be that re-users could distribute parts of the original dataset under a different license (which you wanted to prevent with your preference for CC BY-SA). All things to consider when sharing the data here!

Regarding restricted access: It is possible to leave the data at university storage solutions (such as the project drive) and have your supervisor/PI in charge of the data once you leave the university. This is a question that will also come back in your data management plan :). Alternatively, you can share the data/code in a data repository and place it under restricted access. It should then be possible for people to contact you via a button/form, or via the contact details that you have provided in the READme file or publicly available metadata information. I hope this makes sense? I realise that this is a bit abstract when explaining it via text, so please let me know if it's not clear!

EstherPlomp commented 1 year ago

And thanks for sharing your READme! It already looks very clear and extensive - well done!

I like the inclusion of a link to the lab's website instead of physical address. That can indeed be more helpful :)
It can also help to include the ORCID information. An ORCID is a persistent identifier for you as a researcher. You can set up your own ORCID here: https://orcid.org/register and have a look at my ORCID to see what this looks like when you have more research output.
The sample description information line is empty? Is that purposefully or does this need to contain more information?
I can't directly open the link to the further documentation in the pdf file - do I need a specific software to open it?

Hihaatje commented 1 year ago

Hey Yi,

Will they be automatic and parameterized? As there is always a need to re-generate the results for many test samples (in my case), which will be quite tedious working in jupyter notebook/MATLAB command lines.

Good question! If I go by the data processing we have currently done, then it was definitely not automatic. A lot of work was done by hand and is specific to a reconstruction, with very few possibilities for reuse. Some bits / tools that we use in this process can be reused though, so I'll be sure to store these in a separate project.
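
To give an idea of what a parameterized version of such a reusable bit could look like, here is a minimal sketch. It assumes a rectangular ROI and a reconstruction held as a plain 2D list, which are simplifications for illustration and not the actual processing used in this project; the function and sample names are hypothetical.

```python
from statistics import mean


def roi_mean(image, roi):
    """Mean value inside a rectangular ROI.

    The ROI is (row_start, row_stop, col_start, col_stop), with the stop
    indices excluded, as in normal Python slicing.
    """
    r0, r1, c0, c1 = roi
    return mean(value for row in image[r0:r1] for value in row[c0:c1])


def run_all(jobs):
    """Apply the same analysis to every job; each job has 'name', 'image', 'roi'."""
    return {job["name"]: roi_mean(job["image"], job["roi"]) for job in jobs}


# Toy 3x3 "reconstruction" and two hypothetical samples with different ROIs.
toy = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
jobs = [
    {"name": "sample_a", "image": toy, "roi": (0, 2, 1, 3)},
    {"name": "sample_b", "image": toy, "roi": (1, 3, 0, 2)},
]
results = run_all(jobs)
print(results)
```

Because the ROI is an explicit parameter rather than something selected by hand, the list of jobs doubles as a record of which region was analysed for which sample, and re-generating the results is a single call.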

Hihaatje commented 1 year ago

Hey @EstherPlomp, thank you for your feedback! To answer your questions:

The sample description information line is empty? Is that purposefully or does this need to contain more information?

Eventually this will contain a short description of the sample information, however I currently don't know much about the samples (I based this on what data I could find from a previous project myself).

I can't directly open the link to the further documentation in the pdf file - do I need a specific software to open it?

Hmm, I seem to have this issue too; I never checked the link. I think Chrome no longer supports FTP and defaults to searching for the link. When I try to use proper FTP software to open the specification it seems I am also not in luck, as it currently requires access rights, which I do not have. So it seems like the only source of the file format specification is inaccessible (great work Princeton Instruments!). I found another version somewhere else through extensive googling, but I don't believe this is supposed to be publicly accessible, so I doubt I can share it further?

YiZhang025 commented 1 year ago

Hi! Just checked your readme file - it's very clear to me. Just maybe a silly question: in your field, is the methodological information in the same format for other datasets? As I have some experience in CellProfiler in microscopy, it can generate meta information in the same format (mostly .XML), which is easy to manage.

EstherPlomp commented 1 year ago

@Hihaatje! No worries about placeholders or not knowing for sure yet what you'll need to fill out - I was just curious.

Can you please email me the manual so that I can perhaps have a look at whether it has any information about redistribution/reuse? Thanks!

Hihaatje commented 1 year ago

My final assignment slide: https://surfdrive.surf.nl/files/index.php/s/fzMug3GfNQaqfqB

EstherPlomp commented 1 year ago

Hi @Hihaatje! I don't think I saw feedback from you on other people's assignment 3 and READme file: did you respond with an emoji?

moengels commented 1 year ago

My readme file:

https://surfdrive.surf.nl/files/index.php/s/r4IFIQab2lAEiem

Hi @Hihaatje, quite extensive readme file, I suppose you have the setup on which you recorded the data somewhere documented and the parameters depicted are solely the relevant ones for the reconstructions?

Hihaatje commented 1 year ago

A bit late to the party, but:

Hi! Just checked your readme file - it's very clear to me. Just maybe a silly question: in your field, is the methodological information in the same format for other datasets? As I have some experience in CellProfiler in microscopy, it can generate meta information in the same format (mostly .XML), which is easy to manage.

No, there is no real standard here. Since this technology is still very much in development with no commercial products, standards are definitely lacking.

Hi @Hihaatje, quite extensive readme file, I suppose you have the setup on which you recorded the data somewhere documented and the parameters depicted are solely the relevant ones for the reconstructions?

The parameters are all the parameters which define the experimental conditions, outside of the setup. Indeed it might be wise to also include an overview of the current setup configuration in the readme, since things tend to change of course.

Hihaatje commented 1 year ago

Hi @Hihaatje! I don't think I saw feedback from you on other people's assignment 3 and READme file: did you respond with an emoji?

@EstherPlomp I just figured out what went wrong, I had a typo in my reply here which was on assignment 3 not on assignment 2. I've edited this now so it's correct. Is the feedback on the READme file okay like this as well?

EstherPlomp commented 1 year ago

Thanks for following up @Hihaatje! Yes - I'll mark things as completed from here 👍
