EstherPlomp / TNW-RDM-101

Self paced materials of the RDM101 course
https://estherplomp.github.io/TNW-RDM-101/
Creative Commons Attribution 4.0 International

Assignment 1 Giacomo Lastrucci #83

Closed giacomolastrucci closed 11 months ago

giacomolastrucci commented 1 year ago

Introduction

Hi all, my name is Giacomo, I am Italian and recently started a PhD at TU Delft in Chemical Engineering. I will be working on applying AI and machine learning in the process industry. I am passionate about travelling, hiking and cooking!

Describe your research in 2-3 sentences to someone that is not from your field (please avoid abbreviations)

During my research, I will be working on an alternative modeling technique for chemical processes, placed between traditional mechanistic modeling and novel machine learning methods. Recent deep learning models are able to extract features and learn from massive amounts of data without knowing any physical law. I will try to give these models the ability to learn the physics behind the problems to increase their prediction accuracy.

My research entails the following aspects:

| Research Aspect | Answer |
| --- | --- |
| Use/collect personal data (health data, interviews, surveys) | No |
| Use/collect experimental data (lab experiments, measurements with instruments) | No |
| Collaborate with industry | Yes |
| Write/develop software as the main output of the project | Yes |
| Use code (as in programming) for data analysis | Yes |
| Work with large data (images, simulation models) | Yes |
| Other: | N/A |

Reflections on the importance of RDM videos

I completely agree with the importance of proper data management. Although finding the right organizational setup can be a bit of a pain at the beginning, it pays off in the long term. Apart from "dramatic" horror stories, a proper data management setup can save you a considerable amount of time on a daily basis. Easy retrieval of data, active backups and proper storage are of primary importance in research.

What would you like to learn during this course?

Learn the best practices for collecting the experimental outcomes of my research (e.g., training runs, the hyperparameters used and the network architecture), storing them appropriately and retrieving them at a later stage. I would also like to learn more about how to properly store huge amounts of large data such as images.
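One possible way to keep track of training runs is to write a small metadata file next to each run's outputs. The sketch below is only an illustration under stated assumptions: the `RunRecord` class, its fields, the `run.json` file name and the one-folder-per-run layout are all hypothetical and are not taken from the course materials or from any particular experiment-tracking tool.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import List


@dataclass
class RunRecord:
    """Hypothetical description of a single training run."""

    run_id: str
    architecture: dict
    hyperparameters: dict
    metrics: dict = field(default_factory=dict)
    # Timestamp in UTC so runs from different machines sort consistently.
    created_utc: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def save_run(record: RunRecord, root: Path) -> Path:
    """Write the record to <root>/<run_id>/run.json and return that path."""
    run_dir = root / record.run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / "run.json"
    path.write_text(json.dumps(asdict(record), indent=2, sort_keys=True))
    return path


def load_runs(root: Path) -> List[RunRecord]:
    """Read back every run.json found one level below root, sorted by path."""
    return [
        RunRecord(**json.loads(path.read_text()))
        for path in sorted(root.glob("*/run.json"))
    ]


if __name__ == "__main__":
    # Demo in a temporary folder; a real project would use a fixed location.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        save_run(
            RunRecord(
                run_id="run_001",
                architecture={"hidden_layers": [64, 64]},
                hyperparameters={"lr": 1e-3, "epochs": 50},
                metrics={"val_loss": 0.12},
            ),
            root,
        )
        for run in load_runs(root):
            print(run.run_id, run.hyperparameters, run.metrics)
```

Plain JSON is used here because it is an open, human-readable format that can be opened without Python. Large outputs such as model weights or images would be stored as separate files in the same run folder rather than inside the JSON.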

Checklist assignments

EstherPlomp commented 1 year ago

Copied from #97:

Hello everyone, please find my data flow map here. Feel free to leave any comments or feedback. Thank you!

Best, Giacomo

With feedback from @PJL-vandenberg:

Hey Giacomo,

Looks good! I like the division between the confidential data and the open data. Including training sets that can be public seems like a great idea for transparency. The only question I have is: "What are your plans for code documentation, to prevent bits of code that make sense to you when documented in your way, but might not to your successor?" Because it has happened to me before that somebody handed me 50 lines of code commented with "This makes the dataset from the model data", because the steps made sense to him.

giacomolastrucci commented 1 year ago

Hi @PJL-vandenberg,

Thanks for the feedback. Of course, code documentation is crucial for successors, but also for yourself when you have to go through the code again after a while. I plan to follow some basic good practices in code development, and I guess it will also be important to add line-by-line comments wherever they are needed to understand complicated steps. Thanks again!
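As a minimal sketch of the kind of documentation discussed in this exchange, the function below pairs a docstring (what the function does, its inputs and outputs) with a single inline comment on the one step that is easy to get wrong. The function name, its parameters and the windowing task are hypothetical and only serve as an illustration; they do not describe the actual project code.

```python
from typing import List, Sequence, Tuple


def make_supervised_pairs(
    series: Sequence[float], window: int
) -> List[Tuple[List[float], float]]:
    """Turn a time series into (inputs, target) pairs for training.

    Each pair holds `window` consecutive values as inputs and the value
    that directly follows them as the target.

    Args:
        series: Simulated or measured values, ordered in time.
        window: Number of past values used as inputs; must be >= 1.

    Returns:
        A list of (inputs, target) tuples; empty if the series is too
        short to form a single pair.

    Raises:
        ValueError: If `window` is smaller than 1.
    """
    if window < 1:
        raise ValueError("window must be >= 1")

    pairs = []
    # Stop `window` steps before the end so every input block has a target.
    for start in range(len(series) - window):
        inputs = list(series[start:start + window])
        target = series[start + window]
        pairs.append((inputs, target))
    return pairs


if __name__ == "__main__":
    for inputs, target in make_supervised_pairs([1.0, 2.0, 3.0, 4.0], window=2):
        print(inputs, "->", target)
```

Compared with a single comment such as "This makes the dataset from the model data", the docstring tells a successor what goes in, what comes out and what can fail, without them having to read the loop.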

EstherPlomp commented 1 year ago

Thanks @giacomolastrucci for sharing assignment 2 - it looks great, well done!

Thanks also @PJL-vandenberg for sharing the struggle you had to go through! Hopefully this will prevent a repeat for any successors the both of you have! The CodeRefinery materials have some pointers for documentation which may help.

eugenhu commented 1 year ago

Hi Giacomo,

It was interesting to read through your datasets in your assignment 3, thanks for sharing. Everything looks great, but here is some feedback you can consider:

EstherPlomp commented 1 year ago

Thanks for sharing assignment 3 @giacomolastrucci, and for updating the information on the file formats as well! It looks super clear and well structured so I have very little feedback, well done!

Fair enough that you'll need to discuss the data publication part with your supervisor. If you plan to share scripts/code via GitHub, do also share a snapshot of that via a data repository. GitHub is a great place to share code and to collaborate on it with others, but it doesn't have a long-term preservation policy and it doesn't assign DOIs. Data repositories like Zenodo or 4TU.ResearchData do have this, and Zenodo has a nice integration with GitHub. See here for more info: https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content

And I didn't see a software license preference in your assignment, but we discussed that in #66 https://github.com/EstherPlomp/TNW-RDM-101/discussions/66#discussioncomment-7158917

Thanks for your feedback and also for sharing Papers with Code @eugenhu! I'll check this out as well!

giacomolastrucci commented 1 year ago

Hi @eugenhu, thanks a lot for your comments, and especially for suggesting Papers with Code. I already knew about it, but I definitely need to check it out in more detail. I will also consider using binary files like .npy instead of .csv for storing datasets, even though the latter is even "more" open (for instance, for people not using Python). For the data dictionary, I don't have a strong opinion yet; I think both can be suitable.
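To make the .npy versus .csv trade-off concrete, here is a small sketch, assuming NumPy is installed. It writes the same two-dimensional array in both formats, reads both back and reports the file sizes. The helper name `roundtrip_report`, the column names `x0, x1, ...` and the random example data are hypothetical; actual sizes will depend on the dataset and on the text format chosen for the CSV.

```python
import tempfile
from pathlib import Path

import numpy as np


def roundtrip_report(data: np.ndarray) -> dict:
    """Save a 2-D float array as .npy and .csv, reload both, and compare."""
    with tempfile.TemporaryDirectory() as tmp:
        npy_path = Path(tmp) / "data.npy"
        csv_path = Path(tmp) / "data.csv"

        # Binary format: stores dtype and shape alongside the raw values.
        np.save(npy_path, data)

        # Text format: readable anywhere, but dtype and shape are not stored.
        header = ",".join(f"x{i}" for i in range(data.shape[1]))
        np.savetxt(csv_path, data, delimiter=",", header=header, comments="")

        from_npy = np.load(npy_path)
        from_csv = np.loadtxt(csv_path, delimiter=",", skiprows=1)

        return {
            "npy_exact": bool(np.array_equal(data, from_npy)),
            "csv_close": bool(np.allclose(data, from_csv)),
            "npy_bytes": npy_path.stat().st_size,
            "csv_bytes": csv_path.stat().st_size,
        }


if __name__ == "__main__":
    rng = np.random.default_rng(seed=0)
    example = rng.random((1000, 8))  # hypothetical dataset: 1000 rows, 8 features
    print(roundtrip_report(example))
```

One middle ground is to keep .npy as the working format and also export a .csv copy, together with the data dictionary, when publishing the dataset, so that people who do not use Python can still open it.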

giacomolastrucci commented 1 year ago

Hi @EstherPlomp, thanks a lot for your feedback.

Thanks for suggesting Zenodo, I think it can be really suitable for my purposes, especially thanks to the integration with GitHub.