FAU-CS6 / KDD

Lecture and exercise of "Knowledge Discovery in Databases"
GNU General Public License v3.0

Exercise clustering #59

Closed dominik-probst closed 2 years ago

dominik-probst commented 2 years ago

Closes #40

dominik-probst commented 2 years ago

Thanks for your short review. We had already discussed that a long one won't be possible.

In this case, however, I disagree with your three proposals, which I would like to explain briefly here (I'll stick to your numbering here):

  1. Dividing the first notebook into several single notebooks would have the disadvantage that there would suddenly be more than one notebook per exercise week (with the exception of the first optional exercise sheet, all exercise sessions have been based on a single exercise sheet). This would most likely lead to misunderstandings. Also, I doubt that students will have problems with the length of the notebook (with or without a TOC), but I will ask for feedback on this in the next exercise week and, based on that, possibly split up several notebooks for next semester.
  2. I have already told the students the reason for these additional cells on one of the previous exercise sheets: They simply serve to prevent accidental glancing at the subsequent step-by-step tasks. From the one student I know who had chosen a free task, there was also explicitly positive feedback on this (he needed far fewer cells at that time, but did not find it confusing). If there is different feedback from students in the future, I will of course reconsider this procedure.
  3. Here, of course, it would be nice if our practice sheets aligned. However, I am not convinced that a fit/predict structure is the right design for KDD exercise sheets. From my point of view, the goal of our exercise sheets is to get students to take a closer look at the procedures presented in the lecture. While a fit/predict structure is certainly one of many interesting approaches from a software architecture point of view, subtleties in software design contribute little to the goal I formulated above. On the contrary, it has several disadvantages: First, it makes the subdivision of algorithms into step-by-step tasks much more difficult, since a class must always be in a single code cell (not to mention complicated inheritance workarounds that are hard for students to understand). Second, in my opinion, such a rigid structure discourages even more students with little programming experience from doing the exercises. In summary: Yes, a fit/predict structure is a cool thing and I understand why you chose exactly this paradigm, but I don't see enough benefits towards the goal I see behind the KDD exercises, and I see too many drawbacks.

At this point, however, there is also something that you have not noted, but which I will definitely adopt as a planned improvement in the coming semester: the assert test cases you used on your practice sheet. These will certainly help students to develop more confidence in their own implementations, and for this reason they will definitely be added. However, this is something for next semester, as I want to adapt all my exercise sheets equally.
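To illustrate what such assert test cases could look like below a task cell, here is a minimal sketch. The task itself (`euclidean_distance`) and the chosen check values are hypothetical and are not taken from any of the existing sheets.

```python
import math


def euclidean_distance(a, b):
    """Student implementation under test (hypothetical example task)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Checks that a sheet could ship right below the task cell.
# They fail loudly with an AssertionError if the implementation is wrong.
assert euclidean_distance((0, 0), (3, 4)) == 5.0
assert euclidean_distance((1, 2), (1, 2)) == 0.0
assert math.isclose(euclidean_distance((0, 0), (1, 1)), math.sqrt(2))
print("All checks passed.")
```

A student who runs the cell sees either the confirmation line or the failing check, without having to compare outputs by eye.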

melsigl commented 2 years ago

I strongly disagree with you.

  1. The concern that splitting notebooks no longer conforms to the number of exercise weeks could be solved by communicating it clearly in the sheets. Additionally, splitting notebooks by algorithm, so that each sub-exercise sheet implements one specific algorithm, will ensure that there is no confusion or accidental modification of some functionality. It merely implements the principle of separation of concerns.
  2. As I mentioned, this proposal is for you to decide. I will not reiterate my opinion on that matter here.
  3. I strongly disagree with you on the point that a class will make the exercises more rigid and complicated. Reasons are:
    1. sklearn already follows this structure and it is easy to use. It will not enforce an architecture style but merely a programming paradigm (object-oriented vs. imperative) that is easy to follow. Introducing classes and objects with fit/predict functions will introduce a procedure to follow. As mentioned, sklearn already follows this paradigm. Implementing DBSCAN from scratch, for instance, and then comparing the result with sklearn's implementation uses two paradigms. Introducing this approach will not by any means shift the focus away from implementing the content discussed in our lectures. Additionally, it will prepare our notebooks for automatic tests later, if we wish to provide and use this proposed API.
    2. No inheritance is needed as Python supports duck typing. A class simply has to provide the right functions. Although this concept may seem strange when coming from another object-oriented programming language, it is not hard to understand and simple to follow. In our lecture we already discussed that algorithms first require training before they can be used. Therefore, our students are capable of understanding that a class providing these functions clearly follows this procedure. Students of both courses of study, Data Science and Computer Science Bachelor, attend the lecture Algorithms and Data Structures. Therefore, they should be familiar with object-oriented programming, and that's all we ask here. We do not assume any Python specialities here.
    3. Ideally, yes, a class is defined in a single code cell. Yet in Python this is not necessary. As the lecture and the exercise notebook progress, it is possible to add new functions to a class in another cell. You can also find this in a tutorial provided by TensorFlow.
    4. The implemented functions take the same parameters over and over again, so the same value has to be passed on every call. For instance, k-Means requires the parameter k. In this Jupyter notebook, all function calls receive a value for this parameter k. This parameter, however, is something that should by no means change during the course of training. Therefore, it belongs in an object variable. Yet in all function calls this parameter is chosen individually: every call sets it to the number 2, but no variable is introduced at the beginning of this exercise notebook.
    5. Additionally, I highly doubt that introducing classes in our lectures will further decrease the exercise participation. There are always students who will not attend exercises. Likewise, there are always students who will not implement any of these functions. However, at some point, a Data Scientist has to have at least some programming knowledge because a prototype has to be put into production. As I mentioned in 3.2, our students attended AuD and therefore are capable of following this simple paradigm. They may just need more exercise to get accustomed to programming. We do not expect them to put together a Python module in our lecture and exercise, but following the implementation of some class functions is not asking too much of them.
    6. When students seek to pursue their project or thesis with us, we cannot expect them to suddenly know how to program or structure their code if lecture exercises and projects or theses introduce different standards.
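The points about duck typing, adding a function to a class in a later cell, and keeping k as an object variable can be sketched roughly as follows. This is only an illustration under assumptions: the class layout, the parameters `n_iter` and `seed`, the helper `cluster` and the toy data are hypothetical and do not come from the existing notebooks.

```python
import random


class KMeans:
    """Minimal k-Means sketch; k is set once and stored on the object."""

    def __init__(self, k, n_iter=10, seed=0):
        self.k = k                # fixed for the whole training run
        self.n_iter = n_iter      # hypothetical parameter: number of update rounds
        self.seed = seed          # hypothetical parameter: reproducible initialisation
        self.centroids_ = None

    def fit(self, points):
        rng = random.Random(self.seed)
        # Initialise the centroids with k distinct data points.
        self.centroids_ = rng.sample(list(points), self.k)
        for _ in range(self.n_iter):
            labels = self.predict(points)  # predict is attached in a later cell
            for j in range(self.k):
                members = [p for p, label in zip(points, labels) if label == j]
                if members:
                    # Move the centroid to the mean of its assigned points.
                    self.centroids_[j] = tuple(
                        sum(coords) / len(members) for coords in zip(*members)
                    )
        return self


# --- a later notebook cell: add a further function to the existing class ---
def predict(self, points):
    """Assign each point the index of its nearest centroid."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return [
        min(range(self.k), key=lambda j: sq_dist(p, self.centroids_[j]))
        for p in points
    ]


KMeans.predict = predict


def cluster(model, points):
    """Duck typing: works with any object that offers fit() and predict()."""
    return model.fit(points).predict(points)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
model = KMeans(k=2)
labels = cluster(model, points)
```

Because `fit` only looks up `predict` when it is called, the class can be built up step by step across cells, and `cluster` never checks for a base class.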

In summary, I want to emphasise that the last point is out of scope for this semester due to time constraints. I did not, by any means, want to give the impression that implementing this proposal this semester (for next week's exercise, to be exact) is obligatory. It is merely something to incorporate next semester.

dominik-probst commented 2 years ago

The review shows that you don't share the same opinion. And that's completely okay.

  1. Yes, most people will hopefully understand that two parts are necessary. However, it is also to be expected that many will not even read the first paragraph on the structure of the exercise sheet. Even with the first optional exercise sheet, when students had no such pattern in mind yet, there was one person who had simply overlooked part b. Another disadvantage I see in splitting the exercise into several parts is a very practical one, which is based on the way I run the exercise. I already have to switch between three tabs in the in-person exercise session. With a split, there would be five. In the end, it's the will of the reviewer versus the will of the creator. Normally, it would definitely be important to come to a mutually satisfactory solution here, but we are already much too late with the upload. For this reason, I would ask that we postpone this point of discussion until next semester and stick with my solution for the time being. In the end, I am the person who would have to live with it if more students show up to the exercise only partially prepared.
  2. -
  3. I honestly hadn't thought of the possibility of passing a reference to the function until now. This at least partially invalidates the point about the difficult step-by-step implementation. Regarding many of your points, however, I am of a different didactic opinion. As you said though: let's discuss this at another time and place.

melsigl commented 2 years ago

Regarding point 1: If students do not read the preamble, they may also skim over the task description. We cannot avoid that for all students. Also, when we mention the structure of the exercise in our lecture, not all students will join our lecture sessions, not all students will listen equally (partially because it's at the end of the lecture and attention may wear thin), some may forget it until they download our exercises or join our exercise sessions, and others may forget it altogether. We cannot account for every single event that may or may not occur.

Additionally, scrolling through the notebook to find the tasks they should solve (Option 1: implementing some algorithms on their own vs. Option 2: implementing said algorithms in a guided manner, and then there's also an "option 3" which only shows how to use sklearn's function) may also contribute to missing some tasks, just like the one student you are referring to, even though in that case a whole separate notebook was overlooked. This is indeed unfortunate, yet no hard setback, as this concerned the very first exercise week, which solely covered an introduction to Python and pandas. The student still had the opportunity to catch up on that exercise if he/she needs or wants to do so. We have no bonus points whatsoever this semester and, in the event of introducing such a bonus system in the future, I highly doubt we will award bonus points for this very exercise.

To sum up, I see the following advantages from a student's perspective:

Regarding point 3: I don't think that our didactic opinions are that different. We both want the students to focus on and understand the algorithms and not simply learn them blindly by heart.

My take here is that introducing an object-oriented programming paradigm, and thus inherently providing an API that follows a fit/predict structure, comes naturally when the majority of the algorithms we discuss follow exactly this procedure. Algorithms that do not use a predict function can simply leave out its implementation.

The exercise notebook on A Priori, for instance, already introduces and uses objects to model the data structures Itemset and ItemsetList. Moving these variables together with functions like generate_candidates and scan_candidates into an APriori class, which replaces the wrapper function a_priori, comes more naturally, as all functions used and all variables passed along are then in one place, namely in one A Priori object that has been trained on a specific dataset. I believe it is possible to reflect the same guided walk-through that you used in your notebooks by introducing a class with a fit/predict API (in the A Priori example: first implement prune_itemsets, generate_candidates, then scan_candidates, and then use these functions to build the fit function) without jeopardizing your focus on the algorithm and on conveying exactly the details of said algorithm in our exercises. This, from my point of view, would align with both your and my didactic opinion.
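As a rough sketch of how such a class might look, the following reuses the function names mentioned above (prune_itemsets, generate_candidates, scan_candidates) behind a fit function and has no predict function. Everything else is an assumption made for illustration: the method signatures, the absolute `min_support` count, the use of plain frozensets instead of the notebook's Itemset and ItemsetList objects, and the toy transactions. Candidate generation is simplified to enumerating combinations and then pruning them; it is not the notebook's actual implementation.

```python
from itertools import combinations


class APriori:
    """Hypothetical sketch: frequent itemset mining behind a fit() method."""

    def __init__(self, min_support):
        self.min_support = min_support   # absolute count, set once per object
        self.frequent_itemsets_ = {}     # itemset -> support count, filled by fit()

    def prune_itemsets(self, candidates, frequent, size):
        """Drop candidates that contain an infrequent (size - 1)-subset."""
        return {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, size - 1))
        }

    def generate_candidates(self, frequent, size):
        """Build candidates of the given size from items of frequent itemsets."""
        items = sorted(set().union(*frequent)) if frequent else []
        candidates = {frozenset(c) for c in combinations(items, size)}
        return self.prune_itemsets(candidates, frequent, size)

    def scan_candidates(self, transactions, candidates):
        """Count each candidate and keep those that reach min_support."""
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= self.min_support}

    def fit(self, transactions):
        transactions = [frozenset(t) for t in transactions]
        singles = {frozenset([item]) for t in transactions for item in t}
        frequent = self.scan_candidates(transactions, singles)
        self.frequent_itemsets_ = dict(frequent)
        size = 2
        while frequent:
            candidates = self.generate_candidates(set(frequent), size)
            frequent = self.scan_candidates(transactions, candidates)
            self.frequent_itemsets_.update(frequent)
            size += 1
        return self


transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
model = APriori(min_support=2).fit(transactions)
```

The three helper methods could still be introduced one task at a time, with fit as the final step that ties them together.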

Converting the existing notebooks to follow this procedure is of course time-consuming, but then again, creating the exercises in the first place consumed a tremendous amount of time. Converting them would in no way invalidate the initial time and effort spent, quite the contrary. If you like, I could prepare a version of your exercise that uses a class-based fit/predict approach to evaluate how far the two exercise versions would then diverge. Likewise, I offer my time for converting them so that you are relieved of at least some of the workload ahead. Additionally, I highly appreciate your continuous effort in making this lecture happen and even extending it with an exercise.

Wrap Up

We already agreed on not changing this specific exercise for the current semester. Therefore, as agreed, I will

  1. open an issue to resolve this ongoing discussion, and
  2. accept and merge this PR without any in-depth review from my side (due to time constraints) and any changes from your side (due to my delayed brief review and time constraints regarding the upcoming exercise session).