[x] 1. In the update function there is this multiplication by 1 that we discussed at the very beginning, where you mentioned this could be used if you have data that arrives in packets, or something similar. Is it required to leave it in the code or should I remove it since it introduces an unnecessary multiplication in my use-case and is not going to be part of the benchmarking?
[x] 2. I have multiple files that are not shared on the GitHub repo - the large input data files (~10GB) and the profiling reports from VTune (~20GB). Do I have to include sources for these in the thesis, and if yes, how?
Writing
[x] 3. In your opinion, what could be a good outline of the problem statement for my thesis? (Do I talk about the need for more throughput as datastreams grow even more, do I talk about how vectorization works and what needs to be considered (e.g. compute bounds) before implementing it, or do I talk about something else?)
[x] 4. Do I provide pseudo-code for the join/self_join functions for both algorithms in the description section of those algorithms, or just for the update functions?
[ ] 5. In which part does it make most sense to discuss the baseline benchmark results (with graphs etc)? I was initially planning to discuss them after I introduce the test conditions under Approach, e.g. after talking about implementation and tools, but now I'm thinking that all results should go under Evaluation? Please let me know
This depends on what your stream looks like. Sometimes it's just a stream of keys, sometimes it's a key-frequency tuple. If you have only implemented the former case that's perfectly fine and you don't have to mention the latter.
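To illustrate the distinction, here is a minimal sketch of such an update function in Python. This is not the thesis code; the Count-Min-style structure, the class name, and all parameters are illustrative assumptions. The point is only that `count` defaults to 1 for a plain key stream, so the multiplication/scaling step can be dropped in that case:

```python
import random

class CountMinSketch:
    """Illustrative Count-Min-style sketch; names and layout are hypothetical."""

    def __init__(self, depth=4, width=256, seed=42):
        rng = random.Random(seed)
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]
        # one salt per row to derive independent hash functions
        self.salts = [rng.getrandbits(32) for _ in range(depth)]

    def update(self, key, count=1):
        # Key-frequency stream: each arrival carries a count.
        # Plain key stream: count is always 1, so the scaled update
        # degenerates to a simple increment and the multiplication
        # by 1 can be removed from the hot loop.
        for row, salt in zip(self.rows, self.salts):
            row[hash((salt, key)) % self.width] += count

    def estimate(self, key):
        # Count-Min returns the minimum over rows (an overestimate).
        return min(row[hash((salt, key)) % self.width]
                   for row, salt in zip(self.rows, self.salts))
```

For benchmarking a key-only stream it is therefore reasonable to specialize `update` to `+= 1` and drop the `count` parameter entirely.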
For the input files: You have probably generated them using some script. The script is good enough. For the VTune Reports: No need to upload these huge reports. I'm perfectly fine with the command you used to create them.
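A generation script of the kind mentioned above could be as small as the following sketch. The function name, file format (one integer key per line), and uniform distribution are all assumptions for illustration; the actual thesis script may differ:

```python
import random

def generate_key_stream(path, n_keys, key_space, seed=0):
    """Write n_keys uniformly random integer keys to path, one per line.

    Hypothetical example of a reproducible input generator: fixing the
    seed means the (large) data file itself need not be archived.
    """
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(n_keys):
            f.write(f"{rng.randrange(key_space)}\n")
```

Because the output is fully determined by the parameters and the seed, citing the script (or even just this function with its arguments) in the thesis is sufficient to reproduce the 10GB inputs.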
The problem is high throughput sketch maintenance. If it's just about making clear what the problem is and why it's important, I would go for the growing data streams. Vectorization and everything related to it is the way you approach the problem.
Resolve this before benchmarking any throughput.