ThomasFaria / retex-innovation-insee

https://thomasfaria.github.io/retex-innovation-insee/authorsample.pdf

about the coupling between compute and storage #15

Closed by RLesur 4 months ago

RLesur commented 4 months ago

On first reading, the part about Tigani's article and Insee's experiments seemed contradictory to me.

It's written:

even substantial data processing jobs may end up using “far less compute than anticipated [...] and might not even need to use distributed processing at all”.

and further down:

Interestingly, subsequent projects involving large datasets didn’t suffer much from this change, as their needs were actually very much in line with Tigani’s observations: the performance bottleneck for these projects was generally on the side of computational needs rather than storage capacity, making Hadoop-style clusters less relevant

I understand that the subject is the constraint imposed by the coupling, but the two examples point in opposite directions, which makes the passage a bit confusing to read.

mpjoubertdebellefon commented 4 months ago

Yes, I also didn't understand the sentence "subsequent projects involving large datasets didn't suffer much from this change": why would they suffer from "this change" if it is the one described above, which is supposed to be something positive?

avouacr commented 4 months ago

Maybe it would be clearer if I just removed the phrase altogether, giving:

Despite this increase in performance, this type of architecture was not reused later for other projects, mainly because it proved expensive and complex to maintain, requiring specialized technical expertise rarely found within NSOs \cite{vale2015international}. Although these new projects could still involve substantial data volumes, we observed that effective processing could be achieved using conventional software tools (R, Python) on single-node systems by leveraging recent major innovations from the data ecosystem.

The point being: Hadoop clusters were hard and costly to maintain -> not reused in subsequent projects involving large data -> switching back to single-node architectures didn't hurt performance, because we had more mature single-node technologies (Parquet, Arrow, etc.)

What do you think, @RLesur @mpjoubertdebellefon?

RLesur commented 4 months ago

LGTM