betagouv / analyse-flux-insertion

Tool for analysing data flows and exchanges in the field of social and professional integration (insertion)

fix(lecteur): Handle large files processing #47

Closed by aminedhobb 3 years ago

aminedhobb commented 3 years ago

Main Objectives

In this PR I add the ability to read large files (> 500 MB) from our app, so that departments can compute stats from their monthly flux. To do so I read large files in chunks rather than in their entirety, by slicing the file into blobs and reading each blob with the FileReader `readAsArrayBuffer` method. All the blobs we read are the same size (512 * 1024 bytes). However, we don't keep all the text contained in a chunk: we only extract the information for one application (`<InfosFoyerRsa>...</InfosFoyerRsa>`) and retrieve data from it. The next chunk then starts where that processed application ended. This is described in this drawing:

[drawing: chunk-by-chunk reading of the file, each chunk starting where the previous processed application ended]
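A minimal sketch of the extraction step, under the assumption that one chunk always contains at least one complete application; the names `CHUNK_SIZE` and `extractNextApplication` are illustrative, not the actual code in this PR:

```js
const CHUNK_SIZE = 512 * 1024; // fixed blob size in bytes, as described above

// Given the decoded text of one chunk, pull out the first complete
// <InfosFoyerRsa>...</InfosFoyerRsa> element and report where it ends,
// so the next chunk can start right after the processed application.
function extractNextApplication(chunkText) {
  const openTag = "<InfosFoyerRsa>";
  const closeTag = "</InfosFoyerRsa>";
  const start = chunkText.indexOf(openTag);
  const end = chunkText.indexOf(closeTag, start);
  if (start === -1 || end === -1) {
    return null; // no complete application in this chunk
  }
  return {
    applicationXml: chunkText.slice(start, end + closeTag.length),
    // offset, relative to the start of the chunk, where the next read should begin
    nextRelativeOffset: end + closeTag.length,
  };
}
```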

Also, due to the async nature of the JS FileReader API, I wrapped this logic in a Promise so that chunks are processed one after the other, sequentially.
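A hedged sketch of that wrapping, again with illustrative names and assuming a single-byte encoding such as ISO-8859-1 so that character offsets can stand in for byte offsets:

```js
// Reads one blob of the file and resolves with its decoded text.
function readBlobAsText(blob) {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    // decoding with a single-byte encoding keeps character and byte offsets aligned
    reader.onload = () => resolve(new TextDecoder("iso-8859-1").decode(reader.result));
    reader.onerror = () => reject(reader.error);
    reader.readAsArrayBuffer(blob);
  });
}

// Processes the file chunk by chunk, one application per chunk,
// awaiting each read so chunks are handled strictly one after the other.
async function processLargeFile(file, onApplication) {
  let offset = 0;
  while (offset < file.size) {
    const chunkText = await readBlobAsText(file.slice(offset, offset + CHUNK_SIZE));
    const application = extractNextApplication(chunkText);
    if (!application) break; // no further complete application found
    onApplication(application.applicationXml);
    // the next chunk starts right after the application we just processed
    offset += application.nextRelativeOffset;
  }
}
```

Processing several applications per chunk, as mentioned in the benchmark below, would simply mean looping over `extractNextApplication` within the same decoded chunk before advancing the offset.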

Benchmark:

On Google Chrome, I was able to process a real flux file of 783.4 MB in 1181 seconds (~20 min). If needed, this could be improved by processing several applications per chunk.

Side note

I took this opportunity to correct the miscalculations described in #45, which I had introduced by fetching the ETATDORSA value at the applicant level instead of the application level here.