arbeitsgruppe-digitale-altnordistik / Sammlung-Toole

A new look on Handrit.is data
https://arbeitsgruppe-digitale-altnordistik.github.io/Sammlung-Toole/
MIT License

Latest version extremely resource intensive #52

Closed kraus-s closed 2 years ago

kraus-s commented 3 years ago

When working with the current 'prepareRC', it hogs an unnecessary amount of resources: permanent use of 9-11GB of RAM and upwards of 20% CPU time seems a bit over the top/inefficient for what we are doing. Oh, and try explaining this kind of permanent load to DASCH or whoever might be willing to host us in the future :D

kraus-s commented 3 years ago

Oh and it also makes everything very slow and unresponsive...

BalduinLandolt commented 3 years ago

I can have a look. Do you happen to have any idea why this is the case? Is something in particular hogging all those resources?

kraus-s commented 3 years ago

I narrowed it down: it happens when I click on "Show text matrix". It maxes out one core and slowly starts eating into the RAM until it goes OOM. It won't stop if the tab is closed, a different function is selected, etc. The only way to stop it is to kill Streamlit by closing the terminal.

BalduinLandolt commented 3 years ago

I see. My guess would be that the text x manuscript matrix bloats up in serialization (maybe it wouldn't be an issue if we could enable Arrow; see the separate issue coming right after), and then you have this huge data in memory, in the Streamlit cache, and in the browser.

In any case, I'd just deactivate the "Show text matrix" button for now, as it's essentially useless anyway. The matrix is intended for internal lookups, not for display, so it doesn't need to be sent to Streamlit.
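
For illustration, something along these lines could work: keep the matrix cached server-side and only show a summary instead of serializing the whole thing to the browser. (`build_text_matrix` is just a placeholder for whatever function actually builds the matrix, not our real code.)

```python
import pandas as pd
import streamlit as st


@st.cache_data  # older Streamlit versions used @st.cache instead
def build_text_matrix() -> pd.DataFrame:
    # ... build the texts x manuscripts matrix here (placeholder) ...
    return pd.DataFrame()


matrix = build_text_matrix()

# Use the matrix for internal lookups only; show just its shape, not the data.
st.write(f"Text matrix loaded: {matrix.shape[0]} texts x {matrix.shape[1]} manuscripts")
```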

BalduinLandolt commented 3 years ago

Some of the performance issues seem to happen earlier. After collecting the MS metadata, the following warning is displayed:

```
PerformanceWarning: DataFrame is highly fragmented.  This is usually
the result of calling `frame.insert` many times, which has poor performance.
Consider joining all columns at once using pd.concat(axis=1) instead.
To get a de-fragmented frame, use `newframe = frame.copy()`
  res[t] = False
```

So this code seems to need streamlining too.
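
For illustration, this is roughly the pattern the warning complains about vs. what pandas suggests (the index and column values here are dummies, not our actual data):

```python
import pandas as pd

mss = [f"ms{i}" for i in range(1000)]
texts = [f"text{i}" for i in range(500)]

# Fragmented: every `res[t] = ...` inserts one column into the existing frame,
# which is what triggers the PerformanceWarning.
res = pd.DataFrame(index=mss)
for t in texts:
    res[t] = False

# Defragmented: build all columns first, then create the frame in one go
# with pd.concat(axis=1), as the warning suggests.
columns = {t: pd.Series(False, index=mss) for t in texts}
res = pd.concat(columns, axis=1)
```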

Also, a huge amount of RAM is used in the process of building up the data, because each XML file is loaded, the string is saved to the DataFrame, a soup is built from the string, and the soup is also stored in the DataFrame. This builds up a couple of gigs that are only released at the end, when contents and soups are dropped from the DataFrame. This should be optimized too.
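
A rough sketch of what per-file processing could look like, so neither the raw string nor the soup ever lands in the DataFrame (`extract_metadata` and the extracted fields are hypothetical, not our actual code):

```python
from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup  # the "xml" parser requires lxml


def extract_metadata(soup: BeautifulSoup) -> dict:
    # Hypothetical extraction: pull only the fields we actually keep.
    ms_id = soup.find("msIdentifier")
    return {"shelfmark": ms_id.get_text(strip=True) if ms_id else None}


def load_all(xml_dir: str) -> pd.DataFrame:
    rows = []
    for path in Path(xml_dir).glob("*.xml"):
        text = path.read_text(encoding="utf-8")
        soup = BeautifulSoup(text, "xml")
        rows.append(extract_metadata(soup))
        # `text` and `soup` go out of scope each iteration, so neither the raw
        # string nor the parsed tree accumulates alongside the result rows.
    return pd.DataFrame(rows)
```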

BalduinLandolt commented 3 years ago

Actually, the performance warning posted above comes from the text-mss matrix. Should be solvable with #56.

BalduinLandolt commented 3 years ago

However, loading all the XMLs and making soup from them still brings RAM usage up to 8.8GB. Loading metadata, persons, and texts on top of that only brings it up to 9.2GB. When contents and soups are dropped, it goes down to idling at <3GB.
I think there is a lot of room for improvement here.

BalduinLandolt commented 2 years ago

@kraus-s Did you have a chance to monitor the resource consumption the last time you re-built the database? Can this be closed, or is it still an issue?

kraus-s commented 2 years ago

As far as I can tell, this is no longer an issue. I didn't monitor the memory usage last time, but it didn't take very long. It probably won't go down further unless we streamline the XML soup kitchen, e.g. by batching, I guess.
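
For the record, a rough sketch of what such batching could look like: parse the files in chunks and yield small DataFrames, so only one batch of soups is in memory at a time (file layout and extracted fields are placeholders, not our actual code).

```python
from itertools import islice
from pathlib import Path
from typing import Iterator

import pandas as pd
from bs4 import BeautifulSoup  # the "xml" parser requires lxml


def parse_batches(xml_dir: str, batch_size: int = 100) -> Iterator[pd.DataFrame]:
    """Yield small DataFrames instead of holding every soup in memory at once."""
    paths = iter(sorted(Path(xml_dir).glob("*.xml")))
    while batch := list(islice(paths, batch_size)):
        rows = []
        for path in batch:
            soup = BeautifulSoup(path.read_text(encoding="utf-8"), "xml")
            title = soup.find("title")
            rows.append({"file": path.name, "title": title.get_text(strip=True) if title else None})
        yield pd.DataFrame(rows)


# Usage: write each batch out (or collect and concatenate at the end).
# frames = list(parse_batches("data/xml"))
# df = pd.concat(frames, ignore_index=True)
```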
