Closed kraus-s closed 2 years ago
Oh and it also makes everything very slow and unresponsive...
I can have a look. Do you happen to have any idea why this is the case? Is something in particular hogging all those resources?
I narrowed it down: it happens when I click on Show text matrix. It maxes out one core and slowly eats into the RAM until it goes OOM. It does not stop when the tab is closed or a different function is selected; the only way to stop it is to kill Streamlit by closing the terminal.
I see. My guess would be that the text×manuscript matrix bloats up in serialization (maybe it wouldn't be an issue if we could enable Arrow; see the separate issue coming right after), and then you have this huge data in memory, in the Streamlit cache, and in the browser.
In any case, for now I'd just deactivate the Show text matrix button, as it's essentially useless anyway. The matrix is intended for internal lookups, not for display, so it doesn't need to be sent to Streamlit.
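To illustrate the point about internal lookups: a membership matrix like this can stay server-side and be queried directly, without ever rendering it in the UI. A minimal sketch with a hypothetical toy matrix (the real row/column labels and builder code are assumptions, not the project's actual names):

```python
import pandas as pd

# Hypothetical text×manuscript matrix: rows = texts, columns = manuscript IDs,
# values = True where a manuscript contains the text.
matrix = pd.DataFrame(
    {"MS-1": [True, False], "MS-2": [True, True]},
    index=["Text A", "Text B"],
)

# Internal lookup: which manuscripts contain "Text A"?
# This never sends the full matrix to the browser.
hits = matrix.columns[matrix.loc["Text A"]].tolist()
print(hits)  # ['MS-1', 'MS-2']
```

A lookup like this is cheap and keeps the big frame out of Streamlit's serialization path entirely.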
Some of the performance issues seem to happen earlier:

```
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of
calling `frame.insert` many times, which has poor performance. Consider joining all
columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use
`newframe = frame.copy()`
  res[t] = False
```
This is displayed after collecting the MS metadata, so that code seems to need streamlining too.
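The fix the warning itself suggests is to build all columns in one step rather than assigning them one at a time. A sketch with made-up text/manuscript labels (the real `res` construction in the project may differ):

```python
import pandas as pd

texts = ["t1", "t2", "t3"]
manuscripts = ["m1", "m2"]

# Fragmenting pattern the warning complains about: one column insert per loop
# iteration, e.g.
#   res = pd.DataFrame(index=manuscripts)
#   for t in texts:
#       res[t] = False  # triggers PerformanceWarning when repeated many times

# De-fragmented alternative: allocate all columns in a single constructor call
# (or assemble them with pd.concat(axis=1) when the columns differ).
res = pd.DataFrame(False, index=manuscripts, columns=texts)
print(res.shape)  # (2, 3)
```

Building the frame once avoids the repeated internal block reallocation that makes `frame.insert` slow.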
Also, a huge amount of RAM is used while building up the data, because each XML is loaded, the string is saved to the DataFrame, a soup is built from the string, and the soup is also stored in the DataFrame. This builds up a couple of gigabytes that are only released at the end, when contents and soups are dropped from the df. This should be optimized too.
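One way to avoid holding every raw string and parsed tree at once is to parse each file, extract just the needed fields, and let the parse tree be garbage-collected immediately. A sketch using the stdlib parser as a stand-in for BeautifulSoup; `extract_metadata`, `collect`, and the extracted fields are hypothetical names, not the project's actual API:

```python
from pathlib import Path
import xml.etree.ElementTree as ET  # stand-in for BeautifulSoup


def extract_metadata(xml_path: Path) -> dict:
    # Parse one file, pull out only the small fields we need; the full tree
    # goes out of scope (and is freed) when this function returns.
    root = ET.parse(xml_path).getroot()
    return {"file": xml_path.name, "root_tag": root.tag}


def collect(xml_dir: str) -> list[dict]:
    # Only the small metadata dicts accumulate; neither the raw XML strings
    # nor the parse trees are ever stored in a DataFrame column.
    return [extract_metadata(p) for p in sorted(Path(xml_dir).glob("*.xml"))]
```

Peak memory then scales with one parsed file plus the extracted metadata, rather than with the whole corpus.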
Actually, the performance warning posted above comes from the text-mss-matrix; it should be solvable with #56.
However, loading all the XMLs and making soup from them still brings RAM up to 8.8 GB. Loading metadata, persons, and texts only brings it up to 9.2 GB. When contents and soups are dropped, it goes down to idling at <3 GB.
I think here is a lot of room for improvement.
@kraus-s did you have a chance to monitor the resource consumption last time you re-built the database? can this be closed? or is it still an issue?
As far as I can tell, this is no longer an issue. I didn't monitor the memory usage last time, but it didn't take very long. Memory use probably won't go down further unless we streamline the XML soup kitchen, e.g. by batching, I guess.
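Batching the soup kitchen could be as simple as yielding fixed-size groups of file paths so that only one batch of parsed soups is alive at a time. A minimal sketch (the helper name `batched` and the batch size are assumptions; Python 3.12 ships `itertools.batched` for the same purpose):

```python
from pathlib import Path


def batched(paths, size):
    # Yield lists of at most `size` paths; the caller parses one batch,
    # extracts what it needs, and drops the soups before the next batch.
    batch = []
    for p in paths:
        batch.append(p)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


paths = [Path(f"{i}.xml") for i in range(5)]
sizes = [len(b) for b in batched(paths, 2)]
print(sizes)  # [2, 2, 1]
```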
When working with the current 'prepareRC' it hogs unnecessary amounts of resources. Permanent use of 9-11 GB of RAM and upwards of 20% CPU time seems a bit over the top/inefficient for what we are doing. Oh, and try explaining this kind of permanent load to DASCH or whoever might be willing to host us in the future :D