obilodeau closed this issue 3 years ago.
https://ffmpeg.org/ffmpeg-formats.html#mov_002c-mp4_002c-ismv:
The mov/mp4/ismv muxer supports fragmentation. Normally, a MOV/MP4 file has all the metadata about all packets stored in one location (written at the end of the file, it can be moved to the start for better playback by adding faststart to the movflags, or using the qt-faststart tool). A fragmented file consists of a number of fragments, where packets and metadata about these packets are stored together. Writing a fragmented file has the advantage that the file is decodable even if the writing is interrupted (while a normal MOV/MP4 is undecodable if it is not properly finished), and it requires less memory when writing very long files (since writing normal MOV/MP4 files stores info about every single packet in memory until the file is closed). The downside is that it is less compatible with other applications.
Specifically:
it requires less memory when writing very long files
:open_mouth:
After video encoding changes but before I started backporting Alex's patches.
After a couple of minutes, the process started to consume around 800 MB. After an hour or so it was still around that level. After 3 hours it started increasing, although it sometimes dropped significantly.
Here's a screenshot after 4h30m at around 1.8GB RES:
Memory pressure is increasing. I still think we have a leak in the Python code, or we need to call the GC manually.
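Forcing a collection is a quick way to test the "call the GC manually" hypothesis (a stdlib sketch, not project code):

```python
import gc

# One manual collection pass over all generations; returns the number of
# unreachable objects that were found and collected.
unreachable = gc.collect()
print("collected:", unreachable)

# If RSS stays high after an explicit collect(), the objects are still
# reachable from somewhere: a genuine reference leak, not lazy collection.
```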
Houston, we have a leak
Killed the process. At least we know we fixed the corrupted MP4 issue:
Managed to convert 5 hours of video.
I tried vprof, but it doesn't generate a full profile if the application doesn't terminate successfully. I used it on a smaller conversion job, but the output isn't useful:
Then I tried memory_profiler, but all it shows is memory decreasing:
Line # Mem usage Increment Occurences Line Contents
============================================================
134 569.383 MiB -713.645 MiB 1029 @profile
135 def writeFrame(self):
136 569.383 MiB -413.707 MiB 1029 w = self.stream.width
137 569.383 MiB -413.707 MiB 1029 h = self.stream.height
138 569.383 MiB -387.512 MiB 1029 surface = self.imageHandler.screen.scaled(w, h) if self.scale else self.imageHandler.screen.copy()
139
140 # Draw the mouse pointer. Render mouse clicks?
141 569.383 MiB -401.105 MiB 1029 p = QPainter(surface)
142 569.383 MiB -401.340 MiB 1029 p.setBrush(QColor.fromRgb(255, 255, 0, 180))
143 569.383 MiB -401.340 MiB 1029 (x, y) = self.mouse
144 569.383 MiB -401.340 MiB 1029 p.drawEllipse(x, y, 5, 5)
145 569.383 MiB -401.340 MiB 1029 p.end()
146
147 # Output frame.
148 569.512 MiB -123.395 MiB 1029 frame = av.VideoFrame.from_ndarray(qimage2ndarray.rgb_view(surface))
149 569.512 MiB -1327.379 MiB 2017 for packet in self.stream.encode(frame):
150 569.512 MiB -767.219 MiB 988 if self.progress:
151 self.progress()
152 569.512 MiB -766.703 MiB 988 self.mp4.mux(packet)
In mprof mode, it's no more useful:
Going to try guppy next.
Guppy is good but complicated:
Tried Pympler without great results. The Tk reference browser didn't work. The summary views were very similar to guppy's and didn't speak for themselves.
Tried fil-profiler. Segfaulted and trying to get a backtrace from gdb annoyed me greatly (args, python dev symbols, etc.) so I let it go.
Looked at valgrind, but it felt like an adventure (compiling Python, using specific tooling to avoid noise in valgrind, etc.), so I skipped it.
Tried tracemalloc. Adding instrumentation around the ReplayConverter tight loop is what started giving tangible results (assuming the numbers are reliable):
diff --git a/pyrdp/convert/ReplayConverter.py b/pyrdp/convert/ReplayConverter.py
index 0e66ccd..98f3a2b 100644
--- a/pyrdp/convert/ReplayConverter.py
+++ b/pyrdp/convert/ReplayConverter.py
@@ -26,8 +26,23 @@ class ReplayConverter(Converter):
             print("The input file is already a replay file. Nothing to do.")
             sys.exit(1)
 
-        for event, _ in progressbar(replay):
+        import tracemalloc
+        snapshot1 = tracemalloc.take_snapshot()
+        for i, (event, _) in enumerate(replay):
+        #for event, _ in progressbar(replay):
+
             handler.onPDUReceived(event)
+            if i % 100 == 0:
+                snapshot2 = tracemalloc.take_snapshot()
+                top_stats = snapshot2.compare_to(snapshot1, 'lineno')
+
+                print("[ 100 events later ... Top 3 memory differences ]")
+                #for stat in top_stats[:10]:
+                for stat in top_stats[:3]:
+                    print(stat)
+                snapshot1 = snapshot2
+
+
         print(f"\n[+] Succesfully wrote '{outputPath}'")
         handler.cleanup()
and then calling the conversion using this shim:
#!/usr/bin/python3
from pathlib import Path
import tracemalloc
from pyrdp.convert.ReplayConverter import ReplayConverter
tracemalloc.start()
converter = ReplayConverter(Path('/home/olivier/Documents/gosecure/src/pyrdp/pyrdp_output/replays/rdp_replay_20210826_12-15-33_512_Stephen215343.pyrdp'),
'', 'mp4')
converter.process()
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
print("[ Top 10 ]")
for stat in top_stats[:10]:
    print(stat)
With the above, we get the top 3 changes to memory consumption in the event-processing tight loop, every 100 events. This gave the following trace (with many, many lines omitted for brevity):
[ 100 events later ... Top 3 memory differences: ]
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/ui/qt.py:62: size=7511 KiB (+1945 KiB), count=1894 (+496), average=4061 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/parser/rdp/orders/secondary.py:213: size=3753 KiB (+1458 KiB), count=947 (+248), average=4058 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/player/BaseEventHandler.py:234: size=47.6 KiB (+47.6 KiB), count=1 (+1), average=47.6 KiB
[ 100 events later ... Top 3 memory differences: ]
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/ui/qt.py:62: size=10.2 MiB (+2896 KiB), count=2628 (+734), average=4055 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/parser/rdp/orders/secondary.py:213: size=4963 KiB (+1210 KiB), count=1314 (+367), average=3868 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/player/BaseEventHandler.py:236: size=56.6 KiB (+56.6 KiB), count=1 (+1), average=56.6 KiB
Line 62 of qt.py is the bytearray creation for the QImage buffer, before it is processed by the rle C extension:
buf = bytearray(width * height * 2) # <-- this is line 62
rle.bitmap_decompress(buf, width, height, data, 2)
image = QImage(buf, width, height, QImage.Format_RGB16)
Why would this part of the code keep growing? If buf goes out of scope, it should be garbage collected, unless a reference to it is being held somewhere.
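One way to test the held-reference hypothesis from pure Python (an illustrative sketch using only the stdlib, not code from pyrdp):

```python
import gc
import sys

buf = bytearray(800 * 600 * 2)
holder = [buf]  # simulate a hidden reference, e.g. an entry in a drawing cache

# getrefcount() also counts the reference held by its own argument, so a
# buffer referenced by one name and one container reports 3.
print(sys.getrefcount(buf))

# get_referrers() lists the objects keeping buf alive; a container in this
# list that keeps growing would be the leak.
for ref in gc.get_referrers(buf):
    if isinstance(ref, list):
        print("buf is held by a list of length", len(ref))
```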
This rle C extension is part of the code that has been there forever, has bugs that we worked around in Python, and never got much scrutiny. It felt like this was the culprit. To make sure, I commented it out. I knew this would yield an empty video, but would memory consumption change?
It did:
[ 100 events later ... Top 3 memory differences: ]
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/convert/MP4EventHandler.py:150: size=61.3 KiB (+4256 B), count=998 (+67), average=63 B
/usr/lib/python3.9/tracemalloc.py:132: size=3648 B (+3648 B), count=1 (+1), average=3648 B
/usr/lib/python3.9/tracemalloc.py:136: size=0 B (-3648 B), count=0 (-1)
[ 100 events later ... Top 3 memory differences: ]
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/convert/MP4EventHandler.py:150: size=62.7 KiB (+1344 B), count=1017 (+19), average=63 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/ui/qt.py:64: size=12.2 KiB (+1344 B), count=223 (+24), average=56 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/convert/MP4EventHandler.py:149: size=32.1 KiB (+704 B), count=514 (+10), average=64 B
Qt is no longer a top memory changer and there are no signs of uncontrolled memory growth.
I read the documentation on Python C extensions and realized that we need to manage the lifetime of the buffer variables passed to the rle.bitmap_decompress() function:
However, when a Py_buffer structure gets filled, the underlying buffer is locked so that the caller can subsequently use the buffer even inside a Py_BEGIN_ALLOW_THREADS block without the risk of mutable data being resized or destroyed. As a result, you have to call PyBuffer_Release() after you have finished processing the data (or in any early abort case).
-- https://docs.python.org/3/c-api/arg.html#c.PyArg_ParseTuple
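The same locking is observable from pure Python: memoryview() fills a Py_buffer on the exporter, and the bytearray stays pinned until the view is released (an illustrative stdlib sketch):

```python
ba = bytearray(b"rle data")
view = memoryview(ba)  # fills a Py_buffer; ba is now locked

try:
    ba.extend(b"!")  # resizing an exported bytearray is forbidden
except BufferError:
    print("locked while a buffer export is outstanding")

view.release()  # the Python-level analogue of PyBuffer_Release()
ba.extend(b"!")  # allowed again once the buffer is released
print(ba)  # bytearray(b'rle data!')
```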
Adding PyBuffer_Release() on the buffers backing input and output led to the corrupted video issue: the memory increase disappears, but visible artifacts appear.
Bad:
Good:
I thought using a caller-supplied buffer was not Pythonic, so I wrote an alternate implementation of the bitmap_decompress function that returns a managed bytearray and treats input (data) as immutable. Here is the patch:
diff --git a/ext/rle.c b/ext/rle.c
index 17e3226..35dbdb2 100644
--- a/ext/rle.c
+++ b/ext/rle.c
@@ -24,6 +24,7 @@
/* indent is confused by this file */
/* *INDENT-OFF* */
+#define PY_SSIZE_T_CLEAN
#include <Python.h>
/* Specific rename for RDPY integration */
@@ -918,6 +919,41 @@ bitmap_decompress(uint8 * output, int width, int height, uint8* input, int size,
/* *INDENT-ON* */
+static PyObject*
+bitmap_decompress_wrapper_ng(PyObject* self, PyObject* args)
+{
+    unsigned char* input;
+    Py_ssize_t input_len;
+    int width = 0, height = 0, bpp = 0;
+    unsigned char* dest;
+    int dest_size;
+    PyObject* result;
+
+    if (!PyArg_ParseTuple(args, "y#iii", &input, &input_len, &width, &height, &bpp))
+        return NULL;
+
+    dest_size = width * height * bpp;
+    dest = PyMem_Malloc(dest_size);
+    if (dest == NULL)
+        return PyErr_NoMemory();
+
+    if (bitmap_decompress((uint8*)dest, width, height, (uint8*)input, input_len, bpp) == False)
+    {
+        /* Free the scratch buffer on the failure path so it doesn't leak. */
+        PyMem_Free(dest);
+        PyErr_SetString(PyExc_ValueError, "bitmap decompression failed");
+        return NULL;
+    }
+
+    /* PyByteArray_FromStringAndSize copies dest; on failure it sets
+       MemoryError and returns NULL, so result can be returned directly. */
+    result = PyByteArray_FromStringAndSize((const char*)dest, dest_size);
+    PyMem_Free(dest);
+
+    return result;
+}
+
static PyObject*
bitmap_decompress_wrapper(PyObject* self, PyObject* args)
{
@@ -936,6 +972,7 @@ bitmap_decompress_wrapper(PyObject* self, PyObject* args)
static PyMethodDef rle_methods[] =
{
{"bitmap_decompress", bitmap_decompress_wrapper, METH_VARARGS, "decompress bitmap from microsoft rle algorithm."},
+ {"bitmap_decompress_ng", bitmap_decompress_wrapper_ng, METH_VARARGS, "decompress bitmap from microsoft rle algorithm."},
{NULL, NULL, 0, NULL}
};
Remember that every time you make a change to ext/rle.c you need to recompile it with pip3 install -U -e '.[full]'. This alternate method doesn't work either, and leads to the exact same corruption as when I added PyBuffer_Release().
This is where I go crazy: I added code that ran both bitmap_decompress and bitmap_decompress_ng and checked whether the two resulting buffers were equal. They were always equal! Patch:
diff --git a/pyrdp/ui/qt.py b/pyrdp/ui/qt.py
index 05e07c4..64bc367 100644
--- a/pyrdp/ui/qt.py
+++ b/pyrdp/ui/qt.py
@@ -60,8 +60,15 @@ def RDPBitmapToQtImage(width: int, height: int, bitsPerPixel: int, isCompressed:
     elif bitsPerPixel == 16:
         if isCompressed:
             buf = bytearray(width * height * 2)
+            # original
             rle.bitmap_decompress(buf, width, height, data, 2)
-            image = QImage(buf, width, height, QImage.Format_RGB16)
+            #image = QImage(buf, width, height, QImage.Format_RGB16)
+            # new
+            buf2 = rle.bitmap_decompress_ng(data, width, height, 2)
+            if buf != buf2:
+                # never triggered
+                import ipdb; ipdb.set_trace()
+            image = QImage(buf2, width, height, QImage.Format_RGB16)
         else:
             image = QImage(data, width, height, QImage.Format_RGB16).transformed(QMatrix(1.0, 0.0, 0.0, -1.0, 0.0, 0.0))
See the resulting video. I can share the replay file privately if there's interest.
https://user-images.githubusercontent.com/546325/131692190-7f078508-f43e-4868-a5de-879e6ef230d4.mp4
The eventual release of the buffer by Python's GC seems to corrupt the GDI drawing cache. But how can this happen, given that it's a new object (QImage) that is returned for caching!? I also tried making an explicit full copy of the bytearray, and that didn't work either. I'm out of options.
Maybe the rle code is right and my changes introduced a small mistake. Maybe the leak isn't really a leak: the buffer is genuinely required by the GDI cache, and the real problem would then be in the GDI cache's logic.
The buffer must remain valid throughout the life of the QImage and all copies that have not been modified or otherwise detached from the original buffer. The image does not delete the buffer at destruction. You can provide a function pointer cleanupFunction along with an extra pointer cleanupInfo that will be called when the last copy is destroyed.
from: https://doc.qt.io/qtforpython-5/PySide2/QtGui/QImage.html
Tried to get the cleanupFunction working; it never worked, and all I managed to do was segfault the interpreter. I don't think it's usable from Python: the documentation seems auto-generated from the C++ library, so the mention is likely an artifact of that.
I added a reference to the buffer in the Bitmap cache and reverted the code to use the previous working rle function. Video renders are OK, but memory usage is still problematic.
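The keep-alive fix boils down to storing the backing buffer next to the QImage so both share a lifetime. A hypothetical sketch of the pattern (names and structure do not match pyrdp's actual BitmapCache):

```python
class KeepAliveBitmapCache:
    """Cache that keeps each image's backing buffer alive alongside it.

    QImage neither copies nor owns the buffer it is constructed from, so
    the buffer must stay referenced for as long as the image does.
    """

    def __init__(self):
        self._entries = {}

    def store(self, key, image, buf):
        # Overwriting a key drops the previous (image, buf) pair, letting
        # Python reclaim both once nothing else references them.
        self._entries[key] = (image, buf)

    def get(self, key):
        image, _buf = self._entries[key]
        return image
```

Storing the tuple is enough: dict assignment releases the old pair, so no explicit cleanup hook is needed.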
After looking for known QImage memory leaks for a while, I decided to do a minimal test:
import rle
import tracemalloc
from PySide2.QtGui import QImage

tracemalloc.start()
snapshot1 = tracemalloc.take_snapshot()
width, height = (800, 600)
for i in range(1000):
    buf = bytearray(width * height * 2)
    #rle.bitmap_decompress(buf, width, height, bytearray(120), 2)
    #image = QImage.fromData(buf, QImage.Format_RGB16)
    image = QImage(buf, width, height, QImage.Format_RGB16)
    if i % 100 == 0:
        snapshot2 = tracemalloc.take_snapshot()
        top_stats = snapshot2.compare_to(snapshot1, 'lineno')
        print("[ 100 events later ... Top 3 memory differences ]")
        for stat in top_stats[:3]:
            print(stat)
        snapshot1 = snapshot2
Without the rle call, the memory doesn't grow between the snapshots. However, with the rle call it does. We are back to a problem in rle.
Without rle:
[ 100 events later ... Top 3 memory differences ]
/usr/lib/python3.9/tracemalloc.py:135: size=0 B (-100 B), count=0 (-2)
/usr/lib/python3.9/tracemalloc.py:193: size=4224 B (+96 B), count=88 (+2), average=48 B
/usr/lib/python3.9/tracemalloc.py:125: size=2088 B (+72 B), count=29 (+1), average=72 B
[ 100 events later ... Top 3 memory differences ]
/usr/lib/python3.9/tracemalloc.py:135: size=200 B (+200 B), count=4 (+4), average=50 B
/usr/lib/python3.9/tracemalloc.py:115: size=2400 B (+80 B), count=30 (+1), average=80 B
/usr/lib/python3.9/tracemalloc.py:125: size=2016 B (-72 B), count=28 (-1), average=72 B
[ 100 events later ... Top 3 memory differences ]
/usr/lib/python3.9/tracemalloc.py:135: size=100 B (-100 B), count=2 (-2), average=50 B
/usr/lib/python3.9/tracemalloc.py:558: size=1416 B (-96 B), count=28 (-2), average=51 B
/usr/lib/python3.9/tracemalloc.py:126: size=56 B (+56 B), count=2 (+2), average=28 B
With rle:
[ 100 events later ... Top 3 memory differences ]
/home/olivier/gosecure/src/pyrdp/bin/qimage.py:18: size=642 MiB (+91.6 MiB), count=1402 (+200), average=469 KiB
/home/olivier/gosecure/src/pyrdp/bin/qimage.py:19: size=121 KiB (+17.3 KiB), count=1404 (+201), average=88 B
/usr/lib/python3.9/tracemalloc.py:558: size=1448 B (-96 B), count=29 (-2), average=50 B
[ 100 events later ... Top 3 memory differences ]
/home/olivier/gosecure/src/pyrdp/bin/qimage.py:18: size=733 MiB (+91.6 MiB), count=1602 (+200), average=469 KiB
/home/olivier/gosecure/src/pyrdp/bin/qimage.py:19: size=139 KiB (+17.2 KiB), count=1603 (+199), average=88 B
/usr/lib/python3.9/tracemalloc.py:126: size=196 B (+112 B), count=7 (+4), average=28 B
[ 100 events later ... Top 3 memory differences ]
/home/olivier/gosecure/src/pyrdp/bin/qimage.py:18: size=825 MiB (+91.6 MiB), count=1802 (+200), average=469 KiB
/home/olivier/gosecure/src/pyrdp/bin/qimage.py:19: size=156 KiB (+17.3 KiB), count=1804 (+201), average=88 B
/usr/lib/python3.9/tracemalloc.py:125: size=2232 B (+144 B), count=31 (+2), average=72 B
With the reference to the buffer in the BitmapCache, I tried rle.bitmap_decompress_ng again, hoping that holding that reference would prevent the corruption issues. It did! Videos are no longer corrupted!
Now, back to the initial memory issue: in the BitmapCache update I also added an explicit cache eviction on key updates. I'm not sure it's required. Let's benchmark.
Converting the Stephen replay with BitmapCache (QImage, buffer), the old rle code, and explicit eviction on duplication:
[ Top 10 ]
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/ui/qt.py:68: size=46.6 MiB, count=11910, average=4102 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/parser/rdp/orders/secondary.py:213: size=17.5 MiB, count=5955, average=3085 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/convert/MP4EventHandler.py:148: size=90.8 KiB, count=1449, average=64 B
/usr/lib/python3.9/tracemalloc.py:558: size=74.7 KiB, count=1430, average=54 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/player/gdi/cache.py:71: size=59.3 KiB, count=426, average=142 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/convert/MP4EventHandler.py:147: size=45.9 KiB, count=731, average=64 B
/usr/lib/python3.9/tracemalloc.py:193: size=45.6 KiB, count=973, average=48 B
/usr/lib/python3.9/tracemalloc.py:115: size=38.0 KiB, count=487, average=80 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/ui/qt.py:70: size=27.3 KiB, count=499, average=56 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/parser/rdp/fastpath.py:410: size=15.9 KiB, count=1, average=15.9 KiB
Converting the Stephen replay with BitmapCache (QImage, buffer), the new rle code, and explicit eviction on duplication:
[ Top 10 ]
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/ui/qt.py:72: size=4055 KiB, count=1005, average=4131 B
/usr/lib/python3.9/tracemalloc.py:558: size=67.1 KiB, count=1283, average=54 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/player/gdi/cache.py:71: size=59.2 KiB, count=425, average=143 B
/usr/lib/python3.9/tracemalloc.py:193: size=46.0 KiB, count=982, average=48 B
/usr/lib/python3.9/tracemalloc.py:115: size=37.9 KiB, count=485, average=80 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/convert/MP4EventHandler.py:148: size=27.4 KiB, count=436, average=64 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/ui/qt.py:73: size=27.3 KiB, count=499, average=56 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/parser/rdp/fastpath.py:410: size=15.9 KiB, count=1, average=15.9 KiB
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/convert/MP4EventHandler.py:147: size=13.8 KiB, count=219, average=65 B
/usr/lib/python3.9/abc.py:123: size=9766 B, count=97, average=101 B
A drastic reduction in the top 2 consumers: a 15x reduction in memory allocations for this conversion job!
Converting the Stephen replay with BitmapCache (QImage, buffer), the new rle code, and without the explicit eviction on duplication:
[ Top 10 ]
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/ui/qt.py:72: size=4055 KiB, count=1005, average=4131 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/convert/MP4EventHandler.py:148: size=90.8 KiB, count=1449, average=64 B
/usr/lib/python3.9/tracemalloc.py:558: size=74.3 KiB, count=1421, average=54 B
/usr/lib/python3.9/tracemalloc.py:193: size=45.9 KiB, count=980, average=48 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/convert/MP4EventHandler.py:147: size=45.9 KiB, count=731, average=64 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/player/gdi/cache.py:71: size=41.2 KiB, count=425, average=99 B
/usr/lib/python3.9/tracemalloc.py:115: size=37.7 KiB, count=483, average=80 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/ui/qt.py:73: size=27.3 KiB, count=499, average=56 B
/home/olivier/Documents/gosecure/src/pyrdp/pyrdp/parser/rdp/fastpath.py:410: size=15.9 KiB, count=1, average=15.9 KiB
/usr/lib/python3.9/abc.py:123: size=9759 B, count=97, average=101 B
Python does its job. Explicit eviction is not required. Dropping that code.
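This matches CPython's reference counting: assigning a new value to an existing dict key drops the dict's reference to the old value, so no manual eviction pass is needed. A stdlib illustration (not project code):

```python
import sys

cache = {}
old = bytearray(1024)
cache["tile"] = old

before = sys.getrefcount(old)
cache["tile"] = bytearray(1024)  # overwrite: the dict releases `old`
after = sys.getrefcount(old)

print(before - after)  # exactly one reference dropped by the overwrite
```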
The code is in #353. I'm testing with a jumbo replay now.
After 3h50m of processing, RES memory is still only at 685 MB. I think we plugged the leak.
However, there seems to be a regression in the CPU consumption...
A couple of minutes later, processing picked up and ETA looked better:
After 6 hours, the pace of processing events picked up. Memory usage is still low. Partial video viewable for at least 3 hours.
The conversion of an 11-hour-long replay to MP4 completed successfully this morning. I can watch the video and seek through it. The conversion took 13.75 hours (21 CPU hours!) and I never saw RAM usage go beyond 685 MB. The capture is 973 MB and the resulting video is 3.4 GB.
The player seems to have benefited from these improvements as well. However, seeking past 6h can get slow, up to recoverable UI freezes. Once caught up, scaled playback at 10x speed was smooth. RAM usage stayed below 400 MB.
Fixed with the merge of #353.
I can reliably make the pyrdp-convert.py tool go out of memory. My OOM logs say:
So that's 10 GB of RAM for a 1 GB replay file:
@alxbl created patches to mitigate these issues. I'll take a look at them.
Additionally, the converted MP4 is unplayable: its moov atom was not written out to disk, so the resulting file is useless. I will try to address that problem separately by using a streaming container format instead of a fixed MP4.
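For reference, the mov/mp4 muxer fragmentation described in the ffmpeg docs quoted at the top can be enabled from the command line with movflags; a hedged sketch (file names are placeholders, and -c copy just remuxes without re-encoding):

```shell
# frag_keyframe starts a new fragment at each keyframe, and empty_moov
# writes an initial empty moov atom at the start of the file, so the
# output stays decodable even if writing is interrupted.
ffmpeg -i input.mp4 -c copy -movflags +frag_keyframe+empty_moov fragmented.mp4
```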