Bladieblah / xpdf-python

Python wrapper around the pdftotext functionality of xpdf
GNU General Public License v3.0

Memory Leak in PdfLoader #23

Open ReMiOS opened 11 months ago

ReMiOS commented 11 months ago

I've noticed a fast increase in memory usage when using PdfLoader. The minimal code below reproduces the issue. On both Windows and Ubuntu I noticed similar results.

I've tried using the garbage collector to free things up, but without result.

Could you please advise on how to free the memory after a PdfLoader has been created?

Minimal code to reproduce:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from xpydf.pdf_loader import PdfLoader

while True:
    loader = PdfLoader('Test.pdf')

Below are the results on Windows: screenshot 1 with just Python loaded, screenshot 2 when the test starts, screenshot 3 after 20 seconds (attachment: Memory_Leak).

Same results on Ubuntu (but growing less quickly, since it's a light system with a slower CPU):

        RSS       SZ      VSZ
        720     1298     5192
       3040     1891     7564
     20.080   47.376  189.504   <- test starts
     28.024   49.878  199.512
     32.556   50.596  202.384
     32.820   50.695  202.780
     33.084   50.761  203.044
     .. snip ...
    240.356  102.577  410.308
    240.620  102.643  410.572
    240.884  102.742  410.968
    241.412  102.808  411.232
    241.676  102.907  411.628   <- after 3 minutes

RSS: resident set size, the non-swapped physical memory that a task has used (in kilobytes).
SZ: size in physical pages of the core image of the process; this includes text, data, and stack space.
VSZ: virtual memory size of the process in KiB (1024-byte units).

ReMiOS commented 11 months ago

I've done some further analysis but haven't found the cause yet...

>>> from psutil import Process
>>> from xpydf.pdf_loader import PdfLoader
>>> import gc
>>> Process().memory_info().rss
32260096
>>> loader = PdfLoader('Test.pdf')
>>> Process().memory_info().rss
34557952
>>> del loader
>>> Process().memory_info().rss
34557952
>>> gc.collect()
0
>>> Process().memory_info().rss
34623488    (=> +2,363,392 bytes)
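A rough cross-check sketch, assuming the same Test.pdf as above: the standard-library tracemalloc module only sees allocations made through Python's allocator, so if the process RSS keeps growing while tracemalloc reports essentially no growth, the leak is most likely on the native (C++) side rather than in Python objects.

import tracemalloc

from xpydf.pdf_loader import PdfLoader

tracemalloc.start()
before = tracemalloc.take_snapshot()

for _ in range(1000):
    loader = PdfLoader('Test.pdf')
    del loader

after = tracemalloc.take_snapshot()

# Largest Python-level allocation differences; a native-side leak will not
# show up here even though the process RSS keeps growing.
for stat in after.compare_to(before, 'lineno')[:10]:
    print(stat)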

Bladieblah commented 11 months ago

I did some pretty extensive searching for memory leaks back in April, and I'm 100% sure there are none in the C++ part of the implementation. I was never able to find a leak in the wrapper either, though I did have suspicions there might be one. When I have time I'll have another look, though! If you want to continue testing, I can recommend Scalene for memory profiling.

ReMiOS commented 11 months ago

Thanks for the tip, I will look into Scalene.

I suspect the numpy usage in pdf_loader.py causes the memory growth, but I am not sure.

Update: I've removed all numpy parts (including the extractImages and pageToImage parts) from the source for testing, and re-compiled and re-installed, but this does not fix the issue. I will investigate further...

ReMiOS commented 11 months ago

Unfortunately, Scalene does not support memory profiling on Windows, but on Ubuntu I got the following results:

~/xpdf-python# scalene test.py
                       Memory usage: ▁▁▂▂▂▂▃▃▃▃▄▄▅▅▅▅▅▅▆▆▆▆█████ (max: 284.767 MB, growth rate: 101%)
                       /root/xpdf-python/test.py: % of time = 100.00% (2m:56.853s) out of 2m:56.853s.
       ╷       ╷       ╷       ╷        ╷       ╷               ╷       ╷
       │Time   │–––––– │–––––– │Memory  │–––––– │–––––––––––    │Copy   │
  Line │Python │native │system │Python  │peak   │timeline/%     │(MB/s) │/root/xpdf-python/test.py
╺━━━━━━┿━━━━━━━┿━━━━━━━┿━━━━━━━┿━━━━━━━━┿━━━━━━━┿━━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸
     1 │       │       │       │        │       │               │       │#!/usr/bin/env python3
     2 │       │       │       │        │       │               │       │# -*- coding: utf-8 -*-
     3 │       │       │       │        │       │               │       │
     4 │       │       │       │        │       │               │       │import time
     5 │       │    1% │       │ 100%   │   10M │▁   4%         │       │from xpydf.pdf_loader import PdfLoader
     6 │       │       │       │        │       │               │       │
     7 │       │       │       │        │       │               │       │idx=0
     8 │       │       │       │        │       │               │       │while True and idx < 100000:
     9 │   58% │    5% │  36%  │   1%   │  272M │▁▂▂▃▄▅▆▇█  96% │     6 │    loader =  PdfLoader( 'Test.pdf' )
    10 │       │       │       │        │       │               │       │    idx += 1
    11 │       │       │       │        │       │               │       │
       ╵       ╵       ╵       ╵        ╵       ╵               ╵       ╵
Top AVERAGE memory consumption, by line:
(1)     9:   272 MB
Top PEAK memory consumption, by line:
(1)     9:   272 MB
(2)     5:    10 MB
Possible memory leak identified at line 9 (estimated likelihood:  95%, velocity:   2 MB/s)

ReMiOS commented 10 months ago

Hi Matthijs,

I am a bit stuck on finding out why there's a memory increase when calling PdfLoader multiple times (partly because I have little experience with C++). Do you have any tips for me?

Is there a way to check the memory before and after calling the "deleteObject" method in the PdfLoaderWrapper?

Update: I've added a "cout" to the PdfLoaderWrapper "deleteObject" method to see whether the delete is actually called:

    if (loader) {
        delete loader;
        cout << "DELETE_LOADER\n";
    }

    void *context = PyCapsule_GetContext(loaderCapsule);

    if (context) {
        delete context;
        cout << "DELETE CONTEXT\n";
    }

>>> loader = PdfLoader('Test.pdf')
>>> Process().memory_info().rss
34971648
>>> del loader
DELETE_LOADER
>>> Process().memory_info().rss
34992128

I've noticed a memory increase after the first time the loader is deleted. The "delete context" isn't called. Strangely enough, removing the "delete loader" statement does not affect the memory usage either...

ReMiOS commented 10 months ago

Just a thought: maybe a "PyCapsule_Destructor" could be a solution (but it seems to make no difference):

// Deallocator for the PyCapsule.
void free_capsule(PyObject *loaderCapsule) {
    PdfLoader *loader = (PdfLoader *)PyCapsule_GetPointer(loaderCapsule, "loaderPtr");
    // Use PyMem_Free to free the allocated memory.
    // (Note: PyMem_Free does not run C++ destructors, so if the loader was
    // created with `new`, `delete loader` would be needed instead.)
    PyMem_Free(loader);
}

PyObject *loaderCapsule = PyCapsule_New((void *)loader, "loaderPtr", free_capsule);

Bladieblah commented 9 months ago

Alrighty, I expanded your script a bit; see the attached test_memory.txt.

And this is the output:

Iterations    Objects    Physical mem    Virtual mem
         0      24325        44777472   419310321664
       200      24326        46153728   419587145728
       400      24327        47267840   419721363456
       600      24328        47874048   419721363456
       800      24329        48398336   419721363456
      1000      24330        48906240   419721363456
      1200      24331        49676288   419721363456
      1400      24332        50184192   419721363456
      1600      24333        50626560   419721363456
      1800      24334        51150848   419721363456
      2000      24335        51609600   419721363456
      2200      24336        52084736   419721363456
      2400      24337        53100544   419721363456
      2600      24338        53559296   419721363456
      2800      24339        54034432   419721363456
      3000      24340        54525952   419721363456
      3200      24341        55001088   419721363456
      3400      24342        55525376   419721363456
      3600      24343        56000512   419721363456
      3800      24344        56492032   419721363456
      4000      24345        56918016   419721363456
      4200      24346        57442304   419721363456
      4400      24347        57933824   419721363456
      4600      24348        58425344   419721363456
      4800      24349        59949056   419721363456
      5000      24350        60506112   419721363456
      5200      24351        60964864   419721363456
      5400      24352        61440000   419721363456
      5600      24353        61915136   419721363456
      5800      24354        62390272   419721363456
      6000      24355        62881792   419721363456
      6200      24356        63455232   419721363456
      6400      24357        63930368   419721363456
      6600      24358        64389120   419721363456
      6800      24359        64913408   419721363456
      7000      24360        65372160   419721363456
      7200      24361        65863680   419721363456
      7400      24362        66338816   419721363456
      7600      24363        66797568   419721363456
      7800      24364        67289088   419721363456
      8000      24365        67796992   419721363456
      8200      24366        68272128   419721363456
      8400      24367        68747264   419721363456
      8600      24368        69206016   419721363456
      8800      24369        69697536   419721363456
      9000      24370        70172672   419721363456
      9200      24371        70680576   419721363456
      9400      24372        73252864   419721363456
      9600      24373        73760768   419721363456
      9800      24374        74252288   419721363456
     10000      24375        74711040   419721363456
     10001      24041       119554048   419754065920

As you can see, memory usage definitely increases with the number of iterations. However, the number of allocated objects remains constant: the single increase in the object count for each row is actually the table itself growing (I'm 99% sure). That suggests there are no NEW Python objects left over after each iteration.
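For reference, a loop of roughly this shape produces that kind of table (a rough sketch only, not the actual attached test_memory.txt; it assumes psutil is installed and a local Test.pdf). It samples the live Python object count and the process memory every 200 iterations:

import gc

from psutil import Process
from xpydf.pdf_loader import PdfLoader

proc = Process()
print(f"{'Iterations':>10} {'Objects':>10} {'Physical mem':>15} {'Virtual mem':>15}")

for i in range(10001):
    loader = PdfLoader('Test.pdf')
    del loader

    if i % 200 == 0:
        gc.collect()  # make sure no collectable Python garbage is pending
        mem = proc.memory_info()
        # len(gc.get_objects()) counts live Python objects; a native-side leak
        # grows rss/vms without growing this count.
        print(f"{i:>10} {len(gc.get_objects()):>10} {mem.rss:>15} {mem.vms:>15}")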

From both observations there are 3 possible conclusions:

  1. The increased memory usage is not allocated memory but reserved memory. This would mean that the program is not reusing previously freed memory for new allocations and instead keeps requesting new memory from the OS, which would obviously be really weird (one way to probe this is sketched right after this list).
  2. There is a memory leak outside of Python (in the C++ part), so we cannot detect it with this script. That's also unlikely, because I previously checked that there is no leak by running only the C++ part of the code. I guess I'll have to check again to rule this one out.
  3. I just have no clue what I'm doing and all of the above is meaningless. I hope this isn't the case 🥲
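One way to probe the first possibility, as a rough sketch: on glibc-based Linux systems, malloc_trim(0) asks the allocator to return freed-but-cached memory to the OS. If RSS drops sharply after calling it, the growth was allocator caching rather than a true leak; if RSS barely moves, the memory really is still allocated. It can be called from Python via ctypes (assuming psutil and a local Test.pdf):

import ctypes

from psutil import Process
from xpydf.pdf_loader import PdfLoader

libc = ctypes.CDLL("libc.so.6")  # glibc only

for _ in range(5000):
    loader = PdfLoader('Test.pdf')
    del loader

print("RSS before malloc_trim:", Process().memory_info().rss)
libc.malloc_trim(0)  # returns nonzero if memory was released back to the OS
print("RSS after malloc_trim: ", Process().memory_info().rss)
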
Bladieblah commented 9 months ago

Well shit, looks like there IS actually a leak in the C++ code 😭

Back to digging I guess

ReMiOS commented 9 months ago

Even more strange: running your test_memory.txt on both Windows and Ubuntu, I notice an increase in Process().memory_info().vms.

Attachments: testmem_ubuntu.txt, testmem_windows.txt

ReMiOS commented 9 months ago

Maybe this helps ...

I've added some output to PdfLoader.cc and PdfLoaderWrapper.cc. I notice the "delete context" in PdfLoaderWrapper is never called, and I also haven't found a way to check whether "delete" actually removes the object:

>>> loader = PdfLoader( 'Test.pdf' )
PdfLoader create doc: 000001D1CFBC4140
PdfLoaderWrapper/construct create loader:  000001D1C13C7030
PdfLoaderWrapper/construct create loaderCapsule: <capsule object "loaderPtr" at 0x000001D1D00BECF0>
>>> del loader
PdfLoaderWrapper/deleteObject:
Delete loader: 000001D1C13C7030
~PdfLoader delete doc: 000001D1CFBC4140
Number / pointer self capsule:   9 / <module 'cXpdfPython' from 'C:\\Python3\\lib\\site-packages\\cXpdfPython.cp39-win_amd64.pyd'>
Number / pointer self loaderCapsule:   3 / <capsule object "loaderPtr" at 0x000001D1D00BECF0>

Bladieblah commented 9 months ago

ASAN output:



Direct leak of 3200 byte(s) in 200 object(s) allocated from:
    #0 0xffff9a67454c in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0xaaaab88b6630 in PdfLoader::PdfLoader(LoaderConfig, char*, char*, char*) src/xpydf/PdfLoader.cc:76
    #2 0xaaaab88b6630 in main src/xpydf/PdfLoader.cc:269
    #3 0xffff9a1d9e14 in __libc_start_main (/lib/aarch64-linux-gnu/libc.so.6+0x20e14)
    #4 0xaaaab88b7878  (/app/a.out+0xf7878)

Indirect leak of 6400 byte(s) in 200 object(s) allocated from:
    #0 0xffff9a67468c in operator new[](unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:102
    #1 0xaaaab8b37888 in GString::resize(int) src/xpdf-4.04/goo/GString.cc:119
    #2 0xaaaab8b37888 in GString::GString(char const*) src/xpdf-4.04/goo/GString.cc:143
    #3 0xaaaab88b663c in PdfLoader::PdfLoader(LoaderConfig, char*, char*, char*) src/xpydf/PdfLoader.cc:76
    #4 0xaaaab88b663c in main src/xpydf/PdfLoader.cc:269
    #5 0xffff9a1d9e14 in __libc_start_main (/lib/aarch64-linux-gnu/libc.so.6+0x20e14)
    #6 0xaaaab88b7878  (/app/a.out+0xf7878)

Bladieblah commented 9 months ago

I reran the memory profiling in a Linux container, and there it doesn't show the increased peak memory usage...

Bladieblah commented 9 months ago

The script in test.py also doesn't show a leak in the container... Now I'm really stumped.

ReMiOS commented 9 months ago

Just guessing: could it be caused by the Python version (or other libraries)?

On Windows I've tested using v3.9.13 and v3.11.7; my Ubuntu distro has v3.8.10.

Bladieblah commented 9 months ago

I've done all my testing on 3.10.12 on Linux and 3.10.13 on macOS, with numpy v1.26.3 on both. The issue really seems to be OS-related, because I did peak memory benchmarking using only the C++ code.

Bladieblah commented 9 months ago

Did you also encounter the issue on Ubuntu? I checked peak memory usage using /usr/bin/time (install it with apt-get install time if you don't have it).

Add this function at the bottom of PdfLoader.cc:

int main() {
  LoaderConfig cfg;
  char fn[100] = "tests/test_data/xpdf_tests.pdf";

  for (int i=0; i<2000; i++) {
    // std::cerr << "Iteration " << i << std::endl;
    PdfLoader l(cfg, fn);
  }

  if (globalParams) {
    delete globalParams;
  }

  return 0;
}

Then compile with make and run with /usr/bin/time -v ./a.out. The line you're looking for is "Maximum resident set size (kbytes)". If it increases when you increase the number of iterations in the for loop, that means there is a memory leak.
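As a rough cross-check from the Python side (a sketch only, assuming a local Test.pdf), the standard-library resource module exposes the same peak figure that /usr/bin/time reports, so the wrapper can be measured with the same metric as the bare C++ loop:

import resource

from xpydf.pdf_loader import PdfLoader

for _ in range(2000):
    loader = PdfLoader('Test.pdf')
    del loader

# ru_maxrss is the peak resident set size: kilobytes on Linux (bytes on macOS),
# comparable to /usr/bin/time's "Maximum resident set size (kbytes)".
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)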

ReMiOS commented 9 months ago

Am I missing something?

/xpdf-python-main# make
g++ -std=c++17 -c -O3 -Wall -MD -Isrc/xpdf-4.04 -Isrc/xpdf-4.04/goo -Isrc/xpdf-4.04/splash -Isrc/xpdf-4.04/xpdf -Isrc/xpdf-4.04/fofi -I/usr/local/include -Isrc/freetype/include  -Isrc/freetype/include/freetype -o obj/PdfLoader.o src/xpydf/PdfLoader.cc
src/xpydf/PdfLoader.cc:273:1: fatal error: opening dependency file obj/PdfLoader.d: No such file or directory
  273 | }
      | ^
compilation terminated.
make: *** [Makefile:142: obj/PdfLoader.o] Error 1
Bladieblah commented 9 months ago

Do you have an obj folder in your project root?

ReMiOS commented 9 months ago

No, just the source from https://github.com/Bladieblah/xpdf-python and the freetype files in xpdf-python-main\src\freetype\

Bladieblah commented 9 months ago

In that case, creating the folder should fix it.

ReMiOS commented 9 months ago

I got the following output, but I am not sure what to make of it... Does this mean the memory leak is not in the wrapper (or Python) part?

After 2,000 iterations:

    /xpdf-python-main# /usr/bin/time -v ./a.out
        Command being timed: "./a.out"
        User time (seconds): 1.46
        System time (seconds): 0.83
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.33
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 12036
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 2154
        Voluntary context switches: 1
        Involuntary context switches: 277
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

After 10,000 iterations:

    /xpdf-python-main# /usr/bin/time -v ./a.out
        Command being timed: "./a.out"
        User time (seconds): 8.31
        System time (seconds): 3.79
        Percent of CPU this job got: 98%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:12.32
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 41460
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 9780
        Voluntary context switches: 1
        Involuntary context switches: 1940
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
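Taking only the two peak figures above, the pure C++ loop's peak RSS grows by roughly 3.7 KiB per iteration:

# Implied per-iteration growth from the two /usr/bin/time runs above
# (maximum resident set size is reported in kbytes).
max_rss_2000 = 12036
max_rss_10000 = 41460
print((max_rss_10000 - max_rss_2000) / (10000 - 2000), "KiB per iteration")  # ~3.7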

Bladieblah commented 9 months ago

Interesting, that's way more than I got in the Docker container. You could try running it with AddressSanitizer; in that case you can keep the number of iterations low (~50) for the test.

ReMiOS commented 9 months ago

ASAN did not report an error

By commenting out "delete textFileName" it appears to be working

==1681768==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 800 byte(s) in 50 object(s) allocated from:
    #0 0x7f735f713587 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cc:104
    #1 0x55aa9d4f7cb5 in PdfLoader::PdfLoader(LoaderConfig, char*, char*, char*) src/xpydf/PdfLoader.cc:77
    #2 0x55aa9d4f7cb5 in main src/xpydf/PdfLoader.cc:268
    #3 0x7f735f0ea082 in __libc_start_main ../csu/libc-start.c:308

Indirect leak of 1600 byte(s) in 50 object(s) allocated from:
    #0 0x7f735f713787 in operator new[](unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cc:107
    #1 0x55aa9d815df4 in GString::resize(int) src/xpdf-4.04/goo/GString.cc:119
    #2 0x55aa9d815df4 in GString::GString(char const*) src/xpdf-4.04/goo/GString.cc:143
    #3 0x55aa9d4f7cc7 in PdfLoader::PdfLoader(LoaderConfig, char*, char*, char*) src/xpydf/PdfLoader.cc:77
    #4 0x55aa9d4f7cc7 in main src/xpydf/PdfLoader.cc:268
    #5 0x7f735f0ea082 in __libc_start_main ../csu/libc-start.c:308

ReMiOS commented 9 months ago

I've added a memory counter in the PdfLoader destructor:

#include "windows.h"
#include "psapi.h"
#include <iostream>

using namespace std;

..snip..

PdfLoader::~PdfLoader() {
  delete textFileName;
  delete doc;
  PROCESS_MEMORY_COUNTERS_EX pmc;
  GetProcessMemoryInfo(GetCurrentProcess(), (PROCESS_MEMORY_COUNTERS*)&pmc, sizeof(pmc));
  // PrivateUsage is the process's private committed memory (the commit charge),
  // not strictly the resident set size, but it is good enough to track growth.
  SIZE_T virtualMemUsedByMe = pmc.PrivateUsage;
  cout << "RSS is " << virtualMemUsedByMe << "\n";

  Object::memCheck(stderr);
  gMemReport(stderr);
}

Output:


>>> loader = PdfLoader( 'Test.pdf' )
>>> del loader
RSS is 257441792
>>> loader = PdfLoader( 'Test.pdf' )
>>> del loader
RSS is 257650688
>>> loader = PdfLoader( 'Test.pdf' )
>>> del loader
RSS is 257818624

Approximately 160 to 200 KB is added every time PdfLoader is called.

ReMiOS commented 9 months ago

Maybe this helps: when I run Valgrind as below, I see the following blocks recurring.

    valgrind --leak-check=full --track-origins=yes /usr/bin/python3 ./test_memory.py > test2.txt 2>&1

 ==>   Invalid read of size 4
 ==>      at 0x5CBFFC: ??? (in /usr/bin/python3.8)
 ==>      by 0x67A6CD: PyImport_Cleanup (in /usr/bin/python3.8)
 ==>      by 0x67423E: Py_FinalizeEx (in /usr/bin/python3.8)
 ==>      by 0x6B418C: Py_RunMain (in /usr/bin/python3.8)
 ==>      by 0x6B43FC: Py_BytesMain (in /usr/bin/python3.8)
 ==>      by 0x4890082: (below main) (libc-start.c:308)
 ==>    Address 0x52f2020 is 16 bytes before a block of size 2 free'd
 ==>      at 0x483CA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
 ==>      by 0x57F6CA8: Object::free() (Object.cc:141)
 ==>      by 0x57FC2B5: Parser::shift() (Parser.cc:310)
 ==>      by 0x57FC4D2: Parser::getObj(Object*, int, unsigned char*, CryptAlgorithm, int, int, int, int) (Parser.cc:141)
 ==>      by 0x57FC7B9: Parser::getObj(Object*, int, unsigned char*, CryptAlgorithm, int, int, int, int) (Parser.cc:93)
 ==>      by 0x583336C: XRef::fetch(int, int, Object*, int) (XRef.cc:1247)
 ==>      by 0x579F79D: getCatalog (XRef.h:95)
 ==>      by 0x579F79D: Catalog::Catalog(PDFDoc*) (Catalog.cc:167)
 ==>      by 0x57FE907: PDFDoc::setup2(GString*, GString*, int) (PDFDoc.cc:318)
 ==>      by 0x57FEA28: PDFDoc::setup(GString*, GString*) (PDFDoc.cc:271)
 ==>      by 0x57FED2A: PDFDoc::PDFDoc(char*, GString*, GString*, PDFCore*) (PDFDoc.cc:218)
 ==>      by 0x578B45F: PdfLoader::PdfLoader(LoaderConfig, char*, char*, char*) (PdfLoader.cc:86)
 ==>      by 0x578A10B: construct(_object*, _object*) (PdfLoaderWrapper.cc:41)
 ==>    Block was alloc'd at
 ==>      at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
 ==>      by 0x58554AD: gmalloc(int) (gmem.cc:148)
 ==>      by 0x58556E4: copyString(char const*) (gmem.cc:393)
 ==>      by 0x57F32F2: initCmd (Object.h:111)
 ==>      by 0x57F32F2: Lexer::getObj(Object*) (Lexer.cc:417)
 ==>      by 0x57FC3A1: Parser::getObj(Object*, int, unsigned char*, CryptAlgorithm, int, int, int, int) (Parser.cc:146)
 ==>      by 0x57FC499: Parser::getObj(Object*, int, unsigned char*, CryptAlgorithm, int, int, int, int) (Parser.cc:71)
 ==>      by 0x57FC7B9: Parser::getObj(Object*, int, unsigned char*, CryptAlgorithm, int, int, int, int) (Parser.cc:93)
 ==>      by 0x583336C: XRef::fetch(int, int, Object*, int) (XRef.cc:1247)
 ==>      by 0x579F79D: getCatalog (XRef.h:95)
 ==>      by 0x579F79D: Catalog::Catalog(PDFDoc*) (Catalog.cc:167)
 ==>      by 0x57FE907: PDFDoc::setup2(GString*, GString*, int) (PDFDoc.cc:318)
 ==>      by 0x57FEA28: PDFDoc::setup(GString*, GString*) (PDFDoc.cc:271)
 ==>      by 0x57FED2A: PDFDoc::PDFDoc(char*, GString*, GString*, PDFCore*) (PDFDoc.cc:218)

ReMiOS commented 8 months ago

It appears that on a Windows system it does not make a difference whether the delete statements are called or not...

I've called "loader = PdfLoader('Test.pdf')" a number of times and looked at the RSS memory:

Standard source. Memory RSS after each iteration / growth in RSS memory:

    32985088 / not loaded
    35459072 / 2473984
    35684352 / 225280
    35852288 / 167936
    36020224 / 167936
    36143104 / 122880
    36343808 / 200704
    36560896 / 217088
    36704256 / 143360

With "delete loader" and "delete context" removed from the deleteObject method in PdfLoaderWrapper.cc. Memory RSS after each iteration / growth in RSS memory:

    33079296 / not loaded
    35549184 / 2469888
    35766272 / 217088
    35983360 / 217088
    36147200 / 163840
    36270080 / 122880
    36532224 / 262144
    36741120 / 208896

Is there another method to delete the object from memory? (I tried Py_DECREF and Py_CLEAR, but I'm not sure I've done this right.)

Bladieblah commented 7 months ago

I ran into some segfault issues with Py_DECREF and Py_CLEAR. I think I'll need a Python debug build to do a full memory profile, but even just getting the debug build up and running is a lot of work.

ReMiOS commented 7 months ago

It would be great if you could find anything to solve this.

I've tried the pdb module but haven't found anything useful yet; I also do not completely understand the Py_REFCNT values with Py_DECREF and Py_CLEAR...
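As a rough sketch (assuming a local Test.pdf): from pure Python, sys.getrefcount is the counterpart of inspecting Py_REFCNT in the extension, and it can at least confirm whether anything else is still holding a reference to the loader before the del:

import sys

from xpydf.pdf_loader import PdfLoader

loader = PdfLoader('Test.pdf')

# getrefcount reports one extra reference for its own call argument, so a
# value of 2 means `loader` is the only remaining reference and `del loader`
# will drop the count to zero and trigger deallocation immediately.
print(sys.getrefcount(loader))

del loader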