WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License

memory usage increases after fork the process #112

Open gladtosee opened 5 years ago

gladtosee commented 5 years ago

@WojciechMula Hi, pyahocorasick has been working well for me, but I have run into a problem.

After forking, minor page faults occur in the child process and its memory usage grows because of copy-on-write (https://en.wikipedia.org/wiki/Copy-on-write). I called gc.freeze() before forking (https://docs.python.org/3/library/gc.html#gc.freeze), but the page faults still occur.

What should I do??

I recorded minor faults with perf (perf record -e minor-faults -g -p PID) and got the following results.

In trace_begin:

In trace_end:                                               

There is 386377 records in gen_events table                 
Statistics about the general events grouped by thread/symbol/dso: 

            comm   number        histogram                  
==========================================                  
          python   340070     ###################           
         python3    46307     ################              

                          symbol   number        histogram  
==========================================================  
      automaton_search_iter_next   242330     ################## 
          automaton_build_output    30382     ############### 
                      do_mktuple    23836     ############### 
                 PyObject_Malloc    17587     ###############  
                     _int_malloc    14089     ##############
               trienode_get_next    10282     ##############
                 PyMember_GetOne    10084     ##############
        _PyEval_EvalFrameDefault     7855     ############# 
        lookdict_unicode_nodummy     5679     ############# 
                      do_mkvalue     5643     ############# 
                  dict_subscript     3694     ############  
                         collect     1668     ###########   
                    visit_decref     1440     ###########   
          PyObject_GetAttrString     1376     ###########   
                   PyMem_Realloc     1178     ###########   
                pymalloc_realloc      837     ##########    
                  bytearray_init      830     ##########    
                    PyMem_Malloc      728     ##########    
                   PyList_Append      632     ##########    
                  _PyList_Extend      627     ##########    
                   dict_traverse      626     ##########    
                  tupleiter_next      564     ##########    
              malloc_consolidate      412     #########     
       PyBytes_FromStringAndSize      375     #########     
                   List_iterNext      271     #########     
            stringlib_bytes_join      225     ########      
                         set_add      219     ########      
                     PyTuple_New      213     ########      
                    PyMem_Calloc      211     ########      
         Object_beginTypeContext      176     ########      
            _PyFrame_New_NoTrack      171     ########      
             PyObject_GC_UnTrack      128     ########      
                        dict_get      116     #######       
                       sysmalloc       99     #######       
                PyObject_GetAttr       81     #######       
                   PyObject_Free       71     #######       
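
The situation reported above (freeze, fork, then minor faults in the child) can be reproduced without pyahocorasick. The following is a minimal sketch, assuming CPython 3.7+ on a POSIX system; the helper name `child_minor_faults_after_fork` is hypothetical and not part of any library. It forks after gc.freeze(), lets the child merely iterate over objects created in the parent, and reports how many minor faults that caused in the child.

```python
import gc
import os
import resource


def child_minor_faults_after_fork(objects):
    """Fork, iterate over `objects` in the child, return the child's minor-fault delta.

    Hypothetical helper for illustration only.
    """
    # As in the report: freeze before forking so the garbage collector
    # does not scan (and write to) the already-existing objects in the child.
    gc.freeze()
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child process
        status = 1
        try:
            os.close(read_fd)
            before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
            for obj in objects:
                # Iterating takes a new reference to each object, which
                # writes to its reference count field inside the shared page.
                pass
            after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
            os.write(write_fd, str(after - before).encode())
            os.close(write_fd)
            status = 0
        finally:
            os._exit(status)  # never fall through into the parent's code path
    # parent process
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        data = reader.read()
    os.waitpid(pid, 0)
    return int(data)


if __name__ == "__main__":
    payload = [str(i) * 50 for i in range(200_000)]
    print("minor faults in child:", child_minor_faults_after_fork(payload))
```

gc.freeze() only keeps the collector away from the frozen objects; it does not stop ordinary reference counting, so read-only access in the child can still dirty shared pages.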

gladtosee commented 5 years ago

FYI https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172 https://instagram-engineering.com/copy-on-write-friendly-python-garbage-collection-ad6ed5233ddf

WojciechMula commented 5 years ago

@gladtosee To be honest, I wasn't aware of this problem; you are the first to mention it. I need to learn a bit more about this issue. Thanks for these articles.

gladtosee commented 5 years ago

@WojciechMula After the data is loaded in the master process and the process forks, the child process increments reference counts on the shared objects, which triggers copy-on-write. This happens because Py_BuildValue() with the "O" format increments the object's reference count before automaton_build_output returns.

#define Py_INCREF(op) (                         \
    _Py_INC_REFTOTAL  _Py_REF_DEBUG_COMMA       \
    ((PyObject *)(op))->ob_refcnt++)
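
The reference count change can also be observed from Python. This is a minimal sketch, assuming CPython, where sys.getrefcount() reports the value of the ob_refcnt field; the function name is hypothetical. Placing an existing object into a new tuple, which is roughly what Py_BuildValue("iO", ...) does, raises that object's count by one, and that write lands in the object's own memory.

```python
import sys


def refcount_delta_when_wrapped(value):
    """Return how much value's reference count grows when it is put in a tuple.

    Hypothetical helper for illustration only.
    """
    before = sys.getrefcount(value)
    result = (0, value)  # the new tuple holds its own reference to `value`
    after = sys.getrefcount(value)
    del result  # dropping the tuple releases that reference again
    return after - before


if __name__ == "__main__":
    # Use a freshly created object: some built-in singletons have a fixed
    # reference count in recent CPython versions.
    print(refcount_delta_when_wrapped(object()))
```

The count lives in the object header, so in a forked child this increment is enough to make the kernel copy the page that holds the object.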

If I modify the code like this, copy-on-write no longer happens:

//copy from cpython source - https://github.com/python/cpython/blob/v3.7.3/Objects/unicodeobject.c#L2380
PyObject*
_PyUnicode_Copy(PyObject *unicode)
{
    Py_ssize_t length;
    PyObject *copy;

    if (!PyUnicode_Check(unicode)) {
        PyErr_BadInternalCall();
        return NULL;
    }
    if (PyUnicode_READY(unicode) == -1)
        return NULL;

    length = PyUnicode_GET_LENGTH(unicode);
    copy = PyUnicode_New(length, PyUnicode_MAX_CHAR_VALUE(unicode));
    if (!copy)
        return NULL;
    assert(PyUnicode_KIND(copy) == PyUnicode_KIND(unicode));

    memcpy(PyUnicode_DATA(copy), PyUnicode_DATA(unicode),
           length * PyUnicode_KIND(unicode));
//    assert(_PyUnicode_CheckConsistency(copy, 1));
    return copy;
}

static int automaton_build_output(PyObject* self, PyObject** result);

case STORE_ANY:
    if(PyUnicode_Check(node->output.object)) {
        //N: Same as O, except it doesn’t increment the reference count on the object.
        *result = F(Py_BuildValue)("iN", idx, _PyUnicode_Copy(node->output.object));
    }
    else {
        *result = F(Py_BuildValue)("iO", idx, node->output.object);
    }
    return OutputValue;

WojciechMula commented 4 years ago

@gladtosee Could you please provide a patch for this?

yuanchaofa commented 4 years ago

I tried your code and reinstalled the module, but I get an error: Symbol not found: _PyUnicode_DATA. Could you give more details about how you solved the problem?

pombredanne commented 2 years ago

@yuanchaofa Would you mind providing a PR or a patch? It would be much easier to review. Thanks!