match is thread safe.
I analyzed the function after posting this issue and arrived at that conclusion. I am not familiar with Cython, but do you think it would be possible to release the GIL only for the match function? This has the following advantages:
From what I have read, nogil requires that no Python code is executed. However, the match function manipulates several Python objects, such as xstring or the trie. It would be necessary to change match from a def function to a cdef one, and the yield keyword could no longer be part of the function.
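To make that concrete, here is a toy Cython sketch (my own illustration, not cyac's code) of the shape a nogil function has to take:

```cython
# A toy sketch of what nogil demands: a cdef function whose body touches
# only C-level data. Note the C types everywhere and the absence of
# yield -- matches would have to be written into pre-allocated C memory
# instead of being yielded one by one.
cdef int count_char(const char* text, int n, char target) nogil:
    cdef int i, count = 0
    for i in range(n):
        if text[i] == target:
            count += 1
    return count
```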
@OblackatO could you please give a patch? It seems that you know the GIL better than I do.
Thank you for your answer @nppoly. First of all, let me just tell you that your library is probably the best Python library I have found in terms of speed and memory efficiency for Aho-Corasick.
... could you please give a patch?
Before posting this issue I had already tried, and I do not think I can. As I said, I am not familiar with Cython, but I learnt a bit about it to better understand the code of this library. It seems that Cython's nogil feature requires that the code contain no Python objects at all; only then can the GIL be released. This makes perfect sense to me. From the Cython docs we can read that:
... Code in the body of the with-statement must not manipulate Python objects in any way, and must not call anything that manipulates Python objects without first re-acquiring the GIL
The code in this library has Python objects, the AC and the trie, which makes me think that in order to release the GIL with Cython's nogil feature, the majority of this library's code would need to be rewritten (but I might be wrong, I am no Cython expert).
Even so, here is what I tried already (before posting this issue).
I know this is not a lot, but it is what I could do without starting to rewrite all the objects of the library. What we really need is to make the match method usable with the nogil feature, because that way we could have several threads sharing the same AC. But since the match function is part of a Python object, this is not possible.
Just to conclude this, sharing the AC automaton would have the following advantages:
@nppoly I might have an idea to share the AC with the new shared memory features of Python 3.8 without rewriting anything at all in this library, but I need to ask you something first. In the file ac.pyx, the function __getstate__():
array_to_bytes(<char*>self.output, self.trie.array_size * sizeof(OutNode)),
array_to_bytes(<char*>self.fails, self.trie.array_size * sizeof(int)),
array_to_bytes(<char*>self.key_lens, self.trie.leaf_size * sizeof(unsigned int)),
Are all the elements cast with <char*> pointers?
The same question for the function __setstate__():
self.trie, output, fails, key_lens = data
self.output = <OutNode*> bytes_to_array(output, self.trie.array_size * sizeof(OutNode))
self.fails = <int*> bytes_to_array(fails, self.trie.array_size * sizeof(int))
self.key_lens = <unsigned int*> bytes_to_array(key_lens, self.trie.leaf_size * sizeof(unsigned int))
Are <OutNode*>, <int*> and <unsigned int*> pointers?
If yes, this means that when we use __getstate__() we are just creating pointers to the original AC, and not copying memory, right?
@OblackatO The cedar trie allocates three arrays and the AC automaton allocates three arrays; they don't allocate any other memory, so sharing the memory of these arrays will work. But array_to_bytes and bytes_to_array actually copy memory (if they didn't copy, a segmentation fault would happen during pickling).
Can we use mmap? Add one method that writes the data into a file, and one method that inits the automaton from an array.
... but array_to_bytes and bytes_to_array actually copy memory (if they didn't copy, a segmentation fault would happen during pickling).
Ok, good to know. Then I guess my idea will not do any good. I wanted to convert the AC to bytes, share it using shared memory (or mmap as you suggested, just writing to memory directly instead of into a file), and then access that piece of memory in other processes. However, because bytes_to_array copies, each process would end up with its own copy of the AC rather than sharing memory: whenever we convert from bytes to the actual AC, memory is copied.
The only two possibilities I can think of are to actually release the GIL with Cython, which we all know the implications of, OR to use numba with its [nogil](http://numba.pydata.org/numba-doc/dev/user/jit.html) feature.
Numba is basically a very easy-to-use library that allows us to release the GIL; however, again, Python code cannot be used, which is normal. So the question now is: @nppoly, in the description of the project you wrote "It's implemented by cython, and will be compiled to cpp", so the actual code being called in the match function is cpp code? If that is the case I might be able to make it work with numba.
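Just to illustrate what I mean by numba's nogil feature, here is a toy example of the pattern (nothing to do with the AC internals): numba can only release the GIL on code compiled in nopython mode, i.e. code that holds no Python objects.

```python
import threading

import numpy as np
from numba import jit

# The compiled function holds no Python objects, so numba can release
# the GIL and the threads below actually run in parallel.
@jit(nopython=True, nogil=True)
def count_below(values, threshold):
    count = 0
    for v in values:
        if v < threshold:
            count += 1
    return count

data = np.random.rand(10_000_000)
count_below(data, 0.5)  # warm up the JIT compiler once
threads = [threading.Thread(target=count_below, args=(data, 0.5))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```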
This would allow us to:
If we use mmap, there's no need to copy data. We can write the data in our own format and get rid of pickle. I can implement it, it's easy. @OblackatO
I see, that is great news. Could you just confirm that you can indeed:
If you can do that, I can implement a shareable AC between several Python processes using System V shared memory. This would be amazing; there is not a single Python library that can do that. However, this would require this project to use Python 3.8 or higher. Is that a problem? If so, we could just put that patch in a separate branch.
@OblackatO I have implemented an api which can init the trie or the AC from a buffer, you can try it. Python's mmap supports the buffer protocol: https://github.com/python/cpython/blob/74b4eda98b650245216292ace8d7e30d3997baa7/Modules/mmapmodule.c.
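For example, something like this (a minimal sketch; it assumes an ac_trie file saved beforehand with the library's save routine):

```python
import mmap

from cyac import AC

# mmap implements the buffer protocol, so the mapping can be handed
# to AC.from_buff directly.
with open("ac_trie", "r+b") as bf:
    mm = mmap.mmap(bf.fileno(), 0)
    ac = AC.from_buff(mm, copy=False)  # zero-copy view over the mapping
    for id, start, end in ac.match("some text to search"):
        print(id, start, end)
```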
Ok, I will give it a try and do some memory analysis. :) I will get back to you when I am done.
I am done with the memory analysis. This is going to be a big post, because there are many things that I need to state and that you should take into consideration.
First of all, I used memory-profiler to do this memory analysis. When I was in doubt whether the results of memory-profiler were accurate, I analyzed the memory using the ps tool, with the following command:
ps -p <process_id> -o pcpu,rss,pid,command
where rss is the Resident Set Size (more simply, the actual physical memory in use).
I tried to construct several automatons with the argument copy=False and see how the memory scaled. The code is here. The memory analysis produced the following output:
Filename: shared_buffer_test2.py
Line # Mem usage Increment Occurences Line Contents
============================================================
28 143.9 MiB 143.9 MiB 1 @profile
29 def load_and_search():
30 143.9 MiB 0.0 MiB 1 with open("ac_trie", "r+b") as bf:
31 143.9 MiB 0.0 MiB 1 mm = mmap.mmap(bf.fileno(), 0)
32 143.9 MiB 0.0 MiB 1 ac_first = AC.from_buff(mm, copy=False) # it shares memory
33 143.9 MiB 0.0 MiB 1 ac_second = AC.from_buff(mm, copy=False) # it shares memory
34 143.9 MiB 0.0 MiB 1 ac_third = AC.from_buff(mm, copy=False) # it shares memory
35 143.9 MiB 0.0 MiB 1 ac_four = AC.from_buff(mm, copy=False) # it shares memory
36 143.9 MiB 0.0 MiB 1 print("Matches after loading shared buffer:")
37 143.9 MiB 0.0 MiB 3 for id, start, end in ac_first.match(string_to_search):
38 143.9 MiB 0.0 MiB 2 print(id, string_to_search[start:end])
We can see that in lines 33, 34 and 35 the memory does not increase when creating new automatons. After analyzing from_buff in ac.pyx, this is the expected behavior: the memory is indeed shared between all the ac variables. Note that the 143.9 MiB are not entirely from the AC; the patterns and the other variables also count.
I did the same thing as in Test 2, but this time I set copy=True. Memory analysis output:
Line # Mem usage Increment Occurences Line Contents
============================================================
28 144.8 MiB 144.8 MiB 1 @profile
29 def load_and_search():
30 144.8 MiB 0.0 MiB 1 with open("ac_trie", "r+b") as bf:
31 144.8 MiB 0.0 MiB 1 mm = mmap.mmap(bf.fileno(), 0)
32 240.9 MiB 96.1 MiB 1 ac_first = AC.from_buff(mm, copy=True)
33 288.9 MiB 48.0 MiB 1 ac_second = AC.from_buff(mm, copy=True)
34 336.8 MiB 48.0 MiB 1 ac_third = AC.from_buff(mm, copy=True)
35 384.8 MiB 48.0 MiB 1 ac_four = AC.from_buff(mm, copy=True)
36 384.8 MiB 0.0 MiB 1 print("Matches after loading shared buffer:")
37 384.8 MiB 0.0 MiB 3 for id, start, end in ac_first.match(string_to_search):
38 384.8 MiB 0.0 MiB 2 print(id, string_to_search[start:end])
In lines 33, 34 and 35 the memory increased as expected. This is a clear sign that the copy argument is functioning properly. We can also see that each copy of the AC takes around 48 MiB.
Code here. I created 6 Python processes and shared the AC automaton between them. I also tried to find some matches within each process to make sure the AC automaton worked properly. Before diving into this, @nppoly: Python processes cannot share memory through mmap by passing the mapping around, because mmap instances cannot be pickled. This is one of the reasons behind the shared_memory features of Python 3.8. So when you say in the readme.md that "we can use mmap to load file, share data between processes.", this is true but misleading. We can indeed share memory between processes with mmap, but not by handing an mmap object to a Python Process. Let me illustrate this with a quick example. If I create a Python process like this:
p = Process(
    target=get_shared_memory_find_matches,
    args=(
        "process_" + str(x),
        shared_memory_tag,
        ac_size
    )
)
and put an mmap instance in the args, it will raise a TypeError, because the mmap object cannot be pickled, and pickling is required to pass the data to the child process. We could use the os.fork() API instead, but this is not the best solution.
In the code I am simply putting the contents of the mmap instance into the shared memory buffer. I also set all unnecessary variables to None before creating the child processes; otherwise they will be (rather, will almost always be) copied into the child processes. For instance, if the var ac_patterns, which holds all the patterns used to build the AC, is not set to None, the garbage collector is not going to kick in and we will see an additional 20 MBytes in the child processes.
The line: print("Size of AC automaton:Total patterns: {}bytes:{}patterns".format(ac_size, total_patterns)) outputs the following: Size of AC automaton:Total patterns: 50476096bytes:759000patterns
50476096 bytes is around 50 MBytes. Given that the AC automaton has 759k patterns, this makes total sense to me, mostly because of the memory analysis I did comparing pyahocorasick and cyac.
Finally, I analyzed the memory of each process individually.
Executing search in process_0
...
Filename: shared_buffer_processes.py
Line # Mem usage Increment Occurences Line Contents
============================================================
41 33.5 MiB 33.5 MiB 1 @profile
42 def get_shared_memory_find_matches(processname, shared_memory_tag, ac_size):
43 33.5 MiB 0.0 MiB 1 shm = shared_memory.SharedMemory(shared_memory_tag)
44 33.5 MiB 0.0 MiB 1 AC_in_bytes = shm.buf[0:ac_size]
45 33.6 MiB 0.1 MiB 1 ac_in_process = AC.from_buff(AC_in_bytes, copy=False)
46 33.6 MiB 0.0 MiB 1 string_to_search = "asdpythonasdasdruby"
47 33.6 MiB 0.0 MiB 1 print("Executing search in {}".format(processname))
48 33.6 MiB 0.0 MiB 3 for id, start, end in ac_in_process.match(string_to_search):
49 33.6 MiB 0.0 MiB 2 print(id, string_to_search[start:end])
50 #time.sleep(100)
51 33.6 MiB 0.0 MiB 1 ac_in_process = None
52 33.6 MiB 0.0 MiB 1 AC_in_bytes.release() # MUST release memory before closing the shm instance, otherwise an error is raised.
53 33.9 MiB 0.3 MiB 1 shm.close()
I had the exact same results for the remaining 5 processes. Not only can we see that the memory does not increase when getting the shared bytes, but we can also see that the memory used is less than the 50 MBytes of the AC automaton, which resides in the Virtual Address Space of the main process. The 33.6 MiB can be justified by the fact that each child process launches a brand new Virtual Address Space with a brand new Python interpreter. I am also pretty sure of this because, if I increase the number of iterations in this loop:
for x in range(0, 1000):
    for key in patterns_dict:
        for value in patterns_dict[key]:
            total_patterns += 1
            ac_patterns.append(value + str(random.randint(0, 10000)))
from 1k to 10k, the memory in the child processes remains stable at 34-36 MiB, even though the AC is a lot bigger. This is, in my humble opinion, the biggest proof that the processes are indeed using shared memory: the exact same piece of physical memory where the AC automaton is allocated.
I was not so sure about this, but after running the ps command described at the beginning, and seeing the child processes in a tree in htop, the results were exactly the same as memory-profiler's. Literally the same, not just similar.
Finally, after this memory analysis, my humble conclusion is that the shared buffer works as expected. This is just amazing. You have no idea for how long I have been trying to do something like this efficiently. You can read more about it in other Python libraries that also implement the Aho-Corasick algorithm, here and here. Thank you for your time, your interest and this great library! I highly appreciate it.
Just one question for you, @nppoly: would you like me to write a function so that users can easily use shared memory? One function would do all the shared memory work and return a string that is the shared memory tag; another function would take that shared memory tag as an arg and build the AC from shared memory, similar to what I did in Test 3.
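Roughly what I have in mind, modeled on my Test 3 code (the names share_ac and ac_from_shared are just placeholders):

```python
from multiprocessing import shared_memory

from cyac import AC

def share_ac(ac_bytes):
    # Copy the serialized AC into a named shared memory segment once,
    # in the parent; the returned tag and size are cheap to pass to
    # children. Keep the returned shm referenced so it stays alive.
    shm = shared_memory.SharedMemory(create=True, size=len(ac_bytes))
    shm.buf[:len(ac_bytes)] = ac_bytes
    return shm.name, len(ac_bytes), shm

def ac_from_shared(tag, size):
    # Attach inside a child and build a zero-copy AC over the segment.
    # As in Test 3, the caller must release buf before closing shm.
    shm = shared_memory.SharedMemory(tag)
    buf = shm.buf[:size]
    ac = AC.from_buff(buf, copy=False)
    return ac, shm, buf
```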
One final remark: it would be a good idea to state in the readme.md that the match function is thread/process safe, and to modify the sentence "we can use mmap to load file, share data between processes.", which can IMO be misleading, as explained earlier. But that is up to you :)
edit: changed the wording of some phrases.
For mmap, you should open the file and call mmap in every process; the operating system knows that you are sharing this file.
""" Would you like me to write a function so that users can easily use shared memory. """
it's most welcome.
For mmap, you should open the file and call mmap in every process; the operating system knows that you are sharing this file.
Yes, you are right. I was just trying to point out that we cannot do it by passing the mapping as an argument to child Python processes.
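For reference, a minimal sketch of the approach you describe, assuming the ac_trie file from my earlier tests:

```python
import mmap
from multiprocessing import Process

from cyac import AC

def worker(path, text):
    # Each child opens and maps the file itself, so nothing unpicklable
    # crosses the process boundary; the OS shares the mapped pages.
    with open(path, "r+b") as bf:
        mm = mmap.mmap(bf.fileno(), 0)
        ac = AC.from_buff(mm, copy=False)
        for id, start, end in ac.match(text):
            print(id, text[start:end])

if __name__ == "__main__":
    procs = [Process(target=worker, args=("ac_trie", "asdpythonasdasdruby"))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```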
it's most welcome.
Ok, I will do it.
@nppoly I reconsidered implementing the shared_memory part and I do not think it is a good idea anymore; it could bring more confusion than good. I added an example with multiprocessing to the readme.md and referenced this issue. I made pull request #5 as well. If someone needs shared_memory in the future, I could easily implement it in another branch. Besides, an example can already be found in this issue.
Is this library thread safe after calling the build function? If the trie is built and several threads access it using the match function, would this be thread-safe?