Closed: altec404 closed this issue 7 years ago.
The system allocates a fixed array of numhashes × bucketsize × 4 bytes. With nhashbits==20 (the default), numhashes is 2**20, so a bucketsize of 512 pushes the array size past what can be stored in a signed 32-bit int.
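For concreteness, the arithmetic (a quick check, assuming numhashes = 2**nhashbits):

```python
# Back-of-envelope size of the hash table,
# assuming numhashes = 2**nhashbits (1,048,576 buckets at the default).
nhashbits = 20
numhashes = 2 ** nhashbits

for bucketsize in (511, 512):
    nbytes = numhashes * bucketsize * 4
    print(bucketsize, nbytes, nbytes <= 2 ** 31 - 1)
# 511 -> 2,143,289,344 bytes: still fits in a signed 32-bit int
# 512 -> 2,147,483,648 bytes: exactly 2**31, one past INT32_MAX
```

That is why 511 is the largest bucketsize that works at the default nhashbits.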
I can verify that I get the same problem on my 64-bit MacOS system. The error is actually in saving the large data structure:
dpwe@dpwe-macbookpro2:~/projects/audfprint$ ~/homebrew/bin/python audfprint.py new --bucketsize 512 --dbase fpdbase.pkl Nine_Lives/01*.mp3
Mon May 1 09:16:23 2017 ingesting #0: Nine_Lives/01-Nine_Lives.mp3 ...
Added 194 hashes (19.4 hashes/sec)
Processed 1 files (10.0 s total dur) in 0.0 s sec = 0.005 x RT
Traceback (most recent call last):
File "audfprint.py", line 482, in <module>
main(sys.argv)
File "audfprint.py", line 477, in main
hash_tab.save(dbasename)
File "/Users/dpwe/projects/audfprint/hash_table.py", line 168, in save
pickle.dump(self, f, pickle.HIGHEST_PROTOCOL)
SystemError: error return without exception set
Some quick googling suggests the problem is in the underlying MacOS implementation (I'm using Python 2.7.9 from Homebrew):
http://stackoverflow.com/questions/31468117/python-3-can-pickle-handle-byte-objects-larger-than-4gb
...which suggests a workaround: replace the single pickle.dump call with code that writes the pickled data in smaller pieces (a sketch follows below). But I don't know whether it's worth putting a workaround like that into the main code base; such large databases are perhaps not the ideal use case.
DAn.
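For reference, the workaround from that Stack Overflow thread looks roughly like this (a sketch only, not wired into audfprint; the helper names and the MAX_BYTES constant are mine):

```python
import pickle

# Largest block the broken platform read/write path handles in one call.
MAX_BYTES = 2 ** 31 - 1

def chunked_pickle_dump(obj, filename):
    # Serialize to one bytes object in memory, then write it in slices
    # so that no single f.write() call exceeds the 32-bit limit.
    data = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
    with open(filename, 'wb') as f:
        for i in range(0, len(data), MAX_BYTES):
            f.write(data[i:i + MAX_BYTES])

def chunked_pickle_load(filename):
    # Read back in bounded slices and unpickle the reassembled buffer.
    chunks = []
    with open(filename, 'rb') as f:
        while True:
            chunk = f.read(MAX_BYTES)
            if not chunk:
                break
            chunks.append(chunk)
    return pickle.loads(b''.join(chunks))
```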
On Mon, May 1, 2017 at 5:18 AM, altec404 notifications@github.com wrote:
Hello dpwe, the error occurred when I set --bucketsize to 512 or more. Why is the maximum bucketsize 511? Sorry for my poor English. Thanks.
Hello DAn. Thanks for your reply, and thanks for your advice. The same error happened in the Matlab version. So I changed hash_table.py:
line 17: import joblib

near line 168:
    with gzip.open(name, 'wb') as f:
        joblib.dump(self, f)

near line 195:
    with gzip.open(name, 'rb') as f:
        temp = joblib.load(f)
It looks to be working well for now.
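Spelled out, the modified save/load might look like this (a paraphrase of the change above, not the actual hash_table.py source; the method signatures and the state-restoring line are assumptions):

```python
import gzip

import joblib  # the line-17 addition

class HashTable(object):
    # ... existing fields, including the large numpy table ...

    def save(self, name):
        # joblib streams large numpy arrays in manageable chunks, so it
        # avoids the single oversized write that made pickle.dump fail.
        with gzip.open(name, 'wb') as f:
            joblib.dump(self, f)

    def load(self, name):
        with gzip.open(name, 'rb') as f:
            temp = joblib.load(f)
        # Assumption: the real method copies the loaded object's state
        # back into this instance along these lines.
        self.__dict__.update(temp.__dict__)
```

Note that joblib.dump can also compress on its own (via its compress argument), which could stand in for the explicit gzip wrapper.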
Add tracks from here, with bucketsize 1024:
[mori@qb-AWave-Server:~/audfprint$] python audfprint.py new --dbase /home/audfprint/tiny/test.pklz --bucketsize 1024 /mnt/mp3/mp3/AB/AB000*.mp3
Tue May 2 15:47:34 2017 ingesting #0: /mnt/mp3/mp3/AB/AB00001.mp3 ...
Tue May 2 15:47:36 2017 ingesting #1: /mnt/mp3/mp3/AB/AB00002.mp3 ...
Tue May 2 15:47:37 2017 ingesting #2: /mnt/mp3/mp3/AB/AB00003.mp3 ...
Tue May 2 15:47:39 2017 ingesting #3: /mnt/mp3/mp3/AB/AB00004.mp3 ...
...
Added 326827 hashes (19.2 hashes/sec)
Processed 99 files (17018.8 s total dur) in 79.4 s sec = 0.005 x RT
Saved fprints for 99 files ( 326827 hashes) to /home/audfprint/tiny/test.pklz
Dropped hashes= 0 (0.00%)
Match an edited mp3 file from here:
[mori@qb-AWave-Server:~/audfprint$] python audfprint.py match --dbase /home/audfprint/tiny/test.pklz /home/audfprint/test47tk-7s.mp3
Tue May 2 16:07:53 2017 Reading hash table /home/audfprint/tiny/test.pklz
Read fprints for 99 files ( 326827 hashes) from /home/audfprint/tiny/test.pklz
Tue May 2 16:08:06 2017 Analyzed #0 /home/audfprint/test47tk-7s.mp3 of 7.221 s to 383 hashes
Matched /home/audfprint/test47tk-7s.mp3 7.2 sec 383 raw hashes as /mnt/mp3/mp3/AB/AB00047.mp3 at 21.1 s with 19 of 20 common hashes at rank 0
Processed 1 files (7.5 s total dur) in 13.8 s sec = 1.828 x RT
I want to add 500k+ tracks and keep the number of dropped hashes small, which is why I increased the bucket size (my idea might be wrong, though; see the rough capacity check below).
Thanks so much.
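The reasoning looks right: the table holds at most 2**nhashbits × bucketsize entries, and hashes that land in a full bucket get dropped. A rough capacity check, using the per-track hash count from the 99-file run above (everything else here is back-of-envelope assumption):

```python
nhashbits = 20
bucketsize = 1024
capacity = 2 ** nhashbits * bucketsize   # ~1.07e9 hash slots (a 4 GiB table)

hashes_per_track = 326827 / 99.0         # ~3300, from the run above
tracks = 500000
needed = tracks * hashes_per_track       # ~1.65e9 hashes

print(needed / capacity)                 # ~1.5: even bucketsize 1024 fills up
                                         # for 500k tracks at this hash density
```

And since hashes are not spread uniformly across buckets, some dropping would start well before the table is nominally full.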
Wow, I didn't know about joblib.dump, but it sounds like a much better match to the audfprint use case. Thank you!
DAn.