dpwe / audfprint

Landmark-based audio fingerprinting
MIT License

About max bucketsize. #28

Closed altec404 closed 7 years ago

altec404 commented 7 years ago

Hello dpwe, I get an error when I set --bucketsize to 512 or more. Why is the maximum bucket size limited to 511? Sorry for my poor English. Thanks.

dpwe commented 7 years ago

The system allocates a fixed array of numhashes * bucketsize * 4 bytes. With nhashbits == 20 (the default), a bucketsize of 512 pushes the array size to a number larger than can be stored in a 32-bit int.
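
For concreteness, here is the arithmetic behind that limit as a quick back-of-envelope check (my own sketch, not audfprint code):

    # nhashbits == 20 means 2**20 distinct hash values (buckets), and each
    # bucket holds `bucketsize` entries of 4 bytes each.
    numhashes = 2 ** 20
    for bucketsize in (511, 512):
        nbytes = numhashes * bucketsize * 4
        print("%d -> %d bytes" % (bucketsize, nbytes))
    # 511 -> 2143289344 bytes: still below 2**31 - 1 = 2147483647
    # 512 -> 2147483648 bytes: just past the largest signed 32-bit int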

I verified that I get the same problem on my 64-bit MacOS system. The error is actually in saving the large data structure:

dpwe@dpwe-macbookpro2:~/projects/audfprint$ ~/homebrew/bin/python audfprint.py new --bucketsize 512 --dbase fpdbase.pkl Nine_Lives/01*.mp3
Mon May  1 09:16:23 2017 ingesting #0: Nine_Lives/01-Nine_Lives.mp3 ...
Added 194 hashes (19.4 hashes/sec)
Processed 1 files (10.0 s total dur) in 0.0 s sec = 0.005 x RT
Traceback (most recent call last):
  File "audfprint.py", line 482, in <module>
    main(sys.argv)
  File "audfprint.py", line 477, in main
    hash_tab.save(dbasename)
  File "/Users/dpwe/projects/audfprint/hash_table.py", line 168, in save
    pickle.dump(self, f, pickle.HIGHEST_PROTOCOL)
SystemError: error return without exception set

Some quick googling suggests the problem is in the underlying MacOS implementation (I'm using Python 2.7.9 from Homebrew):

http://stackoverflow.com/questions/31468117/python-3-can-pickle-handle-byte-objects-larger-than-4gb

... which suggests a workaround: replace the single pickle.dump call with writing the pickled data in smaller pieces. I don't know if it's worth putting a workaround like that into the main code base; such large databases are perhaps not the ideal use case.
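
For reference, this is the kind of chunked-write workaround that answer describes. It's only a minimal sketch (the function name and the plain, non-gzipped file handling are mine, not audfprint's), and it still builds the whole pickle in memory before writing:

    import pickle

    MAX_WRITE = 2 ** 31 - 1  # keep each write comfortably below the failing size

    def pickle_dump_chunked(obj, filename, chunk=MAX_WRITE):
        # Serialize to a bytes object first, then write it out in pieces so
        # that no single f.write() call has to handle the whole multi-GB buffer.
        data = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
        with open(filename, 'wb') as f:
            for i in range(0, len(data), chunk):
                f.write(data[i:i + chunk])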

DAn.


altec404 commented 7 years ago

Hello DAn. Thanks for your reply and your advice. The same error happened in the Matlab version, so I changed hash_table.py:

Line 17: add

    import joblib

Near line 168, inside "with gzip.open(name, 'wb') as f:", replace

    pickle.dump(self, f, pickle.HIGHEST_PROTOCOL)

with

    joblib.dump(self, f)

Near line 195, inside "with gzip.open(name, 'rb') as f:", replace

    temp = pickle.load(f)

with

    temp = joblib.load(f)

Then it looks to be working well for now.
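
For reference, a sketch of roughly what the changed methods look like after that edit; the signatures here are simplified assumptions, and the real save/load in hash_table.py do more bookkeeping around these calls:

    import gzip
    import joblib

    class HashTable(object):
        def save(self, name):
            # near line 168: joblib's numpy-aware pickler avoids the single
            # huge write that fails on MacOS
            with gzip.open(name, 'wb') as f:
                joblib.dump(self, f)

        def load(self, name):
            # near line 195: read back with the matching joblib.load
            with gzip.open(name, 'rb') as f:
                temp = joblib.load(f)
            return temp

joblib is a separate package (pip install joblib).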

Adding tracks, with bucketsize 1024:

[mori@qb-AWave-Server:~/audfprint$] python audfprint.py new --dbase /home/audfprint/tiny/test.pklz --bucketsize 1024 /mnt/mp3/mp3/AB/AB000*.mp3

Tue May 2 15:47:34 2017 ingesting #0: /mnt/mp3/mp3/AB/AB00001.mp3 ...
Tue May 2 15:47:36 2017 ingesting #1: /mnt/mp3/mp3/AB/AB00002.mp3 ...
Tue May 2 15:47:37 2017 ingesting #2: /mnt/mp3/mp3/AB/AB00003.mp3 ...
Tue May 2 15:47:39 2017 ingesting #3: /mnt/mp3/mp3/AB/AB00004.mp3 ...
. . .
Added 326827 hashes (19.2 hashes/sec)
Processed 99 files (17018.8 s total dur) in 79.4 s sec = 0.005 x RT
Saved fprints for 99 files ( 326827 hashes) to /home/audfprint/tiny/test.pklz
Dropped hashes= 0 (0.00%)

Matching an edited mp3 file:

[mori@qb-AWave-Server:~/audfprint$] python audfprint.py match --dbase /home/audfprint/tiny/test.pklz /home/audfprint/test47tk-7s.mp3

Tue May 2 16:07:53 2017 Reading hash table /home/audfprint/tiny/test.pklz
Read fprints for 99 files ( 326827 hashes) from /home/audfprint/tiny/test.pklz
Tue May 2 16:08:06 2017 Analyzed #0 /home/audfprint/test47tk-7s.mp3 of 7.221 s to 383 hashes
Matched /home/audfprint/test47tk-7s.mp3 7.2 sec 383 raw hashes as /mnt/mp3/mp3/AB/AB00047.mp3 at 21.1 s with 19 of 20 common hashes at rank 0
Processed 1 files (7.5 s total dur) in 13.8 s sec = 1.828 x RT

I want to add 500k+ tracks and keep the number of dropped hashes small, so I'm increasing the bucket size. (But my approach might be wrong.)
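
A rough capacity check on that idea (my own arithmetic, extrapolating the per-track hash count from the 99-file run above, so treat the numbers as ballpark only):

    hashbits = 20                            # default nhashbits: 2**20 buckets
    bucketsize = 1024                        # as used in the run above
    capacity = (2 ** hashbits) * bucketsize  # ~1.07e9 table entries
    hashes_per_track = 326827 / 99.0         # ~3300 hashes/track in that ingest
    expected = 500000 * hashes_per_track     # ~1.65e9 hashes for 500k tracks
    print("capacity %d, expected %d" % (capacity, expected))
    # Buckets fill unevenly, so some hashes get dropped well before the table
    # is nominally full; a larger bucketsize (or more hash bits) pushes that
    # point out at the cost of memory.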

Thanks so much.

dpwe commented 7 years ago

Wow, I didn't know about joblib.dump, but it sounds like a much better match to the audfprint use case. Thank you!

DAn.
