fcaspe / dx7pytorch

A musical instrument audio dataset generated on-the-fly using FM synthesis.
GNU General Public License v2.0

Too few patches? (.syx being clobbered?) #1

Open · turian opened this issue 2 weeks ago

turian commented 2 weeks ago

Both your script and DX7-JAX use the same DX7_AllTheWeb zip and the same 118 bytes for each patch.

However, they report: "If you run it on all of DX7_AllTheWeb, then 388,650 presets will be de-duplicated into 44,884."

Your README reports: "From the 140192 patches downloaded, only 29830 are unique."

I suspect these lines are the culprit in your preprocessing: mv -f flattens every patch file into one directory, so .syx files that share a basename silently overwrite each other.

find ./DX7_AllTheWeb -name '*.SYX' -exec mv -f {} ./all_patches/ \;
find ./DX7_AllTheWeb -name '*.syx' -exec mv -f {} ./all_patches/ \;
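
To get a sense of how much is lost that way, here is a rough sketch (mine, untested against the archive) that counts how many .syx files share a basename and would therefore be overwritten:

# Count basename collisions among .syx files before the mv -f flattening.
import collections
import glob
import os

names = collections.Counter(
    os.path.basename(p)
    for p in glob.glob("./DX7_AllTheWeb/**/*", recursive=True)
    if p.lower().endswith(".syx") and os.path.isfile(p)
)
total = sum(names.values())
clobbered = sum(n - 1 for n in names.values() if n > 1)
print(f"{total} .syx files found, {clobbered} would be clobbered by mv -f")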
turian commented 2 weeks ago

I tried applying this patch to make sure I used all the .syx and .SYX files:

+++ b/dataset/generate_dataset.sh
@@ -2,15 +2,16 @@ echo "[INFO] Downloading patch compilation . . ."
 wget http://bobbyblues.recup.ch/yamaha_dx7/patches/DX7_AllTheWeb.zip
 unzip DX7_AllTheWeb.zip

-mkdir all_patches
-echo "[INFO] Searching for all DX7 patch files  . . ."
-find ./DX7_AllTheWeb -name '*.SYX' -exec mv -f {} ./all_patches/ \;
-find ./DX7_AllTheWeb -name '*.syx' -exec mv -f {} ./all_patches/ \;
+#mkdir all_patches
+#echo "[INFO] Searching for all DX7 patch files  . . ."
+#find ./DX7_AllTheWeb -name '*.SYX' -exec mv -f {} ./all_patches/ \;
+#find ./DX7_AllTheWeb -name '*.syx' -exec mv -f {} ./all_patches/ \;
 echo "[INFO] Packing patches onto a single file. This may take a while. . . "
-python3 patchpacker.py ./all_patches
+#python3 patchpacker.py ./all_patches
+python3 patchpacker.py ./DX7_AllTheWeb

 echo "[INFO] Cleaning up . . ."
 rm -f DX7_AllTheWeb.zip
-rm -r -f all_patches
+#rm -r -f all_patches
 rm -r -f DX7_AllTheWeb
-echo "[INFO] Done! You should have a new collection.bin file! "
\ No newline at end of file
+echo "[INFO] Done! You should have a new collection.bin file! "
diff --git a/dataset/patchpacker.py b/dataset/patchpacker.py
index 65894b0..18f4b46 100644
--- a/dataset/patchpacker.py
+++ b/dataset/patchpacker.py
@@ -20,8 +20,9 @@ http://bobbyblues.recup.ch/yamaha_dx7/dx7_patches.html

 import numpy as np
 import os
-from os import listdir
+import glob
 from os.path import isfile, join
+from tqdm import tqdm
 import sys
 from zlib import crc32

@@ -43,7 +44,12 @@ def get_unique(hashlist,patches,n_similar):

 mypath = os.path.abspath(sys.argv[1])

-onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
+paths = [join(mypath, f) for f in ["*.syx", "*.SYX", "**/*.syx", "**/*.SYX"]]
+onlyfiles = []
+for p in paths:
+    onlyfiles += list(glob.glob(p, recursive=True))
+onlyfiles = list(set(onlyfiles))
+onlyfiles = [p for p in onlyfiles if isfile(p)]

 print("Processing {} files. Please wait . . .".format(len(onlyfiles)))

@@ -52,8 +58,8 @@ hashlist = np.empty(0)
 n_total_processed = 0
 n_similar = 0

-for i in range(len(onlyfiles)):
-    filearray = np.fromfile(mypath + '/' + onlyfiles[i], dtype=np.uint8)
+for i in tqdm(list(range(len(onlyfiles)))):
+    filearray = np.fromfile(onlyfiles[i], dtype=np.uint8)
     #Check DX7 MK1 sysex header.
     compare = filearray[0:6] == np.array([0xF0, 0x43, 0x00, 0x09, 0x20, 0x00])
     #Check file size.

Nonetheless, I still can't get the same numbers as the DX7-JAX preprocessing :\

Now I'm at:

Processed 303744 patches. 272817 similar patches filtered.
Compiled patch dataset contains 30927 patches.

I'd love to be able to use dx7pytorch because it's so convenient and fast!

turian commented 2 weeks ago

Studying @DBraun's code a little more:

I am just curious: do you have any intuition about whether these less strict assumptions are safe?

fcaspe commented 2 weeks ago

Hey Joseph! Hope you are doing great! Good catch on the file extensions.

It's nice that you are interested in this (very old) project! The initial array is the SysEx header that identifies the file as a DX7 MK1 bulk dump; it is not included in the hash calculation (check patchpacker.py).
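
For reference, those six header bytes break down like this (my annotation, following the DX7 MIDI spec; the channel nibble in the third byte can differ between dumps):

# DX7 MK1 32-voice bulk dump header, as checked in patchpacker.py
DX7_BULK_HEADER = bytes([
    0xF0,  # start of SysEx
    0x43,  # Yamaha manufacturer ID
    0x00,  # sub-status 0 (voice data), MIDI channel 1
    0x09,  # format 9 = packed 32-voice bulk dump
    0x20,  # byte count MSB: (0x20 << 7) | 0x00 = 4096 data bytes
    0x00,  # byte count LSB
])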

I just make sure that the length of each file is correct (4104 bytes). You could also verify that the checksum adds up for each processed syx, and that the parameter values are in range. This code does not check that the values are in range; it just forces the patches into range by applying bitmasks (check dxdataset.py).
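
If you want the checksum test, a minimal sketch could look like this (assuming the usual layout: 6 header bytes, 4096 packed data bytes, one checksum byte, then 0xF7):

import numpy as np

def dx7_checksum_ok(filearray):
    # Verify a 4104-byte DX7 32-voice bulk dump. The checksum byte is the
    # 7-bit two's complement of the sum of the data bytes, so the sum of
    # data plus checksum must be 0 mod 128.
    if filearray.size != 4104 or filearray[-1] != 0xF7:
        return False
    data = filearray[6:4102]        # 4096 packed voice bytes
    checksum = int(filearray[4102])
    return (int(data.sum()) + checksum) % 128 == 0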

It's very cool that the voices you can salvage are also extracted. But if you don't check the sanity of the file, what's the point, right? You might as well make up random patches to use as data augmentation.

For the uniqueness check I CRC DX7 memory dumps of 32 patches, but as you say, this code does not capture the smaller syx files. I assign one CRC32 to each patch (one 32-bit word extracted from 119 bytes), so it would be better to compare every patch byte by byte, to avoid the potential pitfall where different patches fall into the same CRC. That would certainly increase the number of unique hits, especially if the smaller files are processed as well. I haven't checked how many syx files are rejected, so the number of patches could be much bigger.
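
A byte-exact version could use the packed voice bytes themselves as the set key, so distinct patches can never collide (a sketch, assuming 32 voices of 128 packed bytes each after the 6-byte header):

def unique_voices(dumps):
    # dumps: list of 4104-byte np.uint8 arrays that already passed the
    # header and length checks.
    seen = set()
    voices = []
    for dump in dumps:
        data = dump[6:4102]          # 32 voices x 128 packed bytes
        for i in range(32):
            voice = data[i * 128:(i + 1) * 128].tobytes()
            if voice not in seen:    # exact comparison, no hash collisions
                seen.add(voice)
                voices.append(voice)
    return voices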

I have to warn you though: the C++ synthesis engine in this project is based on the older Hexter FM synth, not on the current Dexed. There was a similar project using a Dexed synth but without PyTorch support: https://github.com/bwhitman/learnfm. You could try wrapping it in a torch Dataset if you want a newer synth engine (rough sketch below).
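
Something along these lines might serve as a starting point (a sketch only; render_patch stands in for whatever rendering call learnfm exposes, not its actual API):

import torch
from torch.utils.data import Dataset

class FMPatchDataset(Dataset):
    def __init__(self, patches, render_patch, note=60, dur=1.0, sr=44100):
        self.patches = patches            # list of 128-byte packed voices
        self.render_patch = render_patch  # (patch, note, dur, sr) -> float np.ndarray
        self.note, self.dur, self.sr = note, dur, sr

    def __len__(self):
        return len(self.patches)

    def __getitem__(self, idx):
        # Synthesize on the fly, one clip per patch.
        audio = self.render_patch(self.patches[idx], self.note, self.dur, self.sr)
        return torch.from_numpy(audio).float()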

Hope this helps! Fran

fcaspe commented 2 weeks ago

I was just thinking about this, so I have updated my previous comment with a correction. In a nutshell, you could get more unique patches if you compare every patch byte by byte instead of with a CRC32, and also process the smaller syx files as you were saying! Let me know if you have any other questions! :) Fran

turian commented 2 weeks ago

I suspect the CRC32 is fine; there won't be many collisions.
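
Back of the envelope (my arithmetic): hashing roughly 4 x 10^5 patches into 2^32 buckets, the birthday approximation predicts around 19 collisions, which is negligible here.

n = 400_000
expected = n * (n - 1) / 2 / 2**32   # birthday approximation
print(f"~{expected:.0f} expected CRC32 collisions")   # ~19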

I am not sure whether the strict or permissive approach to interpreting SYX is better. Strict seems safer.