Per recent correspondence with Liu Feng, the authors cannot open-source their implementation of coarse-to-fine alignment. He did share this high-level summary (slightly copyedited):
"Step 1: cut the song into short chunks.
Step 2: use the coarse model to compute the similarity between every chunk and the anchor song.
Step 3: use an algorithm such as greedy search (or another method) to get a more accurate alignment.
Step 4: use the more accurate data to fine-tune the coarse model (training the fine model from scratch may also work).
For example, suppose you have songA and its cover version songB, where songA is 120s long and songB is 360s long. SongB may have a different intro, bridge, or verse (songB covers only a segment of songA). You can cut songB into 15s chunks, and some segments of songB can then be detected as a cover of songA, e.g. 100-200s. So if you use songB(100-200s) instead of songB(0-360s), you get a better model."
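The chunk-selection step above can be sketched as follows. This is only a minimal illustration, not the authors' implementation: it assumes you already have one coarse-model similarity score per 15s chunk of songB against songA (computing those scores with the coarse model is not shown), and it uses a simple greedy search that keeps the longest contiguous run of chunks above a threshold.

```python
# Hedged sketch of the greedy chunk-alignment step described above.
# Assumption: `chunk_scores` holds one coarse-model similarity score per
# 15 s chunk of songB vs. songA; the scorer itself is not shown here.

CHUNK_SEC = 15  # chunk length used in the example above


def align_cover_span(chunk_scores, threshold=0.5, chunk_sec=CHUNK_SEC):
    """Greedy search: keep the longest contiguous run of chunks whose
    similarity is >= `threshold`; return its time span (start_sec,
    end_sec), or None if no chunk passes."""
    best = None   # (run_length, start_index) of the best run so far
    start = None  # start index of the run currently being scanned
    # A -inf sentinel score guarantees the final run is closed out.
    for i, score in enumerate(list(chunk_scores) + [float("-inf")]):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            run = (i - start, start)
            if best is None or run > best:
                best = run
            start = None
    if best is None:
        return None
    length, start_idx = best
    return (start_idx * chunk_sec, (start_idx + length) * chunk_sec)


# Toy scores for a 360 s songB cut into 24 chunks of 15 s: only chunks
# 6..13 (90-210 s) resemble songA, roughly matching the 100-200 s
# example in the summary above.
scores = [0.1] * 6 + [0.8] * 8 + [0.2] * 10
print(align_cover_span(scores, threshold=0.5))  # -> (90, 210)
```

The span returned here would then replace the full-length songB as the training pair for step 4 (fine-tuning the coarse model, or training the fine model). A real implementation would likely smooth the per-chunk scores or allow short gaps rather than requiring a strictly contiguous run.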
This coarse-to-fine alignment scheme is described in the CoverHunter research paper at https://ar5iv.labs.arxiv.org/html/2306.09025