bytedance / GiantMIDI-Piano

Error when downloading MP3 audio data #6

Open andres-fr opened 2 years ago

andres-fr commented 2 years ago

Hi!

Thanks for this amazing work.

I've encountered a small issue while downloading the audio data. My first impression is that it is related to files not being available anymore. Here is the log:

python3 dataset.py download_youtube_piano_solo --workspace=$WORKSPACE --begin_index=0 --end_index=30000
[nltk_data] Downloading package punkt to /home/aferro/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/aferro/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
0; Jag A.; Je t'aime Juliette; Je t'aime Juliette - A. Jag
ERROR: Video unavailable

[]
1; C. A. Aadler; Floating Islands; Mind-Boggling Off-Grid FLOATING Island HOMESTEAD
Traceback (most recent call last):
  File "dataset.py", line 660, in <module>
    download_youtube_piano_solo(args)
  File "dataset.py", line 565, in download_youtube_piano_solo
    if float(meta_dict['piano_solo_prob'][n]) >= 0.5:
ValueError: could not convert string to float: ''

As we can see, the meta_dict['piano_solo_prob'][n] entry is expected to yield a string that can be cast to a float, e.g. something like "0.12345". But sometimes it yields an empty string, which cannot be cast to a float.

Without analyzing too much of the code, a possible fix could be the following:

try:
    prob = float(meta_dict['piano_solo_prob'][n])
except ValueError as ve:
    print("SKIPPING ENTRY DUE TO ERROR:", ve)
    n += 1
    continue

if prob >= 0.5:
    count += 1
    ...etc

So far this seems to run OK on my end, yielding the desired audio data and logs under "$WORKSPACE", but I'm not sure if we're supposed to ignore the empty n entries, or rather fix them so no empty entries are provided. What do you think? If this looks OK feel free to use the code, or let me know if you'd like me to do a PR.
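For reference, a minimal self-contained sketch of the skip-on-empty approach described above. The names `parse_piano_solo_prob` and `select_piano_solo_indices` and the sample `meta_dict` are hypothetical and not taken from dataset.py; only the `piano_solo_prob` key and the 0.5 threshold come from the traceback.

```python
def parse_piano_solo_prob(raw):
    """Return the probability as a float, or None if the field is empty or malformed."""
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


def select_piano_solo_indices(meta_dict, threshold=0.5):
    """Return the indices whose piano_solo_prob parses and is >= threshold."""
    selected = []
    for n, raw in enumerate(meta_dict['piano_solo_prob']):
        prob = parse_piano_solo_prob(raw)
        if prob is None:
            # Empty or malformed entry: log it and move on instead of crashing.
            print("SKIPPING ENTRY", n, "DUE TO UNPARSABLE PROBABILITY:", repr(raw))
            continue
        if prob >= threshold:
            selected.append(n)
    return selected


# Hypothetical sample data; the second entry mimics the empty string from the log.
meta_dict = {'piano_solo_prob': ['0.98', '', '0.12', '0.5']}
selected = select_piano_solo_indices(meta_dict)
```

Iterating with `enumerate` also avoids the manual `n += 1` before `continue`, which is easy to forget in a `while` loop.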

Cheers,
Andres

qiuqiangkong commented 2 years ago

Hi Andres,

Thank you very much for pointing out the bugs! I believe your solutions are right!

To acquire the transcribed files, please see here:

https://github.com/bytedance/GiantMIDI-Piano/blob/master/README.md

Please let me know if there are further questions!

Best wishes,

Qiuqiang

andres-fr commented 2 years ago

Dear Qiuqiang,

Thanks for the quick reply. I was indeed following those instructions, and the issues encountered were within the dataset.py script. I just have a few remarks/questions before closing this issue.

As of April 11, 2022, I was able to download 9825 mp3 files, as opposed to your 10855. That would be 90% of the original size, i.e. 10% loss of data in 3-4 months, right? So does this dataset share the same issues as AudioSet regarding data loss?
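The retention figure can be checked with a few lines of arithmetic, using only the two file counts quoted above:

```python
original_count = 10855    # mp3 files in the original download
downloaded_count = 9825   # mp3 files obtained as of April 11, 2022

retained = downloaded_count / original_count
loss = 1 - retained

print(f"retained: {retained:.1%}, lost: {loss:.1%}")
```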

But the most important question for me is:

If I want to pair the audio files with the MIDIs, can I still use the MIDI files that you rendered in January, or do I have to re-render them? (i.e. are my 9825 files a strict subset of your 10855, or are we doing some soft matching?)
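One way to test the strict-subset question locally is to compare base filenames. This is only a sketch: it assumes that each mp3 and its MIDI share the same filename apart from the extension, which is not confirmed in this thread, and the function names are hypothetical.

```python
from pathlib import Path


def stems(directory, suffix):
    """Collect base filenames (without extension) of files ending in suffix."""
    return {p.stem for p in Path(directory).glob(f"*{suffix}")}


def compare_audio_to_midi(audio_dir, midi_dir):
    """Report whether the audio files are a strict subset of the MIDI files."""
    audio = stems(audio_dir, ".mp3")
    midi = stems(midi_dir, ".mid")
    return {
        # True only if every audio file has a MIDI and some MIDIs lack audio.
        "is_strict_subset": audio < midi,
        "audio_without_midi": sorted(audio - midi),
        "midi_without_audio": sorted(midi - audio),
    }
```

If `audio_without_midi` comes back empty, exact name matching is enough and no soft matching is needed for the pairing.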

Cheers and thanks again!
Andres

qiuqiangkong commented 2 years ago

Hi Andres,

The 10855 files were downloaded in Apr. 2020, so it is a 10% loss of data in 2 years. This dataset shares the same issue as AudioSet regarding data loss, which is also approximately 10%.

The GiantMIDI-Piano version released in January uses the same audio files downloaded in Apr. 2020, so it is encouraged to use the provided version.

Best wishes,

Qiuqiang

andres-fr commented 2 years ago

Thank you for your reply. Indeed 10% in 2 years is much better than what I was picturing.

In the last 3 days, I was able to render the MIDIs for the 9825 files on a laptop with an RTX 3070 GPU, running 3 processes in parallel (since each process took 2350MB of GPU memory, I was able to fit 3 of them in 7000MB).

Therefore, my issues with consistency are solved; the only problem left would be the difficulty of comparing evaluations among different dataset versions (analogous to AudioSet).

The only "clean" solution, as you mentioned, would be to upload the original version with both MIDI and audio data. The audio files take ca. 25GB, which shouldn't be that much in terms of size, but as always there may be issues with intellectual property involving uploading music to the web. Do you know if that could be problematic?

In my case this is not such a big deal (I can do benchmarking with other datasets), so please feel free to close this issue if you wish.

Cheers and thanks again!
Andres

andres-fr commented 2 years ago

Hi again @qiuqiangkong! Did you have a chance to check the PR I made? I didn't see any contribution guidelines; if it is not welcome, I'm happy to withdraw it. Regarding the license, I didn't specify one, but whatever you prefer is fine by me.