fastaudio / fastai2_audio

[DEPRECATED] Audio Module for fastai v2
Apache License 2.0
65 stars 15 forks source link

Git Hook to remove outputs containing audio #21

Open mogwai opened 4 years ago

mogwai commented 4 years ago

To reduce the repo size and make working with reviewnb better

PranY commented 4 years ago

Hi, I'm trying to fully understand this issue in my attempt to help. If nbdev is used in this library then installing git hooks using nbdev_install_git_hooks() does exactly what we need?

Is this issue pointing to having an option to remove outputs via a function call, or is it about adding something like nbdev or nbstripout in the library?

If I don't make any sense, please correct me.

mogwai commented 4 years ago

The issue is that he nbdev_clean command doesn't strip the outputs. If for example you have a notebook that has a 1 hour audio preview in it then in the ipynb file, you will have that base64 encoded data in the file. This leads to incompatibility with reviewnb etc.

PranY commented 4 years ago

I think I'm still not sure I understand it completely. Below is a snapshot of my test, look at the file size. image

I also opened that notebook in jupyter and checked, all outputs were cleared.

If we only want to delete the audio-data from the output and keep the rest, then that's a bit challenging but doable, I guess. The source code for nbdev_clean_nbs calls clean_nb for the file(s) provided which in turn calls clean_cell for all cells. We can agree on a qualifier #something for cells and write an extension to 'nbdev_clean_nbs`.

Please help me understand it better.

mogwai commented 4 years ago

That's exactly the problem. nbdev_clean_nbs without --clear_all True will keep the outputs but remove other information that isn't needed. I haven't dug into what it does exactly but I'll try to soon. The idea I had was to parse the html output and remove the data in the src attribute from the audio tags if it is bigger than a certain size.

I'd say leave this issue to me to sort out but thanks for taking an interest in it.

scart97 commented 4 years ago

@mogwai Is this necessary now that the notebooks are small ?

mogwai commented 4 years ago

Well the problem is that because the notebooks were large before, the git history is tainted with the data in order to make sure that you can checkout previous commits. This means that a git clone will still take a a lot of time.

The whole point of this hook is to prevent contributors from accidentally adding audio data into the notebooks as we essentially can't remove it easily once its tracked.

I'm happy to drop the issue for now but I think it would be useful to explore creating this at some point in the future.

What are you're thought @scart97?