Nachtzuster / BirdNET-Pi

A realtime acoustic bird classification system for the Raspberry Pi 5, 4B, 3B+, 0W2 and more. Built on the TFLite version of BirdNET.
https://birdnetpi.com

Feat : limit the number of audio files to keep per species #121

Closed by alexbelgium 2 months ago

alexbelgium commented 3 months ago

Keep only xx files per species (https://github.com/Nachtzuster/BirdNET-Pi/discussions/102). This is especially useful for non-dedicated systems such as VMs, add-ons, or Docker containers, where we don't necessarily want to fill up the whole disk.

alexbelgium commented 3 months ago

This should also help the backup settings by keeping disk usage at more reasonable values, knowing that I don't see a normal use case where people would need more than 1k mp3 files of a bird... I had 47k for a single bird.

alexbelgium commented 3 months ago

Ok I think everything is good! Every night at 2am it will run the script to keep at most x files per species (the default is to keep everything; the value can be set in the app settings).

Deletion priority:

Protected from deletions:
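
For illustration, the nightly run could correspond to a crontab entry along these lines. This is only a sketch: the user and absolute path are placeholders, and only the script name (disk_species_clean.sh, mentioned later in this thread) and the 2am schedule come from the discussion.

# Hypothetical /etc/crontab entry for the nightly cleanup; user and path are placeholders
    0 2 * * * pi bash /home/pi/BirdNET-Pi/scripts/disk_species_clean.sh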

alexbelgium commented 3 months ago

Didn't see there was already one in advanced, I'll move the settings there

alexbelgium commented 3 months ago

In theory everything should work :-)

Nachtzuster commented 3 months ago

should we really make max_disk_usage configurable? I do not see a sane use-case for that.

alexbelgium commented 3 months ago

Well I don't use it, it was more a matter of opportunity ;) I was thinking that it could be useful for people on Docker systems or VMs who would want to limit usage to, say, 10% of storage or something like that...

But then I can remove it from the PR since, in the end, the max number of files per species will give a similar result in a more elegant manner, and it avoids complicating the config.

Thanks for the review!

alexbelgium commented 3 months ago

max_disk_usage is removed; now the PR "only" allows configuring, installing and executing a cron job to optionally keep x files per species.

alexbelgium commented 2 months ago

Hi, thanks for the review!

The find command is written that way to be efficient: by using a single pipe instead of temporary storage, it can process a very large number of entries quickly. It took around 30s to handle around 149k files, while other methods I tried (storing sorted data in a temporary file, using a subshell...) took more than 1m30 for around 30k files. So I didn't do a head-to-head comparison, but this code should handle around 5k observations/second (≈149k files / 30s), compared to roughly 0.3k observations/second (≈30k files / 90s) for the split approach. Of course if you see a better way I would be very interested!

I was thinking that users could have up to 120 GB of data to clean if they want to remove files from a dedicated RPi, which would mean an enormous number of files...

Here is how the find works:

# 1. LISTING ALL AUDIO
##################
# In the base folders (corresponding to BirdSongs/By_date)
# Look for all folders that have the correct species name, whatever the date
# Look for files that have the correct format (containing a date) and that have an extension
    find */"$species" -type f -name "*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]*.*" \

# EXCLUDE FILES THAT ARE PROTECTED
##############################
# That are not *.png (as the objective is to limit the number of audio files)
        -not -name "*.png" \
# That were not recorded in the past 7 days (= that don't contain a date from the past 7 days). $dateformat is configured as a separate variable, as Ubuntu accepts "5 days" while Alpine accepts only "5"
        -not -name "*$(date -d "-7$dateformat" '+%Y-%m-%d')*" \
        -not -name "*$(date -d "-6$dateformat" '+%Y-%m-%d')*" \
        -not -name "*$(date -d "-5$dateformat" '+%Y-%m-%d')*" \
        -not -name "*$(date -d "-4$dateformat" '+%Y-%m-%d')*" \
        -not -name "*$(date -d "-3$dateformat" '+%Y-%m-%d')*" \
        -not -name "*$(date -d "-2$dateformat" '+%Y-%m-%d')*" \
        -not -name "*$(date -d "-1$dateformat" '+%Y-%m-%d')*" \
        -not -name "*$(date '+%Y-%m-%d')*" |
# That are not included in the file disk_check_exclude.txt that lists files protected from purge
        grep -vFf "$HOME/BirdNET-Pi/scripts/disk_check_exclude.txt" |

# SORTING THE FILES (example 2024-01-01/Bird name/Bird_name=name2-73-2024-07-06-birdnet-RTSP_1-18:19:08.mp3)
################
# If the species name has a "-" in it, it must be converted to "=" to ensure that the filename always has the same number of "-"-separated fields
        sed "s|$species|$species_san|g" |
# Sort by confidence level (field 4 separated by -)
# Sort by date (1 for year, 2 for month, 3 for days)
        sort -t'-' -k4,4nr -k1,1nr -k2,2nr -k3,3nr |
# REMOVING UNWANTED FILES
########################
# Skip the first x lines, i.e. the files best matching the confidence + age criteria; these are the files to keep (in addition to protected files), and everything after them is passed on for deletion
        tail -n +"$((max_files_species + 1))" |
# Rename species that had a "=" in their name back to "-" (we no longer need "-"-separated fields)
        sed "s|$species_san|$species|g" |
# Duplicate each line and append .png to the copy, so the linked png is removed too
        sed 'p; s/\(\.[^.]*\)$/\1.png/' |
# This prepends a dummy "temp" entry, so that sudo rm always has at least one argument and does not hang
        awk 'BEGIN{print "temp"} {print}' |
# Delete files, then once all files are deleted echo the number of remaining files
        xargs sudo rm && echo "success ($(find */"$species" -type f -name "*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]*.*" \
        -not -name "*.png" | wc -l)) remaining" || echo "failed ($?)"

alexbelgium commented 2 months ago

Updated per your comments (thanks!):

Nachtzuster commented 2 months ago

Great explanation, thanks. We should indeed take into account large collections of data. Since this script is to be run through cron, I would not be too worried about runtime (within reasonable bounds). I would be more worried about memory use, but memory usage for this script is probably not too bad either: /usr/bin/time -v ./scripts/disk_species_clean.sh gives me Maximum resident set size (kbytes): 9848 for a 90k-file dataset.

That being said, I would still prefer a simpler script, but with the explanation, we could still revisit if it turns out to be an issue.

Also (see attached screenshot): those () should not be there, I think.

alexbelgium commented 2 months ago

Thanks for your input - the only alternative I had tested was using an intermediate txt file to store the find output and sorting with awk, but it was really slow...

Thanks for pointing out the update_snippets*.sh error, I've corrected it!

Nachtzuster commented 2 months ago

I'm getting: (screenshot attached)

I'm sure we're almost there! Could you just check & add the one disk_species_clean.sh line in /etc/crontab?

BTW, VMs, and especially clones of ready-to-go VMs, are a great way to test upgrade scripts.

alexbelgium commented 2 months ago

Thanks, indeed my cron update script incorrectly assumed that $my_dir referred to the same location in update_snippets*.sh and in install_birdnet.sh... Pretty confusing actually. Referring to file locations using $HOME indeed looks like a much better practice for homogeneity in code and logic.

I've kept the approach of completely removing all BirdNET-specific cron entries and reinstalling them all again... I saw that as more resilient, in the sense that it re-adds any cron entry mistakenly removed by the user and restores a clean state. If you prefer, I can of course use a grep command to simply check for the presence of disk_species_clean.sh and add that specific line, along the lines of the sketch below. Please tell me which logic to keep in the final version.
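
A minimal sketch of what that grep-based alternative could look like; the crontab path and the exact cron line are assumptions, only the script name comes from this thread.

# Hypothetical idempotent crontab update: append the cleanup line only if it is missing
    cron_line='0 2 * * * pi bash /home/pi/BirdNET-Pi/scripts/disk_species_clean.sh'
    if ! grep -q 'disk_species_clean.sh' /etc/crontab; then
        echo "$cron_line" | sudo tee -a /etc/crontab >/dev/null
    fi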

Thanks for the point about VMs; indeed, I test the updates by curl-ing the new files onto my add-on, but this means that I miss things like +x permissions and the update_snippets check. I'll see how to adapt my way of working to your proposal of ready-to-use BirdNET-Pi images.

Thanks!

Nachtzuster commented 2 months ago

I'll see how to adapt my way of working to your proposal of ready-to-use BirdNET-Pi images.

To be clear: I'm running x86_64 VMs on my x86_64 development machine. Being able to do exactly this was the main reason for me to get BirdNET-Pi running transparently on x86_64.