This should also help with the backup settings by limiting disk usage to more normal values, knowing that I don't see a normal use case where people would need more than 1k mp3 files of a bird... I had 47k for a single bird.
Ok I think everything is good! Every night at 2am it will run the script to keep at most x files per species (the default keeps everything; the value can be set in the app settings).
Deletion priority: lowest confidence first, then oldest date (this is the sort order in the breakdown below).
Protected from deletion: files from the past 7 days, .png files, and anything listed in disk_check_exclude.txt.
Didn't see there was already one in advanced, I'll move the settings there
In theory everything should work :-)
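For reference, the /etc/crontab entry it installs could look roughly like this (a sketch, not the exact line from the PR; the pi user and path are assumptions, and #birdnet is the tag used further down to manage BirdNET-Pi cron entries):
# m h dom mon dow user command
0 2 * * *   pi   /home/pi/BirdNET-Pi/scripts/disk_species_clean.sh #birdnet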
should we really make max_disk_usage configurable? I do not see a sane use-case for that.
Well I don't use it, it was more a matter of opportunity ;) I was thinking that it could be useful for people on Docker systems or VMs who would want to limit it to, say, 10% of storage or something like that...
But then I can remove it from the PR, as in the end the max number of files per species will give a similar result in a more elegant manner, and it avoids complicating the config.
Thanks for the review!
max_disk_usage is removed; now the PR "only" allows configuring, installing and executing a cron job to optionally keep x files per species.
Hi, thanks for the review!
The find command is written that way to be efficient: by using a single pipe instead of temporary storage, it can process a very large number of files quickly. It took around 30s to handle around 149k files, while the other methods I tried (storing sorted data in a temporary file; using a subshell...) took more than 1m30s for around 30k files. So I didn't do a head-to-head comparison, but this code should handle around 5k observations/second, compared to 0.3k observations/second for a split-up approach. Of course, if you see a better way I would be very interested!
I was thinking that users could have up to 120 GB of data to clean if they want to remove files from a dedicated RPi, which would mean an enormous number of files...
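For contrast, a temp-file variant of the same logic would look roughly like this (a sketch, not the exact code that was benchmarked; the per-file sudo+rm spawns and the intermediate files are where the time goes):
# Slower approach: intermediate files plus one rm invocation per file
find */"$species" -type f -name "*.mp3" > /tmp/species_files.txt
sort -t'-' -k4,4nr -k1,1nr -k2,2nr -k3,3nr /tmp/species_files.txt > /tmp/species_sorted.txt
tail -n +"$((max_files_species + 1))" /tmp/species_sorted.txt | while read -r f; do
    sudo rm "$f"    # spawning sudo+rm once per file dominates the runtime
done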
Here is how the find works:
# 1. LISTING ALL AUDIO
##################
# In the base folders (which correspond to BirdSongs/By_date)
# Look for all folders that have the correct species name, whatever the date
# Look for files that have the correct format (containing a date) and that have an extension
find */"$species" -type f -name "*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]*.*" \
# 2. EXCLUDE FILES THAT ARE PROTECTED
##############################
# That are not *.png (as the objective is to limit the number of audio files)
-not -name "*.png" \
# That were not taken in the past 7 days (= that don't contain a date from the past 7 days). $dateformat is configured as a separate variable, as Ubuntu accepts "5 days" while Alpine accepts only "5"
-not -name "*$(date -d "-7$dateformat" '+%Y-%m-%d')*" \
-not -name "*$(date -d "-6$dateformat" '+%Y-%m-%d')*" \
-not -name "*$(date -d "-5$dateformat" '+%Y-%m-%d')*" \
-not -name "*$(date -d "-4$dateformat" '+%Y-%m-%d')*" \
-not -name "*$(date -d "-3$dateformat" '+%Y-%m-%d')*" \
-not -name "*$(date -d "-2$dateformat" '+%Y-%m-%d')*" \
-not -name "*$(date -d "-1$dateformat" '+%Y-%m-%d')*" \
-not -name "*$(date '+%Y-%m-%d')*" |
# That are not included in the file disk_check_exclude.txt that lists files protected from purge
grep -vFf "$HOME/BirdNET-Pi/scripts/disk_check_exclude.txt" |
# 3. SORTING THE FILES (example: 2024-01-01/Bird name/Bird_name=name2-73-2024-07-06-birdnet-RTSP_1-18:19:08.mp3)
################
# If the species name has a "-" in it, it must be converted to "=" to ensure that we always have the same number of "-"-separated fields in the filename
sed "s|$species|$species_san|g" |
# Sort by confidence level (field 4, "-"-separated),
# then by date (field 1 = year, field 2 = month, field 3 = day)
sort -t'-' -k4,4nr -k1,1nr -k2,2nr -k3,3nr |
# 4. REMOVING UNWANTED FILES
########################
# Skip the top x files (those best matching the confidence + age criteria); these are the files to keep (in addition to the protected files). Everything after them is passed on for deletion
tail -n +"$((max_files_species + 1))" |
# Rename species that had a "=" in their name back to "-" (we no longer need the "-"-separated fields)
sed "s|$species_san|$species|g" |
# Duplicate each line, appending .png to the copy, so the linked png is removed as well
sed 'p; s/\(\.[^.]*\)$/\1.png/' |
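# Example (with the filename from above, after the back-conversion):
#   Bird_name-name2-73-2024-07-06-birdnet-RTSP_1-18:19:08.mp3
# becomes two lines: the .mp3 itself, plus the same path with ".png" appended
# (i.e. ...18:19:08.mp3.png, the linked spectrogram)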
# This prepends a fake "temp" entry, so that sudo rm always has at least one argument and does not hang
awk 'BEGIN{print "temp"} {print}' |
# Delete files, then once all files are deleted echo the number of remaining files
xargs sudo rm && echo "success ($(find */"$species" -type f -name "*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]*.*" \
-not -name "*.png" | wc -l)) remaining" || echo "failed ($?)"
Updates regarding your comments (thanks!):
"Finally, the find construct looks pretty terrifying 😄 I would like that to be more understandable and debuggable. Do you think you could maybe split it up?"
I've described the breakdown of each line above; it's written that way for speed and efficiency... would it be better (and as fast) in another language like Python?
"The script is missing the executable bit"
Thanks!
"you'll probably need some logic in update_snippets to update the cron job"
Done. I've set it up in a way that can be reused if we update the cron again in the future, by looking at the number of #birdnet occurrences.
"I tried to split off a bit of the settings page text since some explanation goes for the purge mechanism too"
Thanks!

Great explanation, thanks.
We should indeed take into account large collections of data.
Since this script is to be run through cron, I would not be too worried about runtime though (within reasonable bounds).
I would be more worried about memory use, but memory usage for this script is probably not too bad either:
/usr/bin/time -v ./scripts/disk_species_clean.sh
gives me Maximum resident set size (kbytes): 9848
for a 90k dataset
That being said, I would still prefer a simpler script, but with the explanation, we could still revisit if it turns out to be an issue.
also: those () should not be there, I think
Thanks for your input - the only alternative I had tested was using an intermediate txt file to store the find output and sorting with awk, but it was really slow...
Thanks for spotting the update_snippets.sh error, I've corrected it!
I'm getting:
I'm sure we're almost there! Could you just check & add the one disk_species_clean.sh line in /etc/crontab?
BTW, VMs and especially clones of ready-to-go VMs are a great way to test upgrade scripts.
Thanks, indeed my cron update script incorrectly assumed that $my_dir referred to the same location in update_snippets*.sh and in install_birdnet.sh... Pretty confusing actually. Referring to file locations using $HOME indeed looks like a much better practice for homogeneity in code & logic.
I've still gone with the complete removal of all birdnet-specific cron entries, followed by reinstalling all of them... I saw that as more resilient, in the sense that it would re-add any cron entry mistakenly removed by the user and restore a clean state. If you prefer, of course, I can use a grep command to simply check for the presence of disk_species_clean.sh and add the specific line. Please tell me which logic to keep in the final version.
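For illustration, the grep-based alternative could look roughly like this (a sketch, not code from the PR; the schedule and the #birdnet tag follow the conventions used above):
# Only add the cron line if it is not already present (idempotent)
if ! grep -q 'disk_species_clean.sh' /etc/crontab; then
    echo "0 2 * * * $USER $HOME/BirdNET-Pi/scripts/disk_species_clean.sh #birdnet" |
        sudo tee -a /etc/crontab > /dev/null
fi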
Thanks for the point about VMs; indeed I test the updates by curl-ing the new files onto my add-on, but this means that I miss things like +x permissions and the update_snippets check. I'll see how to modify my way of working, and your proposal of ready-to-use B-Pi images.
Thanks!
"I'll see how to modify my way of working, and your proposal of ready-to-use B-Pi images."
To be clear: I'm running x86_64 VMs on my x86_64 development machine. Being able to do exactly this was the main reason for me to get BirdNET-Pi transparently running on x86_64.
Keep only xx files per species (https://github.com/Nachtzuster/BirdNET-Pi/discussions/102). This is especially useful for non-dedicated systems such as VMs, add-ons, or Docker containers, where we don't necessarily want to fill up the whole space.