Open sorokivski opened 2 months ago
This is an excerpt from the CSV file. The first columns were derived from the filenames of the audio files. The last column (species) is derived from the species code using the species mapping (see dataset.py).
The data still requires some cleaning.
DATA PREPROCESSING
Acquiring, cleaning, formatting, and analyzing the data to ensure it is usable for our downstream machine learning tasks.
1. Data Acquisition Step
The data acquisition process for this project involves extracting meaningful metadata from filenames and the corresponding audio files that capture bird migration calls. Here’s how we handle it:
Extract Metadata from Filenames:
Filenames contain encoded information, such as location, frequency range, species identifier, and date/time data. We parse them programmatically to extract the relevant metadata.
Example Filename Format:
2459626.192622_Tautenburg___6589-9171kHz___10-10.9s___b.wav
Example Parsed Fields:
GOAL: Automate the extraction of this metadata to store it in a structured format (CSV or database), which will be used for further analysis and model training.
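As a rough illustration of the parsing step, here is a minimal sketch that splits the example filename above into named fields with a regular expression. The field names (`timestamp`, `location`, `freq_min`, `freq_max`, `start_s`, `end_s`, `species_code`) and the function `parse_filename` are assumptions made for this example, not the names used in dataset.py. The layout is inferred from the single example filename, so other files may need a looser pattern.

```python
import re
from pathlib import Path

# Assumed layout, inferred from the one example filename:
# <timestamp>_<location>___<fmin>-<fmax>kHz___<start>-<end>s___<code>.wav
FILENAME_RE = re.compile(
    r"^(?P<timestamp>\d+(?:\.\d+)?)_(?P<location>[^_]+)"
    r"___(?P<freq_min>\d+(?:\.\d+)?)-(?P<freq_max>\d+(?:\.\d+)?)kHz"
    r"___(?P<start_s>\d+(?:\.\d+)?)-(?P<end_s>\d+(?:\.\d+)?)s"
    r"___(?P<species_code>[^_.]+)\.wav$"
)


def parse_filename(filename):
    """Return the metadata encoded in one filename as a dict.

    Raises ValueError if the name does not follow the assumed layout.
    """
    name = Path(filename).name  # drop any directory part
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected filename format: {name!r}")
    fields = match.groupdict()
    return {
        # Kept as a string: the issue does not say how this value is encoded.
        "timestamp": fields["timestamp"],
        "location": fields["location"],
        # Numbers are taken as written; the unit suffix is not interpreted.
        "freq_min": float(fields["freq_min"]),
        "freq_max": float(fields["freq_max"]),
        "start_s": float(fields["start_s"]),
        "end_s": float(fields["end_s"]),
        # Hypothetical name; the species mapping itself lives in dataset.py.
        "species_code": fields["species_code"],
    }


row = parse_filename(
    "2459626.192622_Tautenburg___6589-9171kHz___10-10.9s___b.wav"
)
print(row)
```

Each returned dict can serve as one row of the CSV, for example via `csv.DictWriter` or `pandas.DataFrame`.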
2. Data Cleaning
The data collected contains various issues that need to be addressed before it can be used for analysis. These issues include faulty or missing information, particularly errors in frequency data, misformatted filenames, and missing metadata.
Handle File Parsing Errors:
GOAL: Ensure all filenames adhere to a consistent structure and all necessary metadata is extracted and corrected.
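One possible way to approach the parsing errors is to separate filenames that follow the expected structure from those that do not, and to record why each faulty one was rejected. The sketch below assumes the same filename layout as the example in section 1. The helper names (`find_problems`, `split_clean_and_faulty`) and the specific checks (reversed frequency or time ranges) are illustrative assumptions; the actual cleaning rules still need to be agreed on.

```python
import re

# Assumed layout, inferred from the example filename in section 1.
PATTERN = re.compile(
    r"^(?P<timestamp>\d+(?:\.\d+)?)_(?P<location>[^_]+)"
    r"___(?P<freq_min>[\d.]+)-(?P<freq_max>[\d.]+)kHz"
    r"___(?P<start>[\d.]+)-(?P<end>[\d.]+)s"
    r"___(?P<code>[^_.]+)\.wav$"
)


def find_problems(filename):
    """Return a list of human-readable problems; empty if the name looks fine."""
    match = PATTERN.match(filename)
    if match is None:
        return ["filename does not match the expected pattern"]
    try:
        fmin, fmax = float(match["freq_min"]), float(match["freq_max"])
        start, end = float(match["start"]), float(match["end"])
    except ValueError:
        # e.g. "1.2.3" passes the loose pattern but is not a number
        return ["non-numeric frequency or time value"]
    problems = []
    if fmin >= fmax:
        problems.append("frequency range is empty or reversed")
    if start >= end:
        problems.append("time range is empty or reversed")
    return problems


def split_clean_and_faulty(filenames):
    """Split filenames into a clean list and a {name: problems} dict."""
    clean, faulty = [], {}
    for name in filenames:
        problems = find_problems(name)
        if problems:
            faulty[name] = problems
        else:
            clean.append(name)
    return clean, faulty


clean, faulty = split_clean_and_faulty([
    "2459626.192622_Tautenburg___6589-9171kHz___10-10.9s___b.wav",
    "2459626.192622_Tautenburg___9171-6589kHz___10-10.9s___b.wav",
    "broken.wav",
])
for name, problems in faulty.items():
    print(name, "->", "; ".join(problems))
```

Keeping the rejected names together with their reasons makes it easier to decide later whether a file can be corrected (for example by swapping a reversed range) or has to be dropped.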