SR-71-group / birdanalysis

science-camp-project for migration birds calls
Apache License 2.0

csv processing #2

Open sorokivski opened 2 months ago

sorokivski commented 2 months ago

DATA PREPROCESSING

Acquiring, cleaning, formatting, and analyzing the data so that it is usable for our downstream machine learning tasks.

1. Data Acquisition Step

The data acquisition process for this project involves extracting meaningful metadata from filenames and the corresponding audio files that capture bird migration calls. Here’s how we handle it:

Extract Metadata from Filenames:

Filenames contain encoded information such as location, frequency range, species identifier, and date/time. We parse them programmatically to extract this metadata.

Example Filename Format: 2459626.192622_Tautenburg___6589-9171kHz___10-10.9s___b.wav

Example Parsed Fields:

julian_date | location | low_freq | high_freq | start | end | species

GOAL: Automate the extraction of this metadata to store it in a structured format (CSV or database), which will be used for further analysis and model training.
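
A minimal sketch of how this extraction could look, assuming every filename follows the example format above exactly. The names `FILENAME_RE`, `parse_filename`, and `write_metadata_csv` are hypothetical and are not taken from the repository; the numeric values are stored as written in the filename, without unit conversion.

```python
import csv
import re

# Pattern derived from the single example filename in this issue (an assumption):
# <julian_date>_<location>___<low>-<high>kHz___<start>-<end>s___<species>.wav
FILENAME_RE = re.compile(
    r"^(?P<julian_date>\d+\.\d+)_(?P<location>[^_]+)___"
    r"(?P<low_freq>\d+(?:\.\d+)?)-(?P<high_freq>\d+(?:\.\d+)?)kHz___"
    r"(?P<start>\d+(?:\.\d+)?)-(?P<end>\d+(?:\.\d+)?)s___"
    r"(?P<species>[^_.]+)\.wav$"
)

FIELDS = ["julian_date", "location", "low_freq", "high_freq", "start", "end", "species"]
NUMERIC_FIELDS = ["julian_date", "low_freq", "high_freq", "start", "end"]


def parse_filename(name):
    """Return a dict of metadata fields, or None if the name does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    row = match.groupdict()
    for key in NUMERIC_FIELDS:
        row[key] = float(row[key])
    return row


def write_metadata_csv(filenames, csv_path):
    """Parse each filename and write the rows that match to a CSV file."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        for name in filenames:
            row = parse_filename(name)
            if row is not None:
                writer.writerow(row)
```

Calling `parse_filename` on the example filename yields `location == "Tautenburg"`, `low_freq == 6589.0`, `high_freq == 9171.0`, `start == 10.0`, `end == 10.9`, and `species == "b"`. Names that do not match return `None`, so they can be collected separately for the cleaning step.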

2. Data Cleaning

The data collected contains various issues that need to be addressed before it can be used for analysis. These issues include faulty or missing information, particularly errors in frequency data, misformatted filenames, and missing metadata.

Handle File Parsing Errors:

  1. Inconsistent filename separators (e.g., - instead of _) are handled by programmatically replacing incorrect characters.
  2. Missing or malformed parts of the filename (such as missing frequency data or species codes) are flagged for manual correction or filled with default values.

GOAL: Ensure all filenames adhere to a consistent structure and all necessary metadata is extracted and corrected.
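
The two handling steps above could be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: `normalize_separators` and `find_problems` are hypothetical names, and the sketch assumes that a hyphen between two digits is a legitimate range separator (as in `6589-9171kHz`) while any other hyphen is a mistyped underscore. Location names that genuinely contain hyphens would need a different rule.

```python
import re


def normalize_separators(name):
    """Replace hyphens used as field separators with underscores."""
    # Runs of two or more hyphens stand in for the triple-underscore separator.
    name = re.sub(r"-{2,}", "___", name)
    # A single hyphen that is not between two digits is treated as a stray "_".
    return re.sub(r"(?<!\d)-|-(?!\d)", "_", name)


def find_problems(name):
    """Return a list of human-readable problems; an empty list means the name looks fine."""
    problems = []
    if name.lower().endswith(".wav"):
        stem = name[:-4]
    else:
        stem = name
        problems.append("missing .wav extension")

    parts = stem.split("___")
    if len(parts) != 4:
        problems.append(f"expected 4 '___'-separated parts, found {len(parts)}")
        return problems

    head, freq, time_range, species = parts
    if not re.fullmatch(r"\d+\.\d+_[^_]+", head):
        problems.append("bad julian date or location")
    if not re.fullmatch(r"\d+(\.\d+)?-\d+(\.\d+)?kHz", freq):
        problems.append("bad frequency range")
    if not re.fullmatch(r"\d+(\.\d+)?-\d+(\.\d+)?s", time_range):
        problems.append("bad time range")
    if not species:
        problems.append("missing species code")
    return problems
```

Files for which `find_problems` returns a non-empty list after normalization can be written to a separate review list for manual correction, rather than being silently filled with defaults.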

mirahse commented 2 months ago

csv-results

This is an excerpt from the CSV file. The first columns were obtained from the filenames of the audio files. The last column (species) was obtained from the species code and the species mapping (see dataset.py).

The data still requires some cleaning.