SR-71-group / birdanalysis

Science camp project for migratory bird calls
Apache License 2.0

data preprocessing #3

Open sorokivski opened 1 month ago

sorokivski commented 1 month ago

1. Data Formatting

After the metadata extraction and cleanup, structure the data into the following format:

filename | loc | low_freq | high_freq | start | end | species | year | month | day | time | part_of_day

GOAL: Standardize all data into a clean, structured format that can be easily used in further analysis and machine learning tasks.
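As a rough illustration of the target format, here is a minimal sketch that builds one row with exactly the columns listed above. The `format_record` helper, the hour boundaries used for `part_of_day`, and the `HH:MM:SS` form of `time` are assumptions for illustration; the issue does not define them.

```python
from datetime import datetime

# Column order taken from the format described in this issue.
COLUMNS = ["filename", "loc", "low_freq", "high_freq", "start", "end",
           "species", "year", "month", "day", "time", "part_of_day"]


def part_of_day(hour: int) -> str:
    """Map an hour (0-23) to a label. Boundaries are hypothetical."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"


def format_record(filename, loc, low_freq, high_freq, start, end,
                  species, recorded_at: datetime) -> dict:
    """Build one row whose keys follow COLUMNS, in order."""
    values = [
        filename, loc, float(low_freq), float(high_freq),
        float(start), float(end), species,
        recorded_at.year, recorded_at.month, recorded_at.day,
        recorded_at.strftime("%H:%M:%S"),  # assumed time format
        part_of_day(recorded_at.hour),
    ]
    return dict(zip(COLUMNS, values))
```

Rows built this way can be written out with `csv.DictWriter(f, fieldnames=COLUMNS)` so that every file shares the same header.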

2. Data Analysis

Evaluate the quality and distribution of the collected data, identify outliers, and explore patterns before using the data for modeling.

Analyzing the Input Data:

  1. Outlier Detection: Identify any data points that are abnormally high or low in terms of frequency range or call duration. These outliers could indicate errors in the recording process or rare species behavior.

  2. Clustering: Group similar recordings based on frequency ranges, location, or species to better understand the relationships within the data. This step will help us determine if there are distinct patterns in the migration calls.

  3. Initial Visualization:

    • Create simple plots (e.g., histograms, scatter plots) to visualize:
        • Distribution of frequency ranges across species
        • Recording density by location and time of day
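One possible way to approach the outlier detection step is the interquartile range (IQR) rule, sketched below. The issue does not say which method will be used, so the IQR rule, the `iqr_outliers` name, and the multiplier `k=1.5` are assumptions. Call durations are taken as `end - start` from the formatted rows.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def call_durations(rows):
    """Duration of each call in the same unit as 'start' and 'end'."""
    return [r["end"] - r["start"] for r in rows]
```

The same function can be applied to `low_freq`, `high_freq`, or the bandwidth `high_freq - low_freq`. Flagged rows are worth inspecting by hand before removal, since they may be rare but valid calls.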

GOAL: Understand the overall structure and quality of the data through exploratory analysis, identify any issues or anomalies, and ensure the dataset is suitable for model training.
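Before plotting, it can help to tabulate the numbers the plots would show. The sketch below counts recordings per location and part of day, and collects the overall frequency span per species. Both helper names are hypothetical, and the rows are assumed to be dictionaries keyed by the column names from section 1.

```python
from collections import Counter


def recording_density(rows):
    """Count recordings per (loc, part_of_day) pair."""
    return Counter((r["loc"], r["part_of_day"]) for r in rows)


def frequency_span_by_species(rows):
    """Return {species: (lowest low_freq, highest high_freq)}."""
    spans = {}
    for r in rows:
        lo, hi = spans.get(r["species"], (r["low_freq"], r["high_freq"]))
        spans[r["species"]] = (min(lo, r["low_freq"]),
                               max(hi, r["high_freq"]))
    return spans
```

The resulting counts and spans can be passed to any plotting library as bar heights or range bars.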

KianTavakoli commented 1 month ago

[Image]

The absence of data in certain parts of the matrix is because we only have a training file available and no test file.

mirahse commented 1 month ago

following up on that, this is a list of eliminated species codes:

["mh", "ms", "re", "st", "to", "tp", "wa", "wg", "wn", "na", "00", "01", "02", "03", "04", "05"]

updated list with ALL species that only appear once:

eliminated_species_codes = ["gp", "zd", "cr", "gl", "bl", "ge", "lg", "q.", "hm", "gp"]
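For reference, a minimal sketch of how such lists could be derived and applied, assuming each row is a dictionary with a `species` key as in the format from section 1. The function names are hypothetical and this is not necessarily how the lists above were produced.

```python
from collections import Counter


def species_appearing_once(rows):
    """Return the sorted species codes that occur in exactly one row."""
    counts = Counter(r["species"] for r in rows)
    return sorted(code for code, n in counts.items() if n == 1)


def drop_species(rows, eliminated_codes):
    """Keep only rows whose species is not in eliminated_codes."""
    eliminated = set(eliminated_codes)  # repeated codes collapse here
    return [r for r in rows if r["species"] not in eliminated]
```

Because the codes are converted to a set, a code listed more than once has no extra effect on the filtering.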