Open sorokivski opened 1 month ago
Some parts of the matrix are empty because we only have a file available for training and none for testing.
Following up on that, this is the list of eliminated species codes:
["mh", "ms", "re", "st", "to", "tp", "wa", "wg", "wn", "na", "00", "01", "02", "03", "04", "05"]
Updated list with ALL species that only appear once: eliminated_species_codes = ["gp", "zd", "cr", "gl", "bl", "ge", "lg", "q.", "hm", "gp"]
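A minimal sketch of how such a list could be applied and re-derived, assuming each recording is a dict with a `species` key. The field names, the helper functions, and the sample records (including the code "xx") are hypothetical and only serve as illustration:

```python
from collections import Counter

# List copied from the comment above; converting it to a set collapses
# the repeated "gp" entry and makes membership checks fast.
eliminated_species_codes = set(
    ["gp", "zd", "cr", "gl", "bl", "ge", "lg", "q.", "hm", "gp"]
)


def drop_eliminated(records, eliminated=eliminated_species_codes):
    """Return only the records whose species code is not eliminated."""
    return [r for r in records if r["species"] not in eliminated]


def singleton_species(records):
    """Return the sorted species codes that appear exactly once."""
    counts = Counter(r["species"] for r in records)
    return sorted(code for code, n in counts.items() if n == 1)


# Hypothetical sample records, not taken from the real dataset.
records = [
    {"species": "gp", "file": "a.wav"},
    {"species": "xx", "file": "b.wav"},
    {"species": "xx", "file": "c.wav"},
]

kept = drop_eliminated(records)
print([r["file"] for r in kept])
print(singleton_species(records))
```

Recomputing the singleton list with something like `singleton_species` whenever the dataset changes would keep the hard-coded list from going stale.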
1. Data Formatting
After the metadata extraction and cleanup, structure the data into the following format:
GOAL: Standardize all data into a clean, structured format that can be easily used in further analysis and machine learning tasks.
2. Data Analysis
Evaluate the quality and distribution of the collected data, identify outliers, and explore patterns before using it for modeling.
Analyzing the Input Data:
Outlier Detection: Identify any data points whose frequency range or call duration is abnormally high or low. These outliers could indicate errors in the recording process or rare species behaviors.
Clustering: Group similar recordings based on frequency ranges, location, or species to better understand the relationships within the data. This step will help us determine if there are distinct patterns in the migration calls.
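The two steps above could be sketched as follows. This is only an illustration under stated assumptions: the 1.5 × IQR rule is one common choice for flagging outliers, not necessarily the one used in this project, and the grouping shown is a plain per-species average rather than a real clustering algorithm. The field names (`species`, `min_freq_hz`), the species codes, and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def mean_by_group(records, key, field):
    """Group records by `key` and average `field` within each group."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r[field])
    return {name: mean(vals) for name, vals in groups.items()}


# Hypothetical call durations in seconds; 9.5 is an obvious outlier.
durations = [0.9, 1.0, 1.1, 1.0, 1.2, 0.8, 9.5]
print(iqr_outliers(durations))

# Hypothetical recordings with a minimum call frequency in Hz.
recordings = [
    {"species": "aa", "min_freq_hz": 2000.0},
    {"species": "aa", "min_freq_hz": 3000.0},
    {"species": "bb", "min_freq_hz": 6000.0},
]
print(mean_by_group(recordings, "species", "min_freq_hz"))
```

Flagged values are worth inspecting by hand before removal, since an unusual value may be a rare species behavior rather than a recording error.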
Initial Visualization:
GOAL: Understand the overall structure and quality of the data through exploratory analysis, identify any issues or anomalies, and ensure the dataset is suitable for model training.