brahmwg / Bottlenecks_MDS_Capstone

Master of Data Science Capstone Project for Bottlenecks to Survival

Prediction model - Missing species #4

Closed riyaeliza123 closed 2 months ago

riyaeliza123 commented 4 months ago

Details : https://github.com/brahmwg/Bottlenecks_MDS_Capstone/blob/main/deliverables/species_prediction_model.md (Data that can be used is detailed above)

Data objective: Use tagging location data (plus any other required data)

Output objective: Understand missing species data

Output:

  1. Suggested species with a probability
  2. (Probably) Develop a framework to integrate this predictive model and define the probability "cut-off"

Example output: For genetic stock assignment, which uses a Bayesian-informed model, a minimum probability of 0.75 must be met for inclusion in the analysis.
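A minimal sketch of how a suggested species, its probability, and a cut-off could fit together. The function name `suggest_species`, the probabilities in the usage lines, and the reuse of the 0.75 value from genetic stock assignment as a default are all assumptions for illustration, not decisions made in this issue.

```python
from typing import Mapping, Optional, Tuple


def suggest_species(
    probabilities: Mapping[str, float], cutoff: float = 0.75
) -> Tuple[Optional[str], float]:
    """Return (species, probability) for the most likely species.

    The species is None when the best probability is below the cutoff,
    i.e. the record stays "cannot determine".
    """
    species, prob = max(probabilities.items(), key=lambda item: item[1])
    if prob < cutoff:
        return None, prob
    return species, prob


# Hypothetical model outputs, keyed by species code.
print(suggest_species({"ck": 0.81, "co": 0.15, "stl": 0.04}))  # ('ck', 0.81)
print(suggest_species({"ck": 0.55, "co": 0.40, "stl": 0.05}))  # (None, 0.55)
```

Returning the probability even when no species is suggested keeps the information needed to tune the cut-off later.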

riyaeliza123 commented 4 months ago

Dataset: Tables to use: field, microtroll, cleaning. Some union/join from pit_tag.

Model: Decision tree, DL model (more features)

Dashboard: Dash app

riyaeliza123 commented 4 months ago

When these conditions are added to the database, the output for species is as follows:

riyaeliza123 commented 4 months ago

Took some dummy data, created a pipeline for a decision tree, and plotted the decision tree.

riyaeliza123 commented 4 months ago

The DL model using TensorFlow is complete, and predictions have been generated.

riyaeliza123 commented 4 months ago

The prediction, however, is in decimal format.

array([[0.39008173, 0.0754076 ],
       [0.5425115 , 0.04349737],
       [0.22640109, 0.11200137],
       [0.24807785, 0.11598497],
       [0.28531218, 0.11743941]],

The format we want is

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.]])

The 2 methods I think we can use to achieve this (and my reasoning for/against):

  1. Convert each value to a whole number (1 or 0) by rounding. A row may end up with no label at all (every value may become 0). This could be a good option because it avoids naively assigning labels.
  2. For each row, take the highest value as the label and use that predicted value as the confidence level. This guarantees a label for every entry, but it may naively assign labels to all entries.

I think we should go with option 1. If no label is assigned, the confidence is very low and we output the result as "none" / "cannot determine".
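The two options can be compared on the five rows posted above. This is only a sketch using NumPy: option 1 is implemented as a threshold at 0.5 (an assumption about what "convert to int" should mean, since truncating values below 1 always gives 0), and the function names are hypothetical.

```python
import numpy as np

# The five prediction rows posted above (two output columns).
probs = np.array([
    [0.39008173, 0.0754076],
    [0.5425115, 0.04349737],
    [0.22640109, 0.11200137],
    [0.24807785, 0.11598497],
    [0.28531218, 0.11743941],
])


def round_labels(probs, cutoff=0.5):
    """Option 1: values at or above the cutoff become 1, the rest 0."""
    return (probs >= cutoff).astype(float)


def argmax_labels(probs):
    """Option 2: the largest value in each row becomes 1.

    Also returns that largest value as the confidence for the row.
    """
    labels = np.zeros_like(probs)
    labels[np.arange(probs.shape[0]), probs.argmax(axis=1)] = 1.0
    return labels, probs.max(axis=1)


option1 = round_labels(probs)
option2, confidence = argmax_labels(probs)

# Rows where option 1 assigns no label at all ("none" / "cannot determine").
undetermined = option1.sum(axis=1) == 0

print(option1)
print(option2)
print(undetermined)
```

On these rows, option 1 labels only the second row and leaves the other four undetermined, while option 2 labels every row with the first column, matching the wanted format.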

riyaeliza123 commented 4 months ago

All files (notebook, data, and decision tree plot) have been pushed.

riyaeliza123 commented 4 months ago

Remaining:

  1. The data has to be discussed and updated
  2. The final prediction method has to be discussed

riyaeliza123 commented 4 months ago

Tasks:

Reminder: The goal is to accurately impute data. Interpretability is an appreciated extra.

riyaeliza123 commented 4 months ago

SELECT pit.tag_id_long,
       field.watershed,
       field.river,
       field.site,
       field.method,
       field.local,
       field.water_temp_start,
       field.species,
       field.fork_length_mm
FROM pit_tag pit
INNER JOIN field ON pit.tag_id_long = field.tag_id_long

https://marinescience.info/sqllab/?savedQueryId=51

Also explore genetics_field.

AReyH commented 4 months ago

species  count
rbt        979
ct         515
cm          10
co       27187
so           4
bt          77
stl       1765
ck       31810
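
To see how uneven these counts are, the share of each species code can be computed directly. This is a small sketch using only the numbers posted above; the variable names are arbitrary.

```python
# Species counts as posted above.
counts = {
    "rbt": 979, "ct": 515, "cm": 10, "co": 27187,
    "so": 4, "bt": 77, "stl": 1765, "ck": 31810,
}

total = sum(counts.values())
shares = {species: n / total for species, n in counts.items()}

# Print species from most to least common with their share of all records.
for species, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{species:>4} {counts[species]:>6} {share:7.2%}")
```

ck and co together account for roughly 95% of these records, so the rarer codes (so, cm, bt) have very few examples to learn from.
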
riyaeliza123 commented 4 months ago

Questions to explore: (feature importance)

riyaeliza123 commented 3 months ago

SELECT field.watershed,
       field.river,
       field.site,
       field.method,
       field.local,
       field.water_temp_start,
       field.fork_length_mm,
       field.species
FROM field