Prediction model - Missing species

riyaeliza123 commented 4 months ago

Details: https://github.com/brahmwg/Bottlenecks_MDS_Capstone/blob/main/deliverables/species_prediction_model.md (the data that can be used is detailed at this link)

Data objective: Use tagging location data (plus any other required data)

Output objective: Understand missing species data

Output:

Suggested species with a probability
(Probably) a framework to integrate this predictive model and define the probability "cut-off" for accepting a suggested species

Example cut-off: genetic stock assignment, which uses a Bayesian-informed model, requires a minimum probability of 0.75 for a record to be included in analysis.
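A minimal sketch of how a "suggested species with a probability" output and a cut-off could fit together. The function name `suggest_species`, the species order, and the probabilities are made up for illustration; the 0.75 value is borrowed from the genetic stock assignment example and is only an assumed starting point for this model.

```python
import numpy as np

# Assumed cut-off, borrowed from the genetic stock assignment example (0.75).
CUTOFF = 0.75


def suggest_species(probs, classes, cutoff=CUTOFF):
    """Return (species, probability) per row; species is None below the cut-off."""
    probs = np.asarray(probs, dtype=float)
    best = probs.argmax(axis=1)                   # index of the most likely species
    best_p = probs[np.arange(len(probs)), best]   # probability of that species
    return [
        (classes[i] if p >= cutoff else None, float(p))
        for i, p in zip(best, best_p)
    ]


# Made-up probabilities for two species codes, for illustration only.
example = suggest_species([[0.90, 0.10], [0.60, 0.40]], classes=["co", "ck"])
print(example)  # [('co', 0.9), (None, 0.6)]
```

Rows that come back with `None` are the ones that would stay unassigned rather than being imputed.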

riyaeliza123 commented 4 months ago

Dataset: Tables to use: field, microtroll, cleaning. Some union/join from pit_tag.

Model: Decision tree (DT), deep learning (DL) model (more features)

Dashboard: Dash app

riyaeliza123 commented 4 months ago

When these conditions are added to the database query, the output for species is as follows:

when watershed IS NOT NULL -> species = co, rbt
when weight_g IS NOT NULL -> species = stl

riyaeliza123 commented 4 months ago

Took some dummy data, created a pipeline for a decision tree, and plotted the decision tree.
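For reference, a rough sketch of what such a dummy-data pipeline could look like with scikit-learn. This is not the notebook's code: the feature names, the dummy labelling rule, the imputation step, and `max_depth=2` are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier, export_text

# Dummy data (assumption): two numeric features and a made-up labelling rule.
rng = np.random.default_rng(0)
n = 200
fork_length_mm = rng.uniform(50, 150, n)
water_temp_start = rng.uniform(5, 20, n)
X = np.column_stack([fork_length_mm, water_temp_start])
y = np.where(fork_length_mm > 100, "ck", "co")

# Pipeline: fill missing feature values, then fit a shallow tree.
# No scaling step is included, so split thresholds stay in original units.
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    DecisionTreeClassifier(max_depth=2, random_state=0),
)
pipe.fit(X, y)

# Text rendering of the fitted tree.
tree = pipe.named_steps["decisiontreeclassifier"]
tree_text = export_text(tree, feature_names=["fork_length_mm", "water_temp_start"])
print(tree_text)
```

`sklearn.tree.plot_tree` draws the same tree as a matplotlib figure if a graphical plot is preferred.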

riyaeliza123 commented 4 months ago

The DL model using TensorFlow is complete, and predictions have been generated.

riyaeliza123 commented 4 months ago

The predictions, however, are decimal values rather than 0/1 labels:

array([[0.39008173, 0.0754076 ],
       [0.5425115 , 0.04349737],
       [0.22640109, 0.11200137],
       [0.24807785, 0.11598497],
       [0.28531218, 0.11743941]],

The format we want is:

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.]])

The two methods I think we can use to achieve this (and my reasoning for/against):

1. Convert all decimals to whole numbers (1 or 0) by rounding, i.e. thresholding at 0.5. (A plain int cast would truncate every value below 1 to 0.) A row is not guaranteed to get a label, since all of its values may become 0. This could be a good option to make sure that we are not naively assigning labels.
2. Convert the prediction to 1 and 0 per row: the highest value in each row is the label, and the predicted number itself is the confidence level. This guarantees that the model gives a label for every entry, but it may naively assign labels to all entries.

I think we should go with method 1. If no label is assigned, that means confidence is very low, and we output the result as "none" / "cannot determine".
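To make the trade-off concrete, here is a small NumPy sketch applying both methods to the five predictions shown above. The 0.5 threshold for the first method is an assumption about what "convert to int" is meant to achieve; variable names are illustrative.

```python
import numpy as np

# The five predictions quoted earlier in this thread.
pred = np.array([
    [0.39008173, 0.0754076],
    [0.5425115, 0.04349737],
    [0.22640109, 0.11200137],
    [0.24807785, 0.11598497],
    [0.28531218, 0.11743941],
])

# Method 1: threshold each value at 0.5 (assumed cut-off); a row may end up all 0.
method1 = (pred >= 0.5).astype(float)
undetermined = method1.sum(axis=1) == 0   # rows to report as "cannot determine"

# Method 2: the highest value in each row becomes the label,
# and that value is kept as the confidence level.
winner = pred.argmax(axis=1)
method2 = np.zeros_like(pred)
method2[np.arange(len(pred)), winner] = 1.0
confidence = pred.max(axis=1)

print(method1)
print(undetermined)
print(method2)
print(confidence)
```

On these rows, method 1 labels only the second entry and leaves the other four undetermined, while method 2 produces the all-`[1., 0.]` array shown as the wanted format, with confidences between roughly 0.23 and 0.54.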

riyaeliza123 commented 4 months ago

All files (notebook, data and Decision tree plot) have been pushed

riyaeliza123 commented 4 months ago

Remaining:

The data has to be discussed and updated
The final prediction method has to be discussed

riyaeliza123 commented 4 months ago

tasks:

[x] DT does not need "scaled" data - remove the scaling step so the tree's split thresholds stay in the original units and the DT is more interpretable
[x] Find optimum "depth" of tree using grid-search or any related technique
[ ] Make prediction interpretable
[x] Finalize dataset

Reminder: The goal is to be able to impute the data accurately. Interpretability is an appreciated extra.
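The tree-depth search from the task list above could be sketched with scikit-learn's `GridSearchCV` as follows. The dummy data, the candidate depths, 5-fold cross-validation, and accuracy as the score are all assumptions, not the project's actual settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Dummy data (assumption): two unscaled features and a made-up binary target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Try several tree depths and keep the one with the best cross-validated accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 4, 5, 6, 8, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`search.best_estimator_` is the tree refit on all of the data with the selected depth, so it can be used directly for prediction or plotting.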

riyaeliza123 commented 4 months ago

[x] Readjust data pipeline - Riya
[x] Code grid search (DT) - Riya
[ ] Feature selection code - Arturo
[x] Join temp data - Arturo
[x] Finalizing the data - Riya and Arturo (tomorrow)
[x] Create separate notebooks for DT and DL - Riya

riyaeliza123 commented 4 months ago

[x] Data done - temperature data - Arturo
[x] decide which columns to keep for the dataset - Riya, Arturo
[ ] deciding how to make prediction interpretable - Riya

riyaeliza123 commented 4 months ago

SELECT pit.tag_id_long,
       field.watershed,
       field.river,
       field.site,
       field.method,
       field.local,
       field.water_temp_start,
       field.species,
       field.fork_length_mm
FROM pit_tag pit
INNER JOIN field ON pit.tag_id_long = field.tag_id_long

https://marinescience.info/sqllab/?savedQueryId=51

Explore genetics_field also
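A quick way to check what the inner join in the query above keeps and drops, using an in-memory SQLite database from Python. The table layouts are cut down to a few columns and every row is made up; the real schema lives in the project database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified tables with made-up rows.
conn.executescript("""
    CREATE TABLE pit_tag (tag_id_long TEXT);
    CREATE TABLE field (tag_id_long TEXT, watershed TEXT, species TEXT, fork_length_mm REAL);
    INSERT INTO pit_tag VALUES ('A1'), ('A2'), ('A3');
    INSERT INTO field VALUES
        ('A1', 'w1', 'co', 95.0),
        ('A2', 'w1', NULL, 110.0),
        ('Z9', 'w2', 'ck', 80.0);
""")

# Same join condition as the query above, on the reduced column set.
rows = conn.execute("""
    SELECT pit.tag_id_long, field.watershed, field.species, field.fork_length_mm
    FROM pit_tag pit
    INNER JOIN field ON pit.tag_id_long = field.tag_id_long
    ORDER BY pit.tag_id_long
""").fetchall()
conn.close()

print(rows)  # [('A1', 'w1', 'co', 95.0), ('A2', 'w1', None, 110.0)]
```

Tag `A3` (no field record) and tag `Z9` (no pit_tag record) both drop out of the result, while `A2` stays with a NULL species, which is the kind of row the model is meant to impute.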

AReyH commented 4 months ago

species  count
rbt      979
ct       515
cm       10
co       27187
so       4
bt       77
stl      1765
ck       31810

riyaeliza123 commented 4 months ago

Questions to explore: (feature importance)

[x] Does river affect species?
[x] Does site affect species?
[x] Does method affect species?
[x] water_temp v/s species EDA
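One simple way to look at questions like "does river affect species?" is a normalized cross-tabulation. The sketch below uses pandas with made-up rows and placeholder river names; it is a descriptive check only, not a significance test and not the project's feature selection code.

```python
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "river":   ["r1", "r1", "r1", "r2", "r2", "r2"],
    "species": ["co", "co", "ck", "ck", "ck", "ck"],
})

# Share of each species within each river (each row sums to 1).
share = pd.crosstab(df["river"], df["species"], normalize="index")
print(share)
```

If the species shares differ a lot from one river to the next, river is likely to be a useful feature; the same call works with `site` or `method` in place of `river`.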

riyaeliza123 commented 3 months ago

SELECT field.watershed,
       field.river,
       field.site,
       field.method,
       field.local,
       field.water_temp_start,
       field.fork_length_mm,
       field.species
FROM field

brahmwg / Bottlenecks_MDS_Capstone

Prediction model - Missing species #4