Clemens-Bretscher / cassava-classification-capstone

MIT License

cassava Source: www.wikipedia.org

Cassava Disease Classification

Algorithm: Deep Learning - Neural Networks

Table of Contents:

  1. Brief Introduction to:
    + Cassava Diseases
    + Stakeholders
  2. Hypothesis and Model Building
  3. Exploratory Data Analysis
  4. Baseline Model Selection
  5. Model Training and Validation

Cassava Plant

As the second-largest provider of carbohydrates in Africa, cassava is a key food security crop grown by smallholder farmers because it can withstand harsh conditions. At least 80 percent of household farms in Sub-Saharan Africa grow this starchy root, but viral diseases are major sources of poor yields. With the help of data science, it may be possible to identify common diseases so they can be treated.

Existing methods of disease detection require farmers to solicit the help of government-funded agricultural experts, who visually inspect and diagnose the plants. This approach is labor-intensive and costly, and the experts are in short supply. As an added challenge, effective solutions must perform well under significant constraints, since African farmers may only have access to mobile-quality cameras and low-bandwidth connections.

Cassava Diseases

Major Diseases:

There are about four known diseases of the cassava plant. Among them, CMD is the most prevalent:

  1. CBB: Cassava Bacterial Blight
  2. CBSD: Cassava Brown Streak Disease
  3. CGM: Cassava Green Mottle
  4. CMD: Cassava Mosaic Disease

Stakeholders

[Our Stakeholder:]("Beautifull_soup.ipynb")

We selected the Ministry of Agriculture of Uganda as our stakeholder because it has a direct relationship with the farmers through its extension workers and agricultural experts.

Overview of Uganda:

Location and Population: Uganda is a landlocked nation in East Africa with a population of about 20 million.

Arable land: Over 25 percent of the land is considered suitable for agriculture, which is much higher than the average for sub-Saharan Africa (6.4 percent).

GDP: Agriculture accounts for more than 60 percent of GDP, 98 percent of export earnings and over 40 percent of government revenue.

Farming and Income: Farming is labour-intensive, with women and children providing 60–80 percent of the labour. Crops are cultivated both as cash crops and as food security crops.

Business and Data Models

Business Model:

Target:

  1. High yield of cassava as a cash and food crop
  2. Early detection of disease

Data Model:

Target:

  1. Minimize the loss (cost) function
  2. Lower the number of false negatives

From Stakeholder Perspective:

High false negative: A high false negative rate has a severe impact on the livelihood of these subsistence farmers. It creates the false impression that the crops are healthy, which keeps the stakeholders from taking preventive measures in time. The disease then spreads, which has led to famines in the past.

High false positive: Too many cassava plants will be destroyed although they are healthy (loss of income).

To balance the two shortcomings we use an F-score, which combines precision and recall in a single number. The F1-score, their plain harmonic mean, is not a good fit for our case because it weights recall and precision equally, whereas false negatives are the more costly error for our stakeholder.

The F-beta score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0. The beta parameter determines the weight of recall in the combined score: a beta below 1 lends more weight to precision, while a beta above 1 favors recall (as beta approaches 0 only precision counts, and as beta approaches +inf only recall counts).

The two other commonly used F measures are the F2-score, which weights recall higher than precision, and the F0.5-score, which puts more emphasis on precision than recall. Since we want to put more emphasis on recall than on precision the F2-score will be the best metric in our case.
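The weighting described above can be made concrete with the usual F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where P is precision and R is recall. The following is a minimal sketch in plain Python; the function name `fbeta` and the precision and recall values are hypothetical and chosen only for illustration, they are not results from this project.

```python
def fbeta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall (F-beta score)."""
    if beta < 0:
        raise ValueError("beta must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Precision and recall are both zero: define the score as 0.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical classifier: precise, but it misses many diseased plants.
precision, recall = 0.9, 0.6

f05 = fbeta(precision, recall, 0.5)  # emphasizes precision
f1 = fbeta(precision, recall, 1.0)   # weights both equally
f2 = fbeta(precision, recall, 2.0)   # emphasizes recall

print(f"F0.5={f05:.3f}  F1={f1:.3f}  F2={f2:.3f}")
```

Because recall is the weaker of the two values in this example, the F2-score comes out lowest and the F0.5-score highest, so F2 penalizes missed diseases the most. For a multi-class problem such as this one, scikit-learn's `fbeta_score` computes the same quantity per class and combines the classes through its `average` parameter.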

Exploratory Data Analysis

In this project we classify cassava leaves as healthy or diseased with deep learning classification algorithms. We use a dataset of 21,397 labeled images collected during a regular survey in Uganda (5,656 in the training set and 15,741 in the test set). Most images were crowd-sourced from farmers taking photos of their gardens and annotated by experts at the National Crops Resources Research Institute (NaCRRI) in collaboration with the AI lab at Makerere University, Kampala.

Data Distribution:

Data Imbalance: Our EDA shows that the training set is imbalanced: CMD has 2,658 observations (46.99 percent), CBSD 1,443 (25.51 percent), CGM 773 (13.67 percent), CBB 466 (8.24 percent) and Healthy 316 (5.59 percent).

Missing Values: Our data analysis shows that there are no missing values.

Image Quality: Our dataset contains images of poor quality that could affect the model's predictions. To address this, we applied a Laplacian transformation to filter out blurry images.
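The text names the blur filter only briefly. A common technique of this kind is the variance of the Laplacian: a sharp image has strong edges and therefore a high variance, while a blurry image has a low one. The sketch below assumes that technique. The function names, the 4-neighbour kernel and the threshold of 100.0 are illustrative assumptions, not the project's actual settings.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    `gray` is a grayscale image given as a list of rows of intensities.
    """
    height, width = len(gray), len(gray[0])
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3x3 pixels")
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Discrete Laplacian: sum of the four neighbours minus 4x centre.
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


BLUR_THRESHOLD = 100.0  # hypothetical value; would need tuning on real images


def is_blurry(gray, threshold=BLUR_THRESHOLD):
    """Flag an image as blurry when its Laplacian variance is below threshold."""
    return laplacian_variance(gray) < threshold


# Two synthetic 8x8 test images: a checkerboard (many edges) and a flat patch.
sharp = [[255 if (x + y) % 2 == 0 else 0 for x in range(8)] for y in range(8)]
flat = [[128] * 8 for _ in range(8)]

print(is_blurry(sharp), is_blurry(flat))
```

The pure-Python loops keep the sketch self-contained. On real photos an image library would normally do this step, for example OpenCV's `cv2.Laplacian` followed by the variance of the result.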

Data Cleanliness: Some images show parts of the cassava plant and other objects that do not belong in the dataset. This will also, to some extent, reduce model accuracy during training.

Baseline Model Selection

Baseline Model:

The baseline model is a simple sanity check that consists of comparing one’s estimator against simple rules of thumb. The goal is to beat a dummy classifier that makes predictions using simple rules. scikit-learn's DummyClassifier implements several such strategies, including:

  • stratified: generates random predictions by respecting the training set class distribution.
  • most_frequent: always predicts the most frequent label in the training set.

In our dataset the probability of drawing a CMD image is 46.99%, CBSD 25.51%, CGM 13.67%, CBB 8.24% and Healthy 5.59%. Our baseline model always predicts the label with the highest probability, which is CMD (label 3). However, because our dataset is imbalanced, accuracy can be a misleading metric for our modeling.
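The arithmetic behind these baselines can be sketched from the class counts alone. The example below is illustrative: it works out what accuracy the two dummy strategies would reach on data with the same distribution, without training anything. The Healthy count of 316 is an assumption derived as 5,656 minus the four counts reported in the EDA, and the variable names are hypothetical.

```python
# Class counts from the EDA. The Healthy count is derived, not reported:
# 5,656 training images minus the other four classes.
counts = {"CBB": 466, "CBSD": 1443, "CGM": 773, "CMD": 2658, "Healthy": 316}

total = sum(counts.values())
probs = {label: n / total for label, n in counts.items()}

# most_frequent: always predict the majority class.
majority = max(counts, key=counts.get)
most_frequent_accuracy = probs[majority]

# stratified: predict class i with probability p_i, so the chance of a
# correct guess is the sum over all classes of p_i * p_i.
stratified_accuracy = sum(p * p for p in probs.values())

# Recall per class for most_frequent: 1 for the majority class, else 0.
recall = {label: 1.0 if label == majority else 0.0 for label in counts}
macro_recall = sum(recall.values()) / len(recall)

print(f"majority class: {majority}")
print(f"most_frequent accuracy: {most_frequent_accuracy:.4f}")
print(f"stratified accuracy:    {stratified_accuracy:.4f}")
print(f"most_frequent macro recall: {macro_recall:.2f}")
```

The most_frequent strategy reaches close to 47 percent accuracy while detecting none of the other three diseases and no healthy plants. That is why accuracy alone is misleading here and a recall-oriented metric is preferable. With scikit-learn, the same baseline would be `DummyClassifier(strategy="most_frequent")`.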