eladcn / coronavirus_prediction

This project aims to predict the daily published numbers of Coronavirus (COVID-19) cases and deaths.
GNU General Public License v3.0

IndexError: too many indices for array on some datasets #5

Closed shlima closed 4 years ago

shlima commented 4 years ago
Traceback (most recent call last):
  File "main.py", line 103, in <module>
    model_handler(model_config)
  File "main.py", line 90, in model_handler
    x = training_set[:, 0].reshape(-1, 1)
IndexError: too many indices for array
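For context, this is the error NumPy raises when 2-D indexing like `[:, 0]` is applied to a 1-D array. I don't know exactly how main.py loads the CSV, but one common cause is the dataset coming back as a flat array (for example, a one-row file, or a parse that collapsed the columns). A small illustrative sketch of the failure and a defensive fix (not the project's actual loading code):

```python
import numpy as np

# A flat array has shape (2,), not (1, 2), so 2-D slicing fails.
flat = np.array([0, 15733])
try:
    flat[:, 0]
except IndexError as exc:
    print(exc)  # "too many indices for array ..."

# Forcing at least two dimensions before slicing avoids the crash.
training_set = np.atleast_2d(flat)   # shape (1, 2)
x = training_set[:, 0].reshape(-1, 1)
y = training_set[:, 1]
print(x.shape, y.shape)
```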

CSV (Germany)

0,15733
1,15734
2,15735
3,15736
4,15737
5,15738
6,15739
7,15740
8,15741
9,15742
10,15743
11,15744
12,15745
13,15746
14,15747
15,15748
16,15749
17,15750
18,15751
19,15752
20,15753
21,15754
22,15755
23,15756
24,15757
25,15758
26,15759
27,15760
28,15761
29,15762
30,15763
31,15764
32,15765
33,15766
34,15767
35,15768
36,15769
37,15770
38,15771
39,15772
40,15773
41,15774
42,15775
43,15776
44,15777
45,15778
46,15779
47,15780
48,15781
49,15782
50,15783
51,15784
52,15785
53,15786
54,15787
55,15788
56,15789
57,15790
58,15791
59,15792
60,15793
61,15794
62,15795
63,15796
64,15797
65,15798
66,15799
67,15800
68,15801
69,15802
70,15803
71,15804
72,15805
73,15806
74,15807
75,15808
76,122159
77,128244
eladcn commented 4 years ago

I tried using the dataset you provided by replacing the contents of the file 'cases_dataset_2020-04-09.csv' and setting the 'grab_data_from_server' property to 'false' under the cases model, and I did not receive any errors.

Can you please describe the steps you took that resulted in this error?

shlima commented 4 years ago

Hm, same for me now, the error is gone. But I have started to get incorrect results for some countries.

CSV file for Germany:

0,0
1,0
2,0
3,0
4,0
5,1
6,4
7,4
8,4
9,5
10,8
11,10
12,12
13,12
14,12
15,12
16,13
17,13
18,14
19,14
20,16
21,16
22,16
23,16
24,16
25,16
26,16
27,16
28,16
29,16
30,16
31,16
32,16
33,16
34,17
35,27
36,46
37,48
38,79
39,130
40,159
41,196
42,262
43,482
44,670
45,799
46,1040
47,1176
48,1457
49,1908
50,2078
51,3675
52,4585
53,5795
54,7272
55,9257
56,12327
57,15320
58,19848
59,22213
60,24873
61,29056
62,32986
63,37323
64,43938
65,50871
66,57695
67,62095
68,66885
69,71808
70,77872
71,84794
72,91159
73,96092
74,100123
75,103374
76,107663
77,113296
78,118181

Forecast:

The forecast for Cases in the following 30 days is:
1: 116149
2: 115391
3: 112786
4: 108023
5: 100763
6: 90631
7: 77222
8: 60091
9: 38757
10: 12698
11: -18650
12: -55897
13: -99700
14: -150765
15: -209853
16: -277776
17: -355408
18: -443681
19: -543591
20: -656200
21: -782641
22: -924118
23: -1081912
24: -1257382
25: -1451972
26: -1667208
27: -1904711
28: -2166191
29: -2453457
30: -2768420
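(For anyone reading along: a forecast plunging negative like this is the classic failure mode of extrapolating a high-degree polynomial. Outside the fitted range, the highest-order term dominates and the curve shoots off. A sketch with synthetic logistic-shaped data — illustrative values, not the real German series and not the project's model code:)

```python
import numpy as np

# Synthetic cumulative counts with a logistic shape, standing in for
# a real cases series (illustrative values only).
days = np.arange(60.0)
cases = 120000.0 / (1.0 + np.exp(-0.25 * (days - 35.0)))

# Mirror the config above: degree 7, predict 30 days ahead.
coeffs = np.polyfit(days, cases, 7)
future = np.arange(60.0, 90.0)
forecast = np.polyval(coeffs, future)

# Inside the data range the fit looks fine; beyond it, the x**7 term
# takes over and the forecast leaves any plausible range.
print(forecast.min(), forecast.max())
```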

Config file:

{
    "models": [
        {
            "model_name": "Cases",
            "polynomial_degree": 7,
            "datagrabber_class": "CasesDataGrabber",
            "grab_data_from_server": false,
            "offline_dataset_date": "0000-00-00",
            "days_to_predict": 30
        },
        {
            "model_name": "Deaths",
            "polynomial_degree": 7,
            "datagrabber_class": "DeathsDataGrabber",
            "grab_data_from_server": false,
            "offline_dataset_date": "0000-00-00",
            "days_to_predict": 30
        }
    ]
}

Chart

photo_2020-04-10 14 39 06

eladcn commented 4 years ago

I see - you are getting these incorrect results because the polynomial degree of your model is too high for your data.

In order to get better results, you need to tweak the "polynomial_degree" hyper-parameter in the config file (this is a trial-and-error process). For starters, try a polynomial degree of 2, 3 or 4 instead of 7. According to the data visualization you provided, a polynomial degree of 2 or 3 should fit quite well.
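(To illustrate why a lower degree is preferable when it already fits: a hypothetical sketch using np.polyfit, which may not be the exact fitting API main.py uses. When a low degree matches the data, higher degrees buy nothing in-sample and only add wiggle room that destabilizes the forecast:)

```python
import numpy as np

# Illustrative series with quadratic-looking growth (not real data).
days = np.arange(20.0)
cases = 3.0 * days**2 + 5.0

# Degrees 2, 3 and 7 all fit this series essentially perfectly, so
# the extra coefficients of the higher degrees add no in-sample value.
for degree in (2, 3, 7):
    coeffs = np.polyfit(days, cases, degree)
    rmse = np.sqrt(np.mean((np.polyval(coeffs, days) - cases) ** 2))
    print(degree, rmse)
```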

shlima commented 4 years ago

@eladcn thank you, your suggestion works.

Now I have 2 cases:

Estonia with a polynomial_degree of 3:

photo_2020-04-10 15 47 21

Estonia with a polynomial_degree of 5:

photo_2020-04-10 15 47 23

It seems that the second chart for Estonia is more believable.

Can you suggest a pattern by which I can set the polynomial_degree to the correct value for each country?

Dataset for Estonia:

0,0
1,0
2,0
3,0
4,0
5,0
6,0
7,0
8,0
9,0
10,0
11,0
12,0
13,0
14,0
15,0
16,0
17,0
18,0
19,0
20,0
21,0
22,0
23,0
24,0
25,0
26,0
27,0
28,0
29,0
30,0
31,0
32,0
33,0
34,0
35,0
36,1
37,1
38,1
39,1
40,1
41,2
42,2
43,3
44,10
45,10
46,10
47,10
48,12
49,16
50,16
51,79
52,115
53,171
54,205
55,225
56,258
57,267
58,283
59,306
60,326
61,352
62,369
63,404
64,538
65,575
66,645
67,679
68,715
69,745
70,779
71,858
72,961
73,1039
74,1097
75,1108
76,1149
77,1185
78,1207
eladcn commented 4 years ago

Unfortunately there isn't really a pattern for this, but I can give you a few tips:

  1. Visualizing the data is helpful because you can see how many inflection points it has - the more inflection points there are, the more the model needs to flex to fit them, so a higher polynomial degree might be better.
  2. If the increase rate looks close to linear, a polynomial degree of 1 will probably do well; if it looks closer to quadratic, a polynomial degree of 2 will probably do well, and so on.
  3. You don't need to fit your model to all of your data - if there is a sudden, large change in the rate of increase, you can fit on, for example, only the last 10 days of the dataset.
  4. The more data you have, the more stable the model will be.
  5. If one polynomial degree (say 7) suddenly starts giving unusual values, tweaking the degree up or down by 1 or 2 may do the trick - it really depends on the data; if the data changes dramatically, you will probably need to re-tune it.
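(The trial-and-error in these tips can be made systematic. A sketch, under the assumption that the model is a plain least-squares polynomial fit as np.polyfit performs; `pick_degree` is a hypothetical helper, not part of this project: hold out the most recent days and keep the degree that predicts them best.)

```python
import numpy as np

def pick_degree(days, cases, candidates, holdout=5):
    """Fit each candidate degree on all but the last `holdout` days and
    return the degree with the lowest error on the held-out days."""
    train_x, test_x = days[:-holdout], days[-holdout:]
    train_y, test_y = cases[:-holdout], cases[-holdout:]

    def error(degree):
        coeffs = np.polyfit(train_x, train_y, degree)
        return np.mean(np.abs(np.polyval(coeffs, test_x) - test_y))

    return min(candidates, key=error)

# Illustrative cubic-like growth: degree 3 predicts the held-out days
# far better than degrees 1 or 2.
days = np.arange(30.0)
cases = 2.0 * days**3 - 5.0 * days + 40.0
print(pick_degree(days, cases, candidates=(1, 2, 3)))  # → 3
```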

I will consider adding neural network support to this project in the coming days - neural networks might be better suited for some scenarios.

shlima commented 4 years ago

Thank you very much for your support