BlankerL / DXY-COVID-19-Data

2019新型冠状病毒疫情时间序列数据仓库 | COVID-19/2019-nCoV Infection Time Series Data Warehouse
https://lab.isaaclin.cn/nCoV/
MIT License

cannot download with read.csv() #37

Closed zh-zhang1984 closed 4 years ago

zh-zhang1984 commented 4 years ago
DTdxy <- read.csv("https://raw.githubusercontent.com/BlankerL/DXY-COVID-19-Data/master/csv/DXYArea.csv", header = TRUE, stringsAsFactors = FALSE)

It raises an error:

Error in file(file, "rt") : cannot open the connection to 'https://raw.githubusercontent.com/BlankerL/DXY-COVID-19-Data/master/csv/DXYArea.csv'

With a warning message:

Warning message: In file(file, "rt") : URL 'https://raw.githubusercontent.com/BlankerL/DXY-COVID-19-Data/master/csv/DXYArea.csv': status was 'Couldn't connect to server'

And when I copy https://raw.githubusercontent.com/BlankerL/DXY-COVID-19-Data/master/csv/DXYArea.csv into the browser, the connection cannot be opened either.

BlankerL commented 4 years ago

Hello, I am able to open this file in my browser, and when I test this line in my R interpreter, the data loads correctly into a data frame.


Therefore, there is nothing wrong with your code or the file itself. The most likely reason is that your computer cannot resolve the IP address for raw.githubusercontent.com, so you are unable to connect to the server and fetch the file.

I guess you are in mainland China. GitHub's DNS resolution quality is poor there: the resolved IP may not be the fastest one, and the connection speed can be quite poor.

I have the following suggestions:

  1. Change your DNS server, or modify your hosts file, so that you can connect to GitHub faster and more stably. This is the cleanest method; if it solves the problem, you should be able to load the file from the URL. You can simply search Google or Baidu for the keyword "GitHub DNS".
  2. Connect through a VPN service to speed up your connection to the GitHub server. However, if the problem is DNS resolution, you might not be able to fetch the content even over a VPN.
  3. Download the CSV file directly from the project front page: click the "Clone or Download" button, then "Download ZIP" to get both the CSV and JSON files. With this method, however, you will need to update the CSV file manually.
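
If you go with option 3, the locally extracted file can be read the same way (a sketch; the path below is an example and depends on where you extract the ZIP):

```r
# Read the CSV from the extracted ZIP instead of the raw.githubusercontent.com URL
# (adjust the path to wherever you extracted the archive)
DTdxy <- read.csv("DXY-COVID-19-Data-master/csv/DXYArea.csv",
                  header = TRUE, stringsAsFactors = FALSE)
```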
BlankerL commented 4 years ago

For scientific purposes, I will soon offer another link where you can reach the raw CSV content. Please wait about 30 minutes; I will try to fix it as soon as possible.

BlankerL commented 4 years ago

Hello, I have fixed this issue. If the methods above are too complicated for you, you can simply switch the URL to https://lab.isaaclin.cn/csv/DXYArea.csv, and everything will just work.

All four CSV files can be accessed from this link by changing the filename.

However, these CSV files are stored on my own server, so downloading them will be a little slower than downloading from GitHub.

Furthermore, every time you run this line, your code actually downloads the file, which puts relatively heavy traffic on the server side. Therefore, please load it as few times as possible and keep a backup first, for example:

# Just load once
DTdxy_backup <- read.csv("https://lab.isaaclin.cn/csv/DXYArea.csv", header = TRUE, stringsAsFactors = FALSE)

# Do your research on a copy, DTdxy
DTdxy <- DTdxy_backup

# If you mess up DTdxy, restore it from DTdxy_backup instead of re-downloading
DTdxy <- DTdxy_backup
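
Along the same lines, the download can be cached to disk so that re-running a script does not hit the server at all (a sketch; the local filename is my own choice, the URL is the mirror above):

```r
# Download once to a local file, then reuse the cached copy on later runs
csv_path <- "DXYArea_cache.csv"
if (!file.exists(csv_path)) {
  download.file("https://lab.isaaclin.cn/csv/DXYArea.csv", csv_path, mode = "wb")
}
DTdxy_backup <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
```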

Hope you can enjoy your research.

zh-zhang1984 commented 4 years ago

Yes, the new address works well for me. However, the new dataset lacks data before 2020-01-24; the earliest date should be 2020-01-11. I used the following code and found that data before that date are missing:

DTdxy_backup <- read.csv("https://lab.isaaclin.cn/csv/DXYArea.csv", 
                         header = TRUE, stringsAsFactors = FALSE)
> DTWuhan  <- DTdxy_backup[DTdxy_backup$cityEnglishName=="Wuhan",]
> DTWuhan %>%
+   mutate(Date = as.Date(updateTime)) %>%
+   group_by(Date) %>%
+   filter(updateTime == max(updateTime)) %>%
+   select(cityEnglishName,city_confirmedCount,city_suspectedCount,
+          city_curedCount,city_deadCount,Date) %>% tail()
# A tibble: 6 x 6
# Groups:   Date [6]
  cityEnglishName city_confirmedCount city_suspectedCount city_curedCount city_deadCount Date      
  <chr>                         <int>               <int>           <int>          <int> <date>    
1 Wuhan                          1905                   0              54            104 2020-01-29
2 Wuhan                          1590                   0              47             85 2020-01-28
3 Wuhan                           698                   0              42             63 2020-01-27
4 Wuhan                           618                   0              40             45 2020-01-26
5 Wuhan                           572                   0              32             38 2020-01-25
6 Wuhan                           495                   0              31             23 2020-01-24

Can you help me retrieve the data before 2020-01-24?

BlankerL commented 4 years ago

> Can you help me to retrieve the data before 2020-1-24?

Currently, the crawler only obtains province-level data after January 22 and city-level data after January 24. Those data are published by Ding Xiang Yuan and should be trustworthy.
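
The cutoff can be checked directly from the CSV (a sketch; it assumes DTdxy_backup was loaded as above, and that province-level rows can be told apart by an empty cityName, which is my reading of the file layout):

```r
dates <- as.Date(DTdxy_backup$updateTime)

# Earliest city-level record (rows with a cityName)
min(dates[DTdxy_backup$cityName != ""], na.rm = TRUE)

# Earliest province-level record (rows without a cityName)
min(dates[DTdxy_backup$cityName == ""], na.rm = TRUE)
```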

However, most of the data from January 11 to January 24 are missing and have no reliable source. If this is urgent for you, please try to find some reliable data sources and let me know. I currently have 2 data sources, but their values differ slightly from each other, so I have not yet added them to the database or this data warehouse.

Right now, I am the only person maintaining this project, and I am not a professional in this field, so I do not know many data sources. I have to cross-validate the data sources myself, which will take a relatively long time. If you can find reliable data sources, please contribute them to this project; the data will be collected and added to the database very soon once reliable sources are available.

Open-source projects are not meant to be maintained by a single person alone.

zh-zhang1984 commented 4 years ago

Thank you for your clarification. I will try to contribute to this project if I can find some reliable sources.

BlankerL commented 4 years ago

> Thank you for your clarification. I will try to contribute this project if I can find some reliable source.

Thank you so much! This issue is being widely discussed in BlankerL/DXY-COVID-19-Crawler#3 and #26, and the 2 data sources I mentioned above are in those issues.