emergenzeHack / covid19italia

We share information and reports about COVID19
https://www.covid19italia.help
MIT License

Fetch only modified issues in github2CSV.py script #546

Closed. trampfox closed 4 years ago

trampfox commented 4 years ago

This PR introduces some changes to the existing github2CSV.py script in order to fetch only modified or new issues.

When the script calls get_issues to retrieve the issues from the configured repository, it sets the since parameter so that only modified or new issues are returned. The issues are sorted by the updated_at field:

issues = r.get_issues(since=last_time, labels=filter_labels, state='all', sort='updated')

The since parameter is set to the most recent timestamp found in the CSV file, plus one second. I used the CSV file because it is a mandatory program argument, so I expect it to always be present.
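A minimal sketch of how the since value could be derived from the CSV file. The column name updated_at, the timestamp format, and the helper name last_update_from_csv are assumptions made for illustration; they are not taken from the script itself.

```python
import csv
import io
from datetime import datetime, timedelta


def last_update_from_csv(csv_file, column="updated_at", fmt="%Y-%m-%d %H:%M:%S"):
    """Return the most recent timestamp in `column` plus one second.

    Returns None when the file has no usable rows, so the caller can
    fall back to fetching every issue. (Hypothetical helper.)
    """
    reader = csv.DictReader(csv_file)
    stamps = [datetime.strptime(row[column], fmt) for row in reader if row.get(column)]
    if not stamps:
        return None
    # +1 second so the issue that produced the max timestamp is not fetched again
    return max(stamps) + timedelta(seconds=1)


# In-memory stand-in for the issues CSV (assumed layout)
sample = io.StringIO(
    "number,updated_at\n"
    "1,2020-04-24 20:57:13\n"
    "2,2020-04-23 10:00:00\n"
)
since = last_update_from_csv(sample)
```

Returning None for an empty file is one possible design choice: it lets the caller fall back to a full fetch instead of failing.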

I also took the opportunity to refactor the code a little and to improve the script's log messages.

Closes #492

trampfox commented 4 years ago

As suggested by @mfortini, I verified that the "rewrite" of the existing issues doesn't introduce encoding issues.

I downloaded the following files from the _emergenzeHack/covid19italiadata

and I ran the github2CSV.py script, forcing the latest timestamp to 2020-04-24 20:57:13, which is the most recent timestamp in the issues.csv file as of commit 1ddb1b534b12fe4cd5853ab566af03518852a81c.

Below are the output logs of the script:

2020-04-25 11:44:43,833 - github2CSV - INFO - Retrieving issues from Github (since 2020-04-24 20:57:14)...
2020-04-25 11:44:44,043 - github2CSV - INFO - 0 issues retrieved...
2020-04-25 11:44:44,239 - github2CSV - INFO - [CSV] Updating issues (if any)...
2020-04-25 11:44:44,439 - github2CSV - INFO - [CSV] Writing new issues...
2020-04-25 11:44:44,440 - github2CSV - INFO - [CSV] Total issues: 1722
2020-04-25 11:44:44,500 - github2CSV - INFO - [JSON] Updating issues (if any)...
2020-04-25 11:44:44,501 - github2CSV - INFO - [JSON] Writing new issues...
2020-04-25 11:44:44,501 - github2CSV - INFO - [JSON] Total issues: 1722
2020-04-25 11:44:44,614 - github2CSV - INFO - [GeoJSON] Updating issues (if any)...
2020-04-25 11:44:44,615 - github2CSV - INFO - [GeoJSON] Writing new issues...
2020-04-25 11:44:44,615 - github2CSV - INFO - [GeoJSON] Total issues: 1722
2020-04-25 11:44:44,676 - github2CSV - INFO - Done.
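The "Updating issues (if any)" and "Writing new issues" steps in the log can be pictured as an upsert keyed by issue number. The sketch below illustrates that idea only; the function name merge_issues, the number key, and the row layout are assumptions, not the script's actual code.

```python
def merge_issues(existing, fetched, key="number"):
    """Overwrite rows whose key matches a fetched issue and append the rest.

    Returns the merged rows plus the counts of updated and new issues.
    (Hypothetical helper; rows are plain dicts here.)
    """
    merged = {row[key]: row for row in existing}
    updated = new = 0
    for issue in fetched:
        if issue[key] in merged:
            updated += 1
        else:
            new += 1
        # Existing keys keep their position; new keys go to the end
        merged[issue[key]] = issue
    return list(merged.values()), updated, new


existing = [{"number": 1, "title": "a"}, {"number": 2, "title": "b"}]

# No fetched issues: the existing rows are written back unchanged
rows, updated, new = merge_issues(existing, [])

# One modified issue and one new issue
rows2, updated2, new2 = merge_issues(
    existing, [{"number": 2, "title": "B"}, {"number": 3, "title": "c"}]
)
```

With an empty fetch the output equals the input, which matches the run above, where 0 issues were retrieved and the total stayed at 1722.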

So the script finds no new issues and saves the existing 1722 issues to the following new files

Then I ran the diff command on each of the three files, comparing them with the files from the _emergenzeHack/covid19italiadata repository. No differences were found for the JSON and GeoJSON files.

For the CSV file some differences were found, caused by additional white space. Re-running the command with the --ignore-space-change option (which ignores changes in the amount of white space) shows no differences between the file from the repository and the one generated by the script.
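The same kind of check can be approximated in Python. This is a rough sketch that imitates the idea behind --ignore-space-change by collapsing runs of white space and dropping trailing white space before comparing. It is not a reimplementation of diff, and the function names and sample data are made up for illustration.

```python
import re


def normalize_space(line):
    """Collapse each run of white space to one space and drop trailing white space."""
    return re.sub(r"\s+", " ", line).rstrip()


def same_ignoring_space_change(text_a, text_b):
    """Compare two texts line by line after white-space normalization."""
    lines_a = [normalize_space(line) for line in text_a.splitlines()]
    lines_b = [normalize_space(line) for line in text_b.splitlines()]
    return lines_a == lines_b


# Made-up CSV snippets that differ only in the amount of white space
repo_csv = "number,title\n1,Some  title\n"
new_csv = "number,title \n1,Some title\n"
only_space_differs = same_ignoring_space_change(repo_csv, new_csv)
```

Lines that differ by adding or removing white space entirely ("a b" versus "ab") still count as different, as they would with the diff option.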

Below is the output of the md5 command:

❯ md5 issuesjson.json _githubdata/issuesjson.json
MD5 (issuesjson.json) = a86df14f7e92e609c1117dad86e1e0d1
MD5 (_githubdata/issuesjson.json) = a86df14f7e92e609c1117dad86e1e0d1

❯ md5 issues.geojson _githubdata/issues.geojson
MD5 (issues.geojson) = 6f7a66b0d4e4aa7478391a7f97b6829c
MD5 (_githubdata/issues.geojson) = 6f7a66b0d4e4aa7478391a7f97b6829c

trampfox commented 4 years ago

@mfortini I also added the timezone to the max timestamp value retrieved from the CSV file. Like the timestamps returned by the GitHub API, the max updated_at is treated as UTC.
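A minimal sketch of attaching UTC to a timestamp parsed from the CSV, assuming the value is stored without timezone information in the format shown in the logs. The helper name as_utc is hypothetical.

```python
from datetime import datetime, timezone


def as_utc(naive_timestamp):
    """Mark a naive datetime as UTC without shifting its clock value."""
    return naive_timestamp.replace(tzinfo=timezone.utc)


# Latest timestamp from the test run above, parsed as a naive datetime
latest = datetime.strptime("2020-04-24 20:57:13", "%Y-%m-%d %H:%M:%S")
latest_utc = as_utc(latest)
```

Making the value timezone-aware matters when it is compared with other aware datetimes, because Python refuses to order a naive datetime against an aware one.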

avivace commented 4 years ago

Looks good to me. @mfortini did you have time to run any tests?