Closed trampfox closed 4 years ago
As suggested by @mfortini, I verified that the "rewrite" of the existing issues doesn't introduce encoding issues.
I downloaded the following files from the _emergenzeHack/covid19italiadata repository and ran the github2CSV.py script, forcing the latest timestamp to 2020-04-24 20:57:13, which is the latest timestamp found in the issues.csv file as of commit 1ddb1b534b12fe4cd5853ab566af03518852a81c.
Below are the output logs of the script:
2020-04-25 11:44:43,833 - github2CSV - INFO - Retrieving issues from Github (since 2020-04-24 20:57:14)...
2020-04-25 11:44:44,043 - github2CSV - INFO - 0 issues retrieved...
2020-04-25 11:44:44,239 - github2CSV - INFO - [CSV] Updating issues (if any)...
2020-04-25 11:44:44,439 - github2CSV - INFO - [CSV] Writing new issues...
2020-04-25 11:44:44,440 - github2CSV - INFO - [CSV] Total issues: 1722
2020-04-25 11:44:44,500 - github2CSV - INFO - [JSON] Updating issues (if any)...
2020-04-25 11:44:44,501 - github2CSV - INFO - [JSON] Writing new issues...
2020-04-25 11:44:44,501 - github2CSV - INFO - [JSON] Total issues: 1722
2020-04-25 11:44:44,614 - github2CSV - INFO - [GeoJSON] Updating issues (if any)...
2020-04-25 11:44:44,615 - github2CSV - INFO - [GeoJSON] Writing new issues...
2020-04-25 11:44:44,615 - github2CSV - INFO - [GeoJSON] Total issues: 1722
2020-04-25 11:44:44,676 - github2CSV - INFO - Done.
So the script found no new issues and saved the existing 1722 issues to the following new files
Then I ran the diff command on each of the three files, comparing them with the files from the _emergenzeHack/covid19italiadata repository. No differences were found for the JSON and GeoJSON files.
For the CSV file some differences were found, caused by additional white space. Re-running the command with the --ignore-space-change option (which ignores changes in the amount of white space) shows no differences between the file from the repository and the one generated by the script.
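For anyone who wants to repeat the CSV check without diff, here is a minimal Python sketch of a comparison that ignores changes in the amount of white space. It only approximates what --ignore-space-change does (runs of white space are collapsed to one space and trailing white space is dropped), and the function names are hypothetical, not part of github2CSV.py.

```python
import re


def normalize_spaces(line: str) -> str:
    """Collapse each run of white space to one space and drop trailing white space."""
    return re.sub(r"\s+", " ", line).rstrip()


def same_ignoring_space_change(text_a: str, text_b: str) -> bool:
    """Return True if the two texts match line by line after normalization."""
    lines_a = [normalize_spaces(line) for line in text_a.splitlines()]
    lines_b = [normalize_spaces(line) for line in text_b.splitlines()]
    return lines_a == lines_b
```

The two CSV files can be read with open(path, encoding="utf-8").read() and passed to same_ignoring_space_change.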
Below is the output of the md5 command:
❯ md5 issuesjson.json _githubdata/issuesjson.json
MD5 (issuesjson.json) = a86df14f7e92e609c1117dad86e1e0d1
MD5 (_githubdata/issuesjson.json) = a86df14f7e92e609c1117dad86e1e0d1
❯ md5 issues.geojson _githubdata/issues.geojson
MD5 (issues.geojson) = 6f7a66b0d4e4aa7478391a7f97b6829c
MD5 (_githubdata/issues.geojson) = 6f7a66b0d4e4aa7478391a7f97b6829c
@mfortini I also added the timezone to the max value of the timestamp retrieved from the CSV file. As with the timestamps from the GitHub API, the max updated_at is treated as UTC.
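To illustrate the idea, a minimal sketch of attaching the UTC timezone to a timestamp read from the CSV. The "%Y-%m-%d %H:%M:%S" format is an assumption based on the timestamp quoted earlier in this thread, and parse_as_utc is a hypothetical name, not necessarily what the script uses.

```python
from datetime import datetime, timezone

# Assumed CSV timestamp format, e.g. "2020-04-24 20:57:13"
CSV_TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_as_utc(value: str) -> datetime:
    """Parse a naive timestamp string and mark it as UTC without shifting the clock time."""
    naive = datetime.strptime(value, CSV_TIMESTAMP_FORMAT)
    return naive.replace(tzinfo=timezone.utc)
```

Using replace(tzinfo=...) only labels the value as UTC; it does not convert from the local timezone, which is the intended behaviour if the stored value is already in UTC.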
Looks good to me. @mfortini did you have the time to run any test?
This PR introduces some changes to the existing github2CSV.py script in order to fetch only modified or new issues.
When the script calls get_issues to retrieve the issues from the configured repository, it sets the since parameter so that only modified or new issues are retrieved. The issues are stored by the updated_at field: the since parameter is set using the most recent timestamp (+1 second) found in the CSV file. I used the CSV file because it is a mandatory program argument, so it should always be present, I guess.
I took the opportunity to refactor the code a little and to improve the log messages of the script.
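A minimal sketch of how the since value could be derived from the CSV file, under some assumptions: the column is named updated_at, timestamps use the "%Y-%m-%d %H:%M:%S" format, and they are in UTC. The name compute_since is hypothetical and the actual code in the PR may differ.

```python
import csv
from datetime import datetime, timedelta, timezone

# Assumed CSV timestamp format, e.g. "2020-04-24 20:57:13"
CSV_TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def compute_since(csv_file, column="updated_at"):
    """Return the most recent timestamp in the CSV plus one second, or None if there is none.

    csv_file is any open text file object; column is the assumed header name.
    """
    stamps = []
    for row in csv.DictReader(csv_file):
        value = row.get(column)
        if value:
            parsed = datetime.strptime(value, CSV_TIMESTAMP_FORMAT)
            stamps.append(parsed.replace(tzinfo=timezone.utc))
    if not stamps:
        return None  # no existing issues: the caller would fetch everything
    return max(stamps) + timedelta(seconds=1)
```

The extra second avoids fetching the most recently updated issue again. With a latest timestamp of 2020-04-24 20:57:13 the result is 2020-04-24 20:57:14, which matches the value shown in the log above.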
Closes #492