Open SahithiKasim opened 1 year ago
@absol27 Could you tell them how to get the publish data?
@absol27 We are testing scripts on the Publish branch. Right now there's an inconsistency between snapshot_download.py and the publish_parser: one reads from files ending in Packages and the other reads from Packages.dump. After we changed that, some of the files raise a UnicodeDecodeError (bytes such as 0x90 and 0x9D cannot be decoded). Have you tested it and got it working on your end?
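To narrow down the UnicodeDecodeError, here is a small diagnostic sketch. It assumes the files are meant to be UTF-8, which nothing in this thread confirms, and `find_undecodable` is a hypothetical helper, not something in the repo. It only reports where the first undecodable byte sits so the surrounding bytes can be inspected.

```python
def find_undecodable(data: bytes, encoding: str = "utf-8"):
    """Return (offset, offending_bytes) for the first decode failure, or None.

    Hypothetical helper for debugging; the encoding is an assumption.
    """
    try:
        data.decode(encoding)
        return None
    except UnicodeDecodeError as exc:
        # exc.start / exc.end delimit the byte range the codec rejected.
        return exc.start, data[exc.start:exc.end]


if __name__ == "__main__":
    sample = b"Package: foo\x90bar"
    print(find_undecodable(sample))
```

If the reported offset sits near the very start of the file, that would point towards a file that is still compressed or was truncated rather than a text-encoding problem.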
@SahithiKasim We have collected sample data and confirmed that the build_info, maintainer and popcon scripts all work. Do we need to write pytest assertions for those at the moment?
I don't think we need pytest assertions now! But it would be helpful if you wrote a requirements file listing the packages that need to be installed and any paths that must be set before running the scripts.
@VinhPham2106 I'm assuming that happens when the file doesn't download completely or the download is corrupted. In the ideal case it works; I verified it. If the file is corrupted for any reason, the magic bytes are not recognized and it cannot be decompressed.
I would recommend adding code to retry and redownload if the gzip decompression fails. The same goes for a failed download (I know the download sometimes 404s and requires a restart); I didn't add handling for that. A couple of try/except blocks should cover both cases.
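A minimal sketch of that retry idea, using only the standard library. The names `fetch_url` and `download_and_decompress`, the retry count and the delay are all assumptions for illustration; this is not the code in snapshot_download.py. It treats a failed download (including an HTTP 404), missing gzip magic bytes, and a truncated or corrupted stream as reasons to try again.

```python
import gzip
import time
import urllib.request
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def fetch_url(url, timeout=30):
    """Download the raw bytes at url (raises an OSError subclass on HTTP errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_and_decompress(url, fetch=fetch_url, retries=3, delay=2.0):
    """Fetch and gunzip url, redownloading when either step fails.

    Hypothetical helper: retries and delay are illustrative defaults.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            raw = fetch(url)
            if not raw.startswith(GZIP_MAGIC):
                raise gzip.BadGzipFile("missing gzip magic bytes")
            return gzip.decompress(raw)
        except (OSError, EOFError, zlib.error) as exc:
            # OSError covers HTTPError/URLError and BadGzipFile,
            # EOFError a truncated stream, zlib.error corrupted data.
            last_error = exc
            if attempt < retries:
                time.sleep(delay)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

Passing the downloader in as `fetch` keeps the retry logic testable without network access; a stub that returns bad bytes once and valid gzip data afterwards exercises the redownload path.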
Run the scripts on these (#20 and #26): https://github.com/TSELab/guac-alytics/tree/parsers and https://github.com/TSELab/guac-alytics/tree/vulnerability_data