TSELab / guac-alytics

A series of tools and resources to better understand the risk profile of open source software ecosystems
Apache License 2.0

Create a test dataset for parsers #34

Open SahithiKasim opened 1 year ago

SahithiKasim commented 1 year ago

Run the scripts from #20 and #26 on these branches: https://github.com/TSELab/guac-alytics/tree/parsers and https://github.com/TSELab/guac-alytics/tree/vulnerability_data

SahithiKasim commented 1 year ago

@absol27 Could you tell them how to get the publish data?

VinhPham2106 commented 1 year ago

@absol27 We are testing the scripts on the Publish branch. Right now there is an inconsistency between snapshot_download.py and the publish_parser: one reads from files ending in Packages and the other from Packages.dump. After we fixed that, we hit a UnicodeDecodeError on some of the files (it cannot decode bytes such as 0x90, 0x9d, etc.). Have you tested this and gotten it working on your end?
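For context, a minimal sketch of how that UnicodeDecodeError can arise: the later comments indicate the dumps are gzip-compressed, so opening one as plain text fails on raw bytes like 0x90 or 0x9d. Checking the gzip magic bytes (0x1f 0x8b) before decoding is one way to tell the two layouts apart; the helper name and the errors="replace" fallback below are illustrative, not part of the existing scripts.

```python
import gzip

# Hypothetical helper: check the gzip magic bytes (0x1f 0x8b) to decide whether
# a dump needs decompressing before it is decoded as text. errors="replace"
# avoids hard failures on stray bytes such as 0x90 or 0x9d in a corrupted file.
def read_packages_dump(path: str) -> str:
    with open(path, "rb") as f:
        head = f.read(2)
    if head == b"\x1f\x8b":
        with gzip.open(path, "rb") as f:   # compressed dump: decompress, then decode
            return f.read().decode("utf-8", errors="replace")
    with open(path, "rb") as f:            # plain-text dump: decode directly
        return f.read().decode("utf-8", errors="replace")
```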

VinhPham2106 commented 1 year ago

@SahithiKasim We have collected sample data and confirmed that the build_info, maintainer, and popcon scripts all work. Do we need to write pytest assertions for those at the moment?

SahithiKasim commented 1 year ago

I don’t think we need pytest assertions right now! But it would be helpful if you wrote a requirements file listing the packages that need to be installed and any paths that need to be set before running the scripts.
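A minimal sketch of what such a requirements file could look like; the thread does not list the scripts' actual dependencies or paths, so every entry and name below is a placeholder.

```
# requirements.txt (illustrative placeholders; replace with the scripts' real imports)
requests    # e.g. if the download scripts use it for HTTP requests

# Paths to document alongside this file (names are hypothetical):
# DATA_DIR -- directory holding the downloaded Packages / Packages.dump files
# OUT_DIR  -- directory where parser output is written
```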

absol27 commented 1 year ago

@VinhPham2106 I'm assuming that happens when the file doesn't download completely or the download is corrupted. In the ideal case it should work; I verified it. If the file is corrupted for any reason, the decompressor does not recognize the gzip magic bytes.

I would recommend adding code to retry and redownload if the gzip decompression fails. I also didn't handle download failures (I know the download 404s sometimes and requires a restart). A couple of try/except statements should handle both cases.
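A minimal sketch of that retry-and-redownload logic, assuming urllib is used for the download; the function and parameter names are illustrative and not taken from snapshot_download.py.

```python
import gzip
import time
import urllib.request
from urllib.error import HTTPError

# Hypothetical wrapper along the lines suggested above: redownload when the
# server 404s or when the downloaded file fails to gunzip (a bad magic byte
# usually means a truncated or corrupted download).
def download_and_decompress(url: str, dest: str, retries: int = 3, delay: float = 5.0) -> bytes:
    for attempt in range(1, retries + 1):
        try:
            urllib.request.urlretrieve(url, dest)
            with gzip.open(dest, "rb") as f:
                return f.read()  # raises BadGzipFile/EOFError if the file is corrupted
        except HTTPError as err:
            # the mirror 404s occasionally and just needs another attempt
            print(f"attempt {attempt}: HTTP {err.code}, retrying...")
        except (gzip.BadGzipFile, EOFError, OSError) as err:
            print(f"attempt {attempt}: decompression failed ({err}), redownloading...")
        time.sleep(delay)
    raise RuntimeError(f"giving up on {url} after {retries} attempts")
```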

SahithiKasim commented 1 year ago