TSELab / guac-alytics

A series of tools and resources to better understand the risk profile of open source software ecosystems
Apache License 2.0

Create a test dataset for parsers #34

Open SahithiKasim opened 1 year ago

SahithiKasim commented 1 year ago

Run the scripts from #20 and #26 on these branches: https://github.com/TSELab/guac-alytics/tree/parsers and https://github.com/TSELab/guac-alytics/tree/vulnerability_data

SahithiKasim commented 1 year ago

@absol27 Could you tell them how to get the publish data?

VinhPham2106 commented 1 year ago

@absol27 We are testing the scripts on the Publish branch. Right now there is an inconsistency between snapshot_download.py and the publish_parser: one reads from files ending in Packages and the other from Packages.dump. After we fixed that, we hit a UnicodeDecodeError on some of the files (it cannot decode bytes such as 0x90, 0x9d, etc.). Have you tested this and gotten it working on your end?
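For context, a minimal sketch of how that UnicodeDecodeError can arise: the later comments indicate the dumps are gzip-compressed, so opening one as plain text fails on raw bytes like 0x90 or 0x9d. Checking the gzip magic bytes (0x1f 0x8b) before decoding is one way to tell the two layouts apart; the helper name and the errors="replace" fallback below are illustrative, not part of the existing scripts.

```python
import gzip

# Hypothetical helper: check the gzip magic bytes (0x1f 0x8b) to decide whether
# a dump needs decompressing before it is decoded as text. errors="replace"
# avoids hard failures on stray bytes such as 0x90 or 0x9d in a corrupted file.
def read_packages_dump(path: str) -> str:
    with open(path, "rb") as f:
        head = f.read(2)
    if head == b"\x1f\x8b":
        with gzip.open(path, "rb") as f:   # compressed dump: decompress, then decode
            return f.read().decode("utf-8", errors="replace")
    with open(path, "rb") as f:            # plain-text dump: decode directly
        return f.read().decode("utf-8", errors="replace")
```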

VinhPham2106 commented 1 year ago

@SahithiKasim We have collected sample data and confirmed that the build_info, maintainer, and popcon scripts all work. Do we need to write pytest assertions for those at the moment?

SahithiKasim commented 1 year ago

I don’t think we need pytest assertions right now! But it would be helpful if you wrote a requirements file listing the packages that need to be installed and any paths that need to be set before running the scripts.
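A minimal sketch of what such a requirements file could look like; the thread does not list the scripts' actual dependencies or paths, so every entry and name below is a placeholder.

```
# requirements.txt (illustrative placeholders; replace with the scripts' real imports)
requests    # e.g. if the download scripts use it for HTTP requests

# Paths to document alongside this file (names are hypothetical):
# DATA_DIR -- directory holding the downloaded Packages / Packages.dump files
# OUT_DIR  -- directory where parser output is written
```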

absol27 commented 1 year ago

@VinhPham2106 I'm assuming that happens when the file doesn't download completely or the download is corrupted. In the ideal case it should work; I verified it. If the file is corrupted for any reason, the decompressor does not recognize the gzip magic bytes.

I would recommend adding code to retry and redownload if the gzip decompression fails. I also didn't handle download failures (I know the download 404s sometimes and requires a restart). A couple of try/except statements should handle both cases.
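A minimal sketch of that retry-and-redownload logic, assuming urllib is used for the download; the function and parameter names are illustrative and not taken from snapshot_download.py.

```python
import gzip
import time
import urllib.request
from urllib.error import HTTPError

# Hypothetical wrapper along the lines suggested above: redownload when the
# server 404s or when the downloaded file fails to gunzip (a bad magic byte
# usually means a truncated or corrupted download).
def download_and_decompress(url: str, dest: str, retries: int = 3, delay: float = 5.0) -> bytes:
    for attempt in range(1, retries + 1):
        try:
            urllib.request.urlretrieve(url, dest)
            with gzip.open(dest, "rb") as f:
                return f.read()  # raises BadGzipFile/EOFError if the file is corrupted
        except HTTPError as err:
            # the mirror 404s occasionally and just needs another attempt
            print(f"attempt {attempt}: HTTP {err.code}, retrying...")
        except (gzip.BadGzipFile, EOFError, OSError) as err:
            print(f"attempt {attempt}: decompression failed ({err}), redownloading...")
        time.sleep(delay)
    raise RuntimeError(f"giving up on {url} after {retries} attempts")
```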

SahithiKasim commented 1 year ago