TSELab / guac-alytics

A series of tools and resources to better understand the risk profile of open source software ecosystems
Apache License 2.0

Environment Variables #42

Open SahithiKasim opened 1 year ago

SahithiKasim commented 1 year ago

mkdir demo
cd demo/
vim constants.py
touch __init__.py
cd ..
python
vim demo/constants.py
vim things.py
python things.py
LOC="salut" python things.py
export LOC="hola mundo"
python things.py
unset LOC
python things.py

constants.py

import os

if os.getenv("LOC"):
    LOC = os.getenv("LOC")
else:
    LOC = "hello world"
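Side note: os.getenv accepts a default value as its second argument, so the same check could be collapsed to one line. This is nearly equivalent (it differs only when LOC is set to an empty string), just shorter:

import os

# Nearly equivalent: falls back to the default when LOC is unset.
# Unlike the if/else above, an empty LOC="" is kept as-is here.
LOC = os.getenv("LOC", "hello world")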

things.py:

import demo.constants

print(demo.constants.LOC)
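Given these two files, the last few commands of the shell session above should behave as follows (expected output based on the code, not captured from an actual run):

$ python things.py
hello world
$ LOC="salut" python things.py
salut
$ export LOC="hola mundo"
$ python things.py
hola mundo
$ unset LOC
$ python things.py
hello world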

VinhPham2106 commented 1 year ago

@SantiagoTorres the options to specify the constants have been added to the parsers branch. We wonder how the usage of each variable will be explained to users, other than having to look into every script. Should we write a doc, or is there already one?
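One lightweight option, if a separate doc feels heavy, is to list the supported environment variables in each script's --help output. A minimal sketch, assuming the scripts use argparse; the variable names and descriptions below are only illustrative, the real ones live in the parsers branch:

import argparse

# Hypothetical list; replace with the variables the parsers branch actually reads.
ENV_VARS = {
    "DB_LOCATION": "path to the sqlite database",
    "ARCH": "architecture to ingest (e.g. amd64)",
}

epilog = "environment variables:\n" + "\n".join(
    f"  {name:<12} {desc}" for name, desc in ENV_VARS.items()
)

parser = argparse.ArgumentParser(
    description="Ingest Debian package list dumps.",
    epilog=epilog,
    formatter_class=argparse.RawDescriptionHelpFormatter,
)
args = parser.parse_args()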

JorgeH309 commented 1 year ago

@SahithiKasim @absol27 When we test-run the publish scripts, it takes about 40 minutes to parse the published packages for just one date. Most of that time is spent parsing the main dump file for the date. Since we have finished our other tasks, should we try to optimize the script's speed, or leave it as it is?

VinhPham2106 commented 1 year ago
def parse_packagelist(date, ARCH, db_location, DFSG):
    counter = 0
    con = open_db(db_location)
    with open(f'./ingestion/parsers/Packagelist_DUMP/{date}-{ARCH}-{DFSG}_Packages.dump', 'r', encoding='utf-8') as rf:
        header = ""
        for line in rf:
            if line == "\n":
                # A blank line ends one package stanza; parse the accumulated text.
                parsed_package = parser.parse_string(header).normalized_dict()
                parsed_package["added_at"] = date
                cur = con.cursor()
                parsed_package["architecture"] = ARCH
                parsed_package["provided_by"] = ""
                insert_package(cur, parsed_package, DFSG)
                provided_by = cur.lastrowid
                # Insert a stub row for each virtual package this package provides.
                for provided_package in parsed_package["provides"]:
                    parsed_package["package"] = provided_package
                    parsed_package["version"] = ""
                    parsed_package["size"] = ""
                    parsed_package["provided_by"] = provided_by
                    insert_package(cur, parsed_package, DFSG)
                # One commit per package stanza.
                con.commit()
                header = ""
            else:
                header += line
    close_db(con)
    return

In this function, a commit is made to the db after each package is parsed. From some reading on sqlite3, it can run about 50,000 statements per second. We wonder whether we should optimize the code by bundling statements and reducing the number of commits, or whether the current speed is good enough (16 hours for all 5 years of data, on 8 GB RAM). We are not very proficient with databases, so professor @sbrunswi can you take a quick look?
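A minimal sketch of the batching idea, reusing the same helpers as the function above (open_db, insert_package, close_db, parser) and an arbitrary batch size: commit once every N packages instead of once per package, and once more at the end. Untested, just to illustrate the shape of the change:

BATCH_SIZE = 1000  # arbitrary; tune as needed

def parse_packagelist_batched(date, ARCH, db_location, DFSG):
    con = open_db(db_location)
    cur = con.cursor()
    pending = 0  # packages inserted since the last commit
    with open(f'./ingestion/parsers/Packagelist_DUMP/{date}-{ARCH}-{DFSG}_Packages.dump', 'r', encoding='utf-8') as rf:
        header = ""
        for line in rf:
            if line == "\n":
                parsed_package = parser.parse_string(header).normalized_dict()
                parsed_package["added_at"] = date
                parsed_package["architecture"] = ARCH
                parsed_package["provided_by"] = ""
                insert_package(cur, parsed_package, DFSG)
                provided_by = cur.lastrowid
                for provided_package in parsed_package["provides"]:
                    parsed_package["package"] = provided_package
                    parsed_package["version"] = ""
                    parsed_package["size"] = ""
                    parsed_package["provided_by"] = provided_by
                    insert_package(cur, parsed_package, DFSG)
                pending += 1
                if pending >= BATCH_SIZE:
                    # Commit every BATCH_SIZE packages instead of every package.
                    con.commit()
                    pending = 0
                header = ""
            else:
                header += line
    con.commit()  # flush whatever is left in the final partial batch
    close_db(con)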

absol27 commented 1 year ago

The main dump file is by far the largest of the three files for a given date. I suspect the problem is less the processing of the file than the downloading of the package list dumps. I've added a condition to retry until the download succeeds, but I noticed that it can be stuck in the error loop for a while. Terminating and restarting the program solves the issue; I've discussed this with @VinhPham2106. I wonder if this is the real issue rather than the processing of the file, please let me know if I'm wrong.
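If the retry loop is what gets stuck, one option is to cap the number of attempts and back off between them rather than retrying immediately forever. A sketch under that assumption; download_dump below is a hypothetical stand-in for whatever the branch actually uses to fetch a dump:

import time

MAX_RETRIES = 5

def download_with_backoff(url, dest):
    """Retry a flaky download a bounded number of times, backing off between attempts."""
    for attempt in range(MAX_RETRIES):
        try:
            return download_dump(url, dest)  # hypothetical helper, not the real retry code
        except OSError as exc:
            wait = 2 ** attempt  # 1s, 2s, 4s, 8s, 16s
            print(f"download failed ({exc}); retrying in {wait}s")
            time.sleep(wait)
    raise RuntimeError(f"giving up on {url} after {MAX_RETRIES} attempts")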

I believe the code could be optimized, but I would recommend first checking out what the package list dumps are and how they are structured, to understand what each component means (for example, why parsed_package["provides"] needs to be processed differently).

absol27 commented 1 year ago

Btw, bundling multiple write queries into a single commit is definitely one optimization; there's no reason to have one write query per commit.