chorsley / python-Wappalyzer

Python driver for Wappalyzer, a web application detection utility.
GNU General Public License v3.0

created test for valid selector that does not increase time #79

Open brandonscholet opened 1 year ago

brandonscholet commented 1 year ago

My repo wappybird implements your library. I started pulling the updated Wappalyzer technology files. They have had issues with valid JSON, so I started pulling the current release instead, but one of its selectors is malformed. I talked to the maintainer of soupsieve, who provided a function to check for valid selectors and skip any that are not. This replaces the crude try/catch code.

I can update your repo to pull the current technologies if you would like. Or feel free to pull from wappybird.

Also, the pip package is out of date and incompatible with the updated technologies files.

tristanlatr commented 1 year ago

Thanks @brandonscholet. Can you provide a test for an invalid selector, please?

brandonscholet commented 1 year ago

The current release of npm-Wappalyzer has this broken selector:

iframe[scr='//airtable.com/'], a[href='//airtable.com/][target='_blank']
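For reference, a minimal sketch of how a selector like this could be screened before use. This is not the function the soupsieve maintainer provided (that code is not shown in this thread); it is an assumption-laden illustration that compiles each selector with `soupsieve.compile` and treats a `SelectorSyntaxError` as "invalid". The helper name `is_valid_selector` and the well-formed sample selector are hypothetical.

```python
import soupsieve as sv


def is_valid_selector(selector):
    """Return True if soupsieve can compile the CSS selector, False otherwise.

    Hypothetical helper for illustration only.
    """
    try:
        sv.compile(selector)
    except sv.SelectorSyntaxError:
        return False
    return True


selectors = [
    # Made-up, well-formed selector for comparison
    "iframe[src='//airtable.com/']",
    # Broken selector quoted above (missing closing quote in the href value)
    "iframe[scr='//airtable.com/'], a[href='//airtable.com/][target='_blank']",
]

# Keep only the selectors that compile; malformed ones are skipped
usable = [s for s in selectors if is_valid_selector(s)]
print(usable)
```

Checking each selector once when the technologies are loaded, rather than wrapping every page match in an exception handler, is one way to keep the cost out of the per-page analysis.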

brandonscholet commented 1 year ago

This will pull the latest release into the technologies file. They have had broken selectors for the past two releases.

import io
import json
import os
from zipfile import ZipFile

import requests
from Wappalyzer import Wappalyzer, WebPage

# local technologies file, also passed to Wappalyzer.latest() below
technologies_file = os.path.expanduser('~/.python-Wappalyzer/technologies.json')

def update_technologies_from_latest():
    print("updating technologies")
    technologies = {}
    categories = {}

    # get the latest release metadata
    latest_release = requests.get('https://api.github.com/repos/wappalyzer/wappalyzer/releases/latest').json()
    # download the release zip into memory
    zip_response = requests.get(latest_release['zipball_url'])
    myzip = ZipFile(io.BytesIO(zip_response.content))

    # parse files
    for listed_file in myzip.namelist():
        # get all technology files
        if "src/technologies" in listed_file and listed_file.endswith(".json"):
            # extract file into json
            tech_json = json.loads(myzip.read(listed_file).decode('UTF-8'))
            # add to full json
            technologies.update(tech_json)
        if "src/categories.json" in listed_file:
            # extract categories into json
            categories = json.loads(myzip.read(listed_file).decode('UTF-8'))
    # merge into one object
    combined_object = {'categories': categories, 'technologies': technologies}

    # make sure the target directory exists, then write to file
    os.makedirs(os.path.dirname(technologies_file), exist_ok=True)
    with open(technologies_file, 'w', encoding='utf-8') as tfile:
        json.dump(combined_object, tfile)
    print("done!\n")

# refresh the local file, then analyze a page with it
update_technologies_from_latest()
webpage = WebPage.new_from_url("https://example.com", verify=False, timeout=60)
wappalyzer = Wappalyzer.latest(technologies_file=technologies_file)
techs = wappalyzer.analyze_with_versions_and_categories(webpage)

brandonscholet commented 1 year ago

Looking back, the print statements should probably be removed.
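
As a rough illustration of that cleanup, the merge step of update_technologies_from_latest could be pulled into a function with no prints and no network access, so it can be exercised against an in-memory zip. This is only a sketch: the names `merge_technologies` and `build_sample_zip` are hypothetical, and the archive contents are made-up sample data, not real Wappalyzer entries.

```python
import io
import json
from zipfile import ZipFile


def merge_technologies(zip_bytes):
    """Merge src/technologies/*.json and src/categories.json from a zip archive.

    Hypothetical helper: takes the raw bytes of a release zip and returns the
    combined object in the same shape the function above writes to disk.
    """
    technologies = {}
    categories = {}
    with ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if "src/technologies/" in name and name.endswith(".json"):
                technologies.update(json.loads(archive.read(name).decode("utf-8")))
            elif name.endswith("src/categories.json"):
                categories = json.loads(archive.read(name).decode("utf-8"))
    return {"categories": categories, "technologies": technologies}


def build_sample_zip():
    """Build a tiny in-memory archive with made-up sample data."""
    buffer = io.BytesIO()
    with ZipFile(buffer, "w") as archive:
        archive.writestr("repo/src/technologies/a.json", json.dumps({"Airtable": {"cats": [1]}}))
        archive.writestr("repo/src/technologies/b.json", json.dumps({"Bootstrap": {"cats": [2]}}))
        archive.writestr("repo/src/categories.json", json.dumps({"1": {"name": "Sample"}}))
    return buffer.getvalue()


combined = merge_technologies(build_sample_zip())
print(sorted(combined["technologies"]))
```

Separating the download from the merge also makes it easier to write a test that does not depend on the GitHub API being reachable.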