VirusTotal / vt-py

The official Python 3 client library for VirusTotal
https://virustotal.github.io/vt-py/
Apache License 2.0

URL scanning returning NotFoundError #181

Closed: HeinzJS closed this issue 6 months ago

HeinzJS commented 6 months ago

I am trying to scan a list of URLs for their statistics, but every time a new URL is scanned it returns a NotFoundError. However, after searching for the exact same URL on VirusTotal's website and re-running the code, it works...

import csv

import pandas as pd
import vt

KEY = "{KEY_HERE}"
client = vt.Client(KEY)

# List of (Title, URL) pairs extracted from a JSON file
# (`data` is loaded from that file earlier in the script)
dev_urls = [[item['title'], item['dev_web']] for item in data]

with open('output/vt_report.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    num_lines = len(pd.read_csv('output/vt_report.csv'))
    if num_lines == 0:
        writer.writerow(['Title', 'URL', 'Last Analysis Stats', 'Last Final URL', 'Total Votes'])

    for i in range(num_lines - 2, num_lines + 3):
        print(dev_urls[i])
        url_id = vt.url_id(dev_urls[i][1])
        print(url_id)
        try:
            url = client.get_object("/urls/{}", url_id)
            data = [
                dev_urls[i][0],
                url.url,
                url.last_analysis_stats,
                url.last_final_url,
                url.total_votes
            ]
            print(data)
            # writer.writerow(data)
        except vt.APIError as e:
            print(f"Error: {e}")
# No f.close() needed: the `with` block already closes the file.

Example: URL: https://steamcommunity.com/linkfilter/?u=https%3A%2F%2Fwww.waryards.com, Title: War Yards

The first time the code ran, the following output was received (screenshot attached). After inputting the failing URL into the VirusTotal website and re-running, this output was received (screenshot attached).

The list has 500+ elements, and on the initial scan (when num_lines == 0) everything worked fine until element 218. Upon re-running to start scanning from where it last left off, this problem occurred.

Any help would be greatly appreciated! It's my first time writing an issue, so sorry if the formatting is bad.

mgmacias95 commented 6 months ago

Hello @HeinzJS,

This part of the code is assuming the URL exists in VT:

            url = client.get_object("/urls/{}", url_id)

I would change the code to be like this:

        try:
            url = client.get_object("/urls/{}", url_id)
        except vt.APIError as e:
            if e.code == 'NotFoundError':
                # The URL has never been scanned: submit it, wait for the
                # analysis to finish, then fetch the report.
                client.scan_url(dev_urls[i][1], wait_for_completion=True)
                url = client.get_object("/urls/{}", url_id)
            else:
                print(f"Error: {e}")
                continue
        data = [
            dev_urls[i][0],
            url.url,
            url.last_analysis_stats,
            url.last_final_url,
            url.total_votes
        ]
        print(data)
        # writer.writerow(data)

Also, if you are scanning a long list of URLs, I would recommend using async code to improve performance.
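A minimal sketch of the async approach with bounded concurrency (vt-py is built on asyncio and, as I understand it, exposes `_async` variants of its methods such as `get_object_async`; the real call is left in a comment since it needs an API key and network access, and a stand-in coroutine is used so the example is self-contained):

```python
import asyncio

async def fetch_report(client, url_id, sem):
    # Cap the number of requests in flight to respect API rate limits.
    async with sem:
        # With the real library this would be something like:
        #   return await client.get_object_async("/urls/{}", url_id)
        # Here we simulate the call so the sketch runs on its own.
        await asyncio.sleep(0)
        return f"report for {url_id}"

async def main(url_ids):
    sem = asyncio.Semaphore(4)   # at most 4 concurrent requests
    client = None                # stand-in for vt.Client(KEY)
    tasks = [fetch_report(client, uid, sem) for uid in url_ids]
    # gather() preserves input order, so results line up with url_ids.
    return await asyncio.gather(*tasks)

reports = asyncio.run(main(["id1", "id2", "id3"]))
print(reports)
```

The same pattern applies to any per-URL work: fan the IDs out as tasks, and let the semaphore keep the request rate bounded.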

I hope this helps.

Regards, Marta

HeinzJS commented 6 months ago

Hello @mgmacias95,

Thank you so much for your help, the problem has been resolved.

Just a question out of curiosity: this behavior only occurred for some URLs, while others worked fine without any problem, even during the initial scan.

Might this be because those specific links had been searched by other users beforehand? Or what exactly does "URL exists in VT" mean?

Thank you once again, I'll have a look at async code as per your suggestion.

Should I close the issue now?

Regards, Heinz

mgmacias95 commented 6 months ago

Hello @HeinzJS,

If a URL exists in VT, it means another user has scanned it before. When you query the API with GET /urls/{id}, you are requesting the latest scan we did on that URL. On the other hand, when you do POST /urls, a new scan is triggered at that moment (you can do this for URLs that are already present in VT as well) and a database entry is created, which you can then fetch by calling GET /urls/{id}.
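For reference, the {id} in those endpoints does not require any API call to compute: VirusTotal accepts the unpadded URL-safe base64 encoding of the URL itself as its identifier, which is what vt.url_id returns. A minimal sketch of the equivalent computation:

```python
import base64

def url_id(url: str) -> str:
    # VirusTotal's URL identifier: URL-safe base64 of the URL
    # with the trailing "=" padding stripped (mirrors vt.url_id).
    return base64.urlsafe_b64encode(url.encode()).decode().rstrip("=")

print(url_id("https://www.waryards.com"))
```

Decoding the identifier (after restoring padding) yields the original URL, which is handy when debugging which URL a failing GET /urls/{id} request referred to.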

I hope this clarifies your question.

Regards, Marta