For many websites, the NVIP crawler uses a Generic Parser to extract CVE information, and in many cases the information it gets will be malformed or irrelevant. To combat this, the first stage of the reconciler is to filter out the garbage. To develop this filtering stage, the following must be done:
[ ] Create a JSON dataset by manually labeling crawler data (you can find this data in the rawdescription table after a crawler run). This dataset should be used to develop the following filters:
[ ] Develop a set of "easy" filters which are run locally and use no resource-heavy ML. You will have to base these filters' operations on your observations of the crawler data.
[ ] (Optional) Develop a set of filters that use more computationally expensive local operations. Some NLP models are already used in the reconcilers, so try those first.
[ ] Develop a filter that calls a fine-tuned OpenAI model to detect any remaining garbage. This will probably not require their most advanced models. Start with the cheapest/simplest ones and move up to more expensive models only as necessary.
[ ] Enhance the OpenAI filter to assign a confidence score to each raw vulnerability instead of a mere pass/fail.
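As a starting point for the dataset and "easy" filter tasks above, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the field names `raw_description` and `is_garbage`, the sample records, the three filter rules, and the thresholds are hypothetical and are not taken from the rawdescription table or from real crawler output. The actual rules should come from your own observations of the labeled data.

```python
import json

# Hypothetical labeled records. The field names are assumptions for this
# sketch and do not reflect the real rawdescription schema.
SAMPLE_JSON = """
[
  {"raw_description": "Buffer overflow in the foo parser allows remote attackers to execute arbitrary code via a crafted file.", "is_garbage": false},
  {"raw_description": "", "is_garbage": true},
  {"raw_description": "Home | About | Contact | Login", "is_garbage": true},
  {"raw_description": "12345 67890 !!!! ----", "is_garbage": true}
]
"""

# Illustrative thresholds; tune them against the labeled dataset.
MIN_LENGTH = 20
MIN_ALPHA_RATIO = 0.6


def too_short(text):
    """Flag empty or very short descriptions."""
    return len(text.strip()) < MIN_LENGTH


def mostly_non_alpha(text):
    """Flag text where letters and spaces make up too little of the content."""
    stripped = text.strip()
    if not stripped:
        return True
    readable = sum(ch.isalpha() or ch.isspace() for ch in stripped)
    return readable / len(stripped) < MIN_ALPHA_RATIO


def looks_like_navigation(text):
    """Flag pipe-separated fragments, which often come from menu bars."""
    return text.count("|") >= 2


EASY_FILTERS = [too_short, mostly_non_alpha, looks_like_navigation]


def is_garbage(text):
    """A description is rejected if any easy filter flags it."""
    return any(check(text) for check in EASY_FILTERS)


def accuracy(records):
    """Fraction of labeled records where the filters agree with the label."""
    correct = sum(is_garbage(r["raw_description"]) == r["is_garbage"] for r in records)
    return correct / len(records)


records = json.loads(SAMPLE_JSON)
print(f"accuracy on sample: {accuracy(records):.2f}")
```

Keeping each rule as its own small function makes it easy to measure how much each one contributes on the labeled dataset, and to see what is left over for the more expensive filters.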
Notes on OpenAI development:
You get $20 worth of tokens when opening an account and that should be plenty to experiment and develop with. If you find yourself running out, alert Matt or Andrew ASAP.
There is an open-source, community-maintained Java library, but it lacks some of the tools in the official Python library. If it is easier for you to develop in Java, go ahead with that, but final implementations should call a Python script for any OpenAI communications.
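One possible shape for that Python script is sketched below. It reads one JSON record per line and writes one JSON verdict per line, so the Java side could start it as a subprocess and exchange records over stdin/stdout. The record fields (`id`, `raw_description`), the output fields, the 0.5 threshold, and the line-per-record protocol are all assumptions for illustration. The classifier is a placeholder that makes no network calls; the real OpenAI request would replace it.

```python
import json
import sys


def placeholder_classifier(description):
    """Stand-in for the OpenAI request (hypothetical).

    Returns a confidence in [0, 1] that the text is a usable description.
    This stub only checks for non-empty text.
    """
    return 1.0 if description.strip() else 0.0


def process_lines(lines, classify=placeholder_classifier, threshold=0.5):
    """Turn JSON-lines records into JSON-lines verdicts.

    Each verdict carries a confidence score plus a pass/fail flag derived
    from it, so callers can use either one.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        score = classify(record["raw_description"])
        yield json.dumps(
            {"id": record["id"], "confidence": score, "passed": score >= threshold}
        )


def main(stream=sys.stdin):
    """Read records from the stream and print one verdict per line."""
    for verdict in process_lines(stream):
        print(verdict, flush=True)

# When saved as a script, call main() under the usual
# `if __name__ == "__main__":` guard.
```

Passing the classifier in as a parameter keeps the I/O protocol testable without spending tokens, and returning a score alongside the pass/fail flag leaves room for the confidence-score task.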