LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and can retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0

Information Security Standards, Frameworks, Controls Mapping #1720

Open ascarola opened 1 year ago

ascarola commented 1 year ago

Per a comment left for me on Discord by Huu Nguyen#6677, I would like to help with the project.

Here is the background, which I posted earlier today: "I'm an InfoSec pro and would like to have a means to create mappings between different InfoSec standard frameworks, such as NIST, ISO, and some financial regulatory frameworks such as the FFIEC Cybersecurity Assessment Tool, NCUA's ISE assessment, ISO 27000-series, etc. Most all of this data is publicly accessible, except for the ISO standards. I've asked ChatGPT to help me perform some simple mappings; however, it provided completely incorrect information - utterly hallucinating in the responses. How can I help support Open Assistant to get it to help support other information security professionals with this generally simple task using publicly available datasets? This could be very valuable assistance for my field."

When asked to create a dataset, I replied: "I could absolutely help with that, but I'm wondering if it would be possible for the engine to kick off the help initially. For example, I could easily provide it with the standards (with assistance in formatting them) for multiple datasets, and it would be great if it could begin to perform some generic mapping based on the declarative statements. For example, given the following two statements, determine how similar the 'controls' are: 'the company installs firewalls at the perimeter of the network' and 'the company has board-approved information security policies'. The result of this operation could then be reviewed by me and other infosec pros to validate those responses. The datasets contain many declarative control statements, in the range of 250-500+ each."
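To make the "how similar are these two controls" idea concrete, here is a minimal sketch in Python. It assumes a purely lexical measure (cosine similarity over word counts), which is not what Open Assistant or any particular model does; the function names and the small stopword list are hypothetical. A lexical score only catches shared wording, so it would serve at best as a rough first pass before human review.

```python
import math
import re
from collections import Counter

# Small illustrative stopword list (an assumption; tune it for real control text).
STOPWORDS = {"a", "an", "and", "at", "has", "in", "of", "the", "to"}


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def cosine_similarity(a, b):
    """Cosine similarity of word-count vectors; 0.0 if either text has no tokens."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# The two example control statements from the comment above.
control_a = "the company installs firewalls at the perimeter of the network"
control_b = "the company has board-approved information security policies"

# A low score is expected: the statements share little wording beyond "company".
print(round(cosine_similarity(control_a, control_b), 3))
```

Scores close to 1 mean near-identical wording and scores near 0 mean almost no shared vocabulary. Two controls can describe the same requirement in different words, so a semantic method would be needed for real mappings, with every result still checked by a reviewer.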

I'm not a developer, more of a technical manager, and would need to be walked through the processes.

How can I help?!

bitplane commented 1 year ago

Sounds good :)

Steps to add a dataset are documented here:

https://github.com/LAION-AI/Open-Assistant/blob/main/openassistant/datasets/README.md

How are your Python skills? The tricky part if you don't have Python skills will be taking the docs and converting them into the question+answer format that we're using for training.

If you post bits of research and background info in this ticket (even feel free to use it as a bit of a journal) and ask for help on Discord the community will help you along with the steps.

Also, I'd be happy to help as I've not been through this dataset creation process myself and would like to prove it out. Feel free to DM me on Discord.

ascarola commented 1 year ago

Hi Gareth,

I'm starting to think about how to do this. I reviewed the documentation, but am stuck on the second step ("Creating a Dataset on Hugging Face").

I'm not a developer. I do understand Linux and have a VM running on my VirtualBox; however, I do not know where to start with the code. Do you have more of a novice guide on how to do this?

Otherwise, I can provide the mapping documentation that I have, and request that you or someone else set up the dataset, as another option.

I appreciate your thoughts.

Anthony

On Sat, Feb 18, 2023 at 5:10 PM Gareth Davidson @.***> wrote:

Sounds good :)

Steps to add a dataset are documented here:

https://github.com/LAION-AI/Open-Assistant/blob/main/openassistant/datasets/README.md


ascarola commented 1 year ago

Hi Gareth,

So, for the first dataset, I think we should first focus on the CRI Profile, from here https://cyberriskinstitute.org/the-profile/. I've attached the spreadsheet.

License for The Profile is here https://creativecommons.org/licenses/by-nc-nd/4.0/ (Creative Commons), and as you can see, we would need to provide attribution in the responses.

As for an example of how to use this with Open Assistant, consider the spreadsheet tab "CRI Profile to FFIEC CAT", which provides a mapping between the Cyber Risk Institute Profile diagnostic statements (controls) and the FFIEC CAT declarative statements; and specifically consider the first statement ID GV.SF-1.1:

Of course, this could be repeated for the other mappings as well, such as for tab "Mapping to NYDFS", "Mapping to NIST CSF v1.1 v2", etc. I think these three would be the most valuable, for US financial services (banks).

Also, I do know one of the creators of this spreadsheet (Josh Magri), so we could determine if the attribution was really necessary.

Another very important mapping is the FFIEC CAT to the NIST CSF. Details of that can be found here https://www.ffiec.gov/pdf/cybersecurity/FFIEC_CAT_App_B_Map_to_NIST_CSF_June_2015_PDF4.pdf, and are not subject to copyright (it was created by the US government). See specifically starting at page nine, which shows the NIST Cybersecurity Framework to FFIEC Cybersecurity Assessment Tool mapping. And this would be used for Open Assistant like this:

There are other mappings which have not yet been created, such as the NCUA Information Security Examination (ISE) to FFIEC Cybersecurity Assessment Tool (CAT), which would be immensely valuable, but I have to create that or find someone who already has.


huu4ontocord commented 1 year ago

Thank you for taking this on @ascarola. Let us know in the Discord if you need help coding this. I think the mappings look good :)

ascarola commented 1 year ago

So, today I reached out to the creator of the data (aka "The Profile") which I wish to create a dataset from. The data is licensed under Creative Commons, which requires attribution. I think he's good with using the data for the dataset; however, he'd like to have a call with someone responsible for the project to chat about it more. I think he is also very much interested in using the release internally for his organization (Cyber Risk Institute - not-for-profit - see https://cyberriskinstitute.org/the-profile/).

Can someone help me to set up a call with the organization? I'm guessing it would be with someone at LAION?

In the meantime, can someone help me to format the data into a usable dataset?

bitplane commented 1 year ago

I can't see any spreadsheet attached here. Can you attach it so I can have a peek at it please?

ascarola commented 1 year ago

Here it is, again.

On Wed, Feb 22, 2023 at 1:12 AM Gareth Davidson @.***> wrote:

I can't see any spreadsheet attached here. Can you attach it so I can have a peek at it please?


ascarola commented 1 year ago

Trying again. CRI_Profile_v1.2.1_Non-MacroEnabled_Assessment-Jan-2023.xlsx

bitplane commented 1 year ago

> In the meantime, can someone help me to format the data into a usable dataset?

Okay so I downloaded it and had a look through it. I guess we just need to iterate over the data in various ways. Here's a bit of an outline of the tasks to get it done:

  1. Extract the sheets into CSV format (a one-liner to duplicate unmerged cells and another one to dump each one out)
  2. Make some prompts and response templates that we can use; boilerplate for questions and answers that we can use randomly for each one.
  3. In a first pass, make simple question and answer json trees that use the templates. Just simple Q -> A, one deep.
  4. Think about how we can add more context to make them conversation-like, like adding stuff on the beginning for context, and linking stuff together.
  5. Add some summary questions and answers about the dataset, asking what the different acronyms are all about, who issued them and so on. Just as a broader context thing to stitch it together and keep the info in one place.
  6. Then we'd get this code and put it into a huggingface repo (and here on GitHub just because)
  7. Run it to create the data, and upload it to huggingface as a dataset.
  8. Then fork this repo and add code that downloads and imports the dataset, and put it in a PR that links to and closes this issue.
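Steps 1 to 3 above could be sketched roughly as follows. This is only an illustration under stated assumptions: it presumes the sheet has already been exported to CSV, the column names `source_id` and `target_id` are hypothetical, the `CAT-EXAMPLE-*` IDs are placeholders rather than real FFIEC CAT statements, and the flat question/answer dicts are a stand-in for whatever format the datasets README actually requires.

```python
import csv
import io
import json
import random

# Hypothetical templates (step 2); the wording is illustrative only.
QUESTION_TEMPLATES = [
    "Which {target} statement maps to {source} {source_id}?",
    "What is the {target} equivalent of {source} {source_id}?",
]
ANSWER_TEMPLATES = [
    "{source} {source_id} maps to {target} {target_id}.",
    "The matching {target} statement for {source} {source_id} is {target_id}.",
]

# Stand-in for one exported sheet. The blank cell mimics a merged cell.
SAMPLE_CSV = """source_id,target_id
GV.SF-1.1,CAT-EXAMPLE-1
,CAT-EXAMPLE-2
"""


def forward_fill(rows):
    """Copy the last non-empty value down each column (step 1).

    Caveat: this fills every blank cell, so it is only safe for columns
    where blanks come from merged cells rather than missing data.
    """
    last_seen = {}
    filled = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if value and value.strip():
                last_seen[key] = value
            new_row[key] = last_seen.get(key, "")
        filled.append(new_row)
    return filled


def make_pairs(rows, source, target, rng):
    """Build one-deep question/answer pairs from random templates (step 3)."""
    pairs = []
    for row in rows:
        fields = {
            "source": source,
            "target": target,
            "source_id": row["source_id"],
            "target_id": row["target_id"],
        }
        pairs.append(
            {
                "question": rng.choice(QUESTION_TEMPLATES).format(**fields),
                "answer": rng.choice(ANSWER_TEMPLATES).format(**fields),
            }
        )
    return pairs


rows = forward_fill(list(csv.DictReader(io.StringIO(SAMPLE_CSV))))
pairs = make_pairs(rows, "CRI Profile", "FFIEC CAT", random.Random(0))
print(json.dumps(pairs, indent=2))
```

Passing in a seeded `random.Random` keeps the generated dataset reproducible between runs, which makes it easier to review changes to the templates.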

The main blocker here would be that the license (CC BY-NC-ND 4.0) has a "NoDerivatives" term, i.e. no derived works are allowed, which would make the dataset uploaded to huggingface a copyright violation.

For the attribution we could put "According to cyberriskinstitute.org", "According to data produced by CRI", "Cyber Risk Institute mappings of these standards say" into the output templates, which would preserve attribution for each of the mappings and statements. But the author would have to agree to that I guess.
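As a rough sketch of how those attribution phrases could be attached to every generated answer: the three prefixes are the ones suggested above, while the connecting words (", " and " that "), the function name, and the example statement are assumptions for illustration. Whether this satisfies the license is a question for the author, not something the code settles.

```python
import random

# Prefixes taken from the suggestion above; the joining words are an assumption.
ATTRIBUTION_TEMPLATES = [
    "According to cyberriskinstitute.org, {statement}",
    "According to data produced by CRI, {statement}",
    "Cyber Risk Institute mappings of these standards say that {statement}",
]


def with_attribution(statement, rng):
    """Wrap an answer in a randomly chosen attribution template."""
    return rng.choice(ATTRIBUTION_TEMPLATES).format(statement=statement)


# Placeholder answer text; CAT-EXAMPLE-1 is not a real FFIEC CAT statement ID.
example = "GV.SF-1.1 maps to CAT-EXAMPLE-1."
print(with_attribution(example, random.Random(0)))
```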

The data collection efforts are a community thing, and appropriate datasets will be selected later by the ML team. If they don't want the data they'll be free to not train with it, but if the data is good they probably will. The point is that it forms part of a broader data collection movement for all models and research and isn't directly owned by the project, so I doubt project leads could issue any assurances over the phone. But I'll not speak for them, of course.