IS5882 / Open-CyKG

Datasets needed for OIE and NER #9

Open nitinpi0210 opened 2 years ago

nitinpi0210 commented 2 years ago

Hi Sarhan, for our NLP course project at Berkeley, we are following your Open-CyKG paper. Like Malcolm, who explained this in another post, we also need the datasets you used for the OIE Python notebook. I downloaded the MalwareTextDB database directly from your paper's reference, but it doesn't contain any of the fields required by the downstream code: word_id, word, pred, pred_id, head_pred_id, sent_id, run_id, label.

Can you please give me access to the datasets needed to successfully run the OIE notebook? My email is nitin.pillai@berkeley.edu.

We are in a time crunch with course deadlines approaching, so we would be grateful if you could give us access to the datasets you used for the OIE and NER notebooks.

Thanks, Nitin

nitinpi0210 commented 2 years ago

Thanks for giving me access to All_MDB.csv. This file contains the fields word_id, words, sent_id, label. To run the OIE notebook, I still need the following fields in the dataset: word_id, word, pred, pred_id, head_pred_id, sent_id, run_id, label.

For example, the code you have from the Stanovsky paper for getting sentences from the df needs run_id:

    def get_sents_from_df(df):
        """Split a data frame by rows according to the sentences."""
        return [df[df.run_id == run_id]
                for run_id
                in sorted(set(df.run_id.values))]

And later, when you call load_dataset_encodeinputs, it needs the following fields:

    df.word_id = pd.to_numeric(df.word_id, errors='coerce').astype('Int64')
    df.run_id = pd.to_numeric(df.run_id, errors='coerce').astype('Int64')
    df.sent_id = pd.to_numeric(df.sent_id, errors='coerce').astype('Int64')
    df.head_pred_id = pd.to_numeric(df.head_pred_id, errors='coerce').astype('Int64')
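A quick way to see which of the columns listed above a given CSV is missing, before running the notebook, is to compare its header against the required set. This is only a minimal sketch using the standard library; the helper name `missing_columns` and the inline sample header are illustrative assumptions, not part of the Open-CyKG code.

```python
import csv
import io

# Columns the OIE notebook expects, as listed in this thread.
REQUIRED_COLUMNS = [
    "word_id", "word", "pred", "pred_id",
    "head_pred_id", "sent_id", "run_id", "label",
]


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Hypothetical header mimicking the fields reported for All_MDB.csv.
sample = io.StringIO("word_id,words,sent_id,label\n1,The,0,O\n")
print(missing_columns(sample))
```

For a file on disk, the same function can be called on `open(path, newline="")`. Note that the check is by exact name, so a column called `words` does not count as `word`.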

Is it possible for you to upload the MalwareDB dataset that you used for the OIE notebook that contains all the fields needed to successfully run the notebook?

nitinpi0210 commented 2 years ago

I just tried executing the OIE notebook using the new ALL_MDB.csv file you uploaded and, as expected, I get the following error because it doesn't contain the run_id field. Can you please upload the MalwareDB dataset that contains the run_id field? Also, can you please clarify what the run_id field is?

image

nitinpi0210 commented 2 years ago

I copied the entire sent_id column as run_id and got past the run_id issue, but the dataset still doesn't contain all the fields required for the OIE notebook to run correctly. It needs the pred column and complains about that:

image

image

malcolm1232 commented 2 years ago

Hi @nitinpi0210,

I am able to run up to ~block 7; my code is here: https://colab.research.google.com/drive/1Kh9gsdG2rcySVo-GV5Xc9mW7rx6-pVuW?usp=sharing

But I'm facing an error with TensorFlow. I've put it here for you guys; I would really appreciate it if you are able to solve the TensorFlow issue!

nitinpi0210 commented 2 years ago

Hi @malcolm1232 can you share your notebook with nitin.pillai@berkeley.edu or nitinpillai@gmail.com? I can't access it to help you debug :

image

nitinpi0210 commented 2 years ago

Also @malcolm1232, how were you able to run until block 7 with the MalwareDB dataset? It doesn't contain all the fields that are needed. Can you please upload the dataset with which you ran OIE until block 7 and give me access?

Sarhan said she will reply later this week as she is busy with her deadlines this week. So as soon as I get her dataset, I will try again too. But in the meantime if you modified the dataset to get it to run to Block 7, can you share that dataset? (nitin.pillai@berkeley.edu)

malcolm1232 commented 2 years ago

Hi @nitinpi0210, I have given access. The dataset used was from the author, under _MSB_all_csv.csv. I was able to run it by manipulating the dataset provided by the author (assuming I did it correctly). Have a good day! Do let me know if you run into any trouble.

hvjrocks-ds commented 2 years ago

Hi @malcolm1232, I am also facing the same issue as below. Can you please give access to my email: harsh.jaiswal4@gmail.com?

Regards, Harsh Vardhan Jaiswal

Hi @malcolm1232 can you share your notebook with nitin.pillai@berkeley.edu or nitinpillai@gmail.com? I can't access it to help you debug :

image

malcolm1232 commented 2 years ago

@hvjrocks-ds, I have done so already. @IS5882, I was wondering if you recall which TensorFlow version you were using. Do feel free to let me know the TF version when you are free!

nitinpi0210 commented 2 years ago

@malcolm1232 thanks for sharing. Btw, for the OIE notebook, we were supposed to use the malwaredb dataset as per the author. Why did you use the MSB dataset? That was supposed to be used for the NER notebook as per the paper.

nitinpi0210 commented 2 years ago

Also is this the right move to do? Can you clarify why you are setting the head pred id to 0 throughout the dataframe?

image

IS5882 commented 2 years ago

I updated the public shared folder with OIE dataset that includes all fields

IS5882 commented 2 years ago

@hvjrocks-ds , i have done so already @IS5882 , was wondering if you recalled which tensorflow version you were using! Do feel free to let me know the tf version when u are free!

For the NER ?

nitinpi0210 commented 2 years ago

I am using the following TF and Keras version :

image

But running into the following issue in Block 7

image

nitinpi0210 commented 2 years ago

This is for the OIE Notebook. @IS5882 what version of TF and Keras is needed for the OIE?

malcolm1232 commented 2 years ago

Yes, I've got the same problem as well; I need to find out the right tensorflow/keras version.

Update: Drive Folder here: https://drive.google.com/drive/folders/1zbf2bLLknxEHLJkcVKKmGHnwB9LseCID

Also, @nitinpi0210, do note that spacy_wrapper is a custom spaCy wrapper I created.

Actual code (the library is not available anymore):

    from spacy_wrapper import spacy_whitespace_parser as spacy_ws

Custom code: I wrote the custom code from what I could understand of the objective of the original spacy_ws, which is to "split on whitespace characters":

    def spacyws(input_):
        input_ = str(input_)
        returns_ = input_.split()
        return returns_

Also, @IS5882, so sorry for the trouble, but the spacy_wrapper.py file is empty U.U Sorry for the inconvenience!

malcolm1232 commented 2 years ago

Also is this the right move to do? Can you clarify why you are setting the head pred id to 0 throughout the dataframe?

image

I did this because of the code:

    assert(len(set(full_sent.head_pred_id.values)) == 1)  # Sanity check

Since the sanity check only requires that the number of distinct values equals 1, I assumed it can be any integer.
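The assertion only requires that every row of one run carries the same `head_pred_id`, so a constant 0 passes it; whether 0 is the correct value for a given run is something the assertion cannot tell you. A small sketch, assuming pandas and a toy frame with invented values (the helper `runs_with_inconsistent_head` is hypothetical), shows one way to list the runs that would trip the check:

```python
import pandas as pd


def runs_with_inconsistent_head(df):
    """Return run_ids whose rows do not share a single head_pred_id."""
    counts = df.groupby("run_id")["head_pred_id"].nunique()
    return sorted(counts[counts != 1].index.tolist())


# Toy data with invented values: run 0 is consistent, run 1 is not.
toy = pd.DataFrame({
    "run_id":       [0, 0, 0, 1, 1],
    "word":         ["Malware", "drops", "file", "It", "runs"],
    "head_pred_id": [1, 1, 1, 1, 0],
})
print(runs_with_inconsistent_head(toy))
```

Overwriting the column with a constant, as in `toy.assign(head_pred_id=0)`, makes the list empty, which is why that workaround gets past the assertion.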

nitinpi0210 commented 2 years ago

@malcolm1232 the author provided the new, correct malware dataset that has the relevant fields, so you don't need to do all those DF modifications anymore. I just used the new dataset and can get to Block 7 with no issues. Now dealing with TensorFlow issues.

malcolm1232 commented 2 years ago

@malcolm1232 the author gave the new correct malware dataset that has the relevant fields. So you don't need to do all that DF modifications anymore. I just used the new dataset and can get to Block7 with no issues. Now dealing with tensorflow issues.

oh thanks a ton @IS5882 @nitinpi0210 ❤️ ❤️!!!

malcolm1232 commented 2 years ago

@hvjrocks-ds , i have done so already @IS5882 , was wondering if you recalled which tensorflow version you were using! Do feel free to let me know the tf version when u are free!

For the NER ?

@IS5882 I am trying to run the OIE notebook, but have encountered the same tensorflow/keras error as @nitinpi0210, so I was just wondering which tensorflow/keras version you were using! Oh yes, also: the spacy_wrapper.py file is empty U.U

nitinpi0210 commented 2 years ago

@malcolm1232 @IS5882 The OIE notebook finally works. I didn't need to modify the Google Colab TF or Keras versions; both are running with their default 2.8.0 versions. What did the trick was changing the following 2 lines in Block 7. The original code imports from tensorflow.python.keras; remove the python from there:

    from tensorflow.keras.layers import Layer
    from tensorflow.keras import backend as K
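If a notebook has several such imports, the same change can be applied to a cell's source text mechanically. The sketch below is an assumption about how one might do that with a regular expression; `fix_keras_imports` is a hypothetical helper, and it only rewrites the `tensorflow.python.keras` prefix discussed above.

```python
import re

# Matches the private module path at a word boundary so that names such as
# "tensorflow.python.keras_extra" are left alone.
_PRIVATE_KERAS = re.compile(r"\btensorflow\.python\.keras\b")


def fix_keras_imports(source):
    """Replace tensorflow.python.keras with the public tensorflow.keras path."""
    return _PRIVATE_KERAS.sub("tensorflow.keras", source)


cell = (
    "from tensorflow.python.keras.layers import Layer\n"
    "from tensorflow.python.keras import backend as K\n"
)
print(fix_keras_imports(cell))
```

This works on plain text only, so it does not check that the rewritten import actually exists in the installed TensorFlow version.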

image

nitinpi0210 commented 2 years ago

The OIE notebook now runs fine to completion. Thanks a lot @IS5882 for giving us the modified dataset. Ran to completion finally!

image

malcolm1232 commented 2 years ago

The OIE notebook runs fine now in its completion. Thanks a lot @IS5882 for giving us the modified dataset.

Ran to completion finally !

image

OMMMGGGG!!! Okkays I'll give it a try and let u know!!

malcolm1232 commented 2 years ago

hi, @nitinpi0210 i was able to run the notebook as well, but is it possible to share yours so i could take a look at it as well? sorry for the inconvenience!

malcolm1232 commented 2 years ago

Oh yes, I am wondering if you will be working on the knowledge graph as well? @nitinpi0210 @IS5882, I was wondering if you'd have the data for Knowledge_Graph_Canonicalization.ipynb as well!

nitinpi0210 commented 2 years ago

hey @malcolm1232 whats your email so that I can share? Btw with just those 2 lines you should be able to get things running. Am only doing the OIE piece for now and will do NER and KG later.

malcolm1232 commented 2 years ago

hey @malcolm1232 whats your email so that I can share? Btw with just those 2 lines you should be able to get things running. Am only doing the OIE piece for now and will do NER and KG later.

Hi @nitinpi0210! My email is malcolmTHL95@gmail.com. Yes, I got them running already! But I would like to see your train/test split etc. I'm still working on the KG, which is much tougher to get working without the corresponding datasets xDD

qlM0ri4rty commented 2 years ago

@IS5882 @nitinpi0210 @hvjrocks-ds Help me, please!

image

nitinpi0210 commented 2 years ago

@qlM0ri4rty the google verification code is based on your google credentials. When you run the cell, you should get a popup asking you to enter your google username and password. Make sure you enable popups in your browser so that it doesn't get blocked.

qlM0ri4rty commented 2 years ago

@nitinpi0210 Thanks, I just ran the notebook successfully, but I don't know why the output looks like this. I mean, shouldn't this be the NER's result?

image

qlM0ri4rty commented 2 years ago

Hey! I have a new email address to contact you. I wonder why the OIE notebook has no output files. I mean, shouldn't it produce output files such as a .csv? I only saw the visualization of the prediction; I think there should be an output model and a .csv file.

By the way, did you run the KG notebook? I'm working on it now. I hope I can get some help from you. Thanks!

VICKY-ZZ commented 2 years ago

The OIE notebook runs fine now in its completion. Thanks a lot @IS5882 for giving us the modified dataset. Ran to completion finally ! image

Hi, could you please share the data files mentioned above with me? I can't find them. My email is zzzxp111@gmail.com. Thank you so much!!!

VICKY-ZZ commented 2 years ago

hi, @nitinpi0210 i was able to run the notebook as well, but is it possible to share yours so i could take a look at it as well? sorry for the inconvenience!

Hi, could you please share the data files mentioned above with me? I can't find them. My email address is zzzxp111@gmail.com. Thank you so much!!!

nitinpi0210 commented 2 years ago

@VICKY-ZZ shared my notebook with you : https://colab.research.google.com/drive/1faR2ByWpdbYQoVhtW971fads4HRTr96u

Let me know if you can view it.

VICKY-ZZ commented 2 years ago

@VICKY-ZZ shared my notebook with you : https://colab.research.google.com/drive/1faR2ByWpdbYQoVhtW971fads4HRTr96u

Let me know if you view.

Thank you sooooooo much!!!

notfindmeh commented 2 years ago

@nitinpi0210 I encountered a problem like this when I tried to run your notebook. I am using the dataset shared by the author on Google Drive (all_MLB.ioe.zip). Is my dataset correct? I hope you can share your dataset.

image

This is my email: jiaxsongsci@gmail.com. Thank you so much!!!

jpdong00 commented 2 years ago

The OIE notebook runs fine now in its completion. Thanks a lot @IS5882 for giving us the modified dataset. Ran to completion finally ! image

OMMMGGGG!!! Okkays I'll give it a try and let u know!!

It's great that you can run the code successfully. I still have some problems; could you please share the code and data files with me? My email is jpdong00@gmail.com. Thank you so much!

VICKY-ZZ commented 2 years ago

Thank you soooooooo much!!!!

trungdinh22 commented 2 years ago

Hi @nitinpi0210, could you please share the OIE data files mentioned above with me? I can't find them. My email address is ktrung2210@gmail.com. Thank you so much!

l0renor commented 2 years ago

Hi @IS5882, I am interested in your work as well, for my master's thesis on MISP KGs. Can you please share the MLB_all_csv and NER data with me too? My email is l.lukas@hm.edu.

jenfung commented 2 years ago

Hello, could you share the dataset with me? It would be very helpful for my project. Thanks so much! My email is jenfunf@gmail.com.

zhangshitong commented 1 year ago

@nitinpi0210 Hi, can you please share the dataset with me? Thanks soooooooo much. My email address is zxt841104@126.com

YONGXINCai commented 1 year ago

Hello, could you share the dataset with me? It would be very helpful for my project. Thanks so much! My email is yxinmiracle@gmail.com

hp121389 commented 1 year ago

Hi @nitinpi0210, could you please share the modified OIE dataset mentioned above with me? I can't find it. My email address is pohu12138@gmail.com. Thank you so much!

Xu4nTh0ng commented 1 year ago

hi, @nitinpi0210 i was able to run the notebook as well, but is it possible to share yours so i could take a look at it as well? sorry for the inconvenience!

hello @malcolm1232, could you share your notebook, thank you so much. My email is bvx.thong0202@gmail.com

zhangyi999-g commented 9 months ago

Hi @nitinpi0210, could you please share the modified OIE dataset mentioned above with me? I can't find it. My email address is zhangyiqiong999@163.com. Thank you so much!

XiaoDaiY commented 8 months ago

Hi @nitinpi0210, could you please share the OIE data files mentioned above with me? Thanks for sharing. My email address is daiweinudt@163.com

Ruan3yj commented 5 months ago

@IS5882 Hi, I am studying cybersecurity knowledge graphs and, for further research, need to reproduce your project. Could you share the modified OIE dataset from this paper with me? My email address is yajruan@163.com; please contact me. Thank you so much!!!

Ruan3yj commented 5 months ago

Hi @nitinpi0210, could you please share the OIE dataset mentioned above with me? My email address is yajruan@163.com; please contact me. Thank you so much!!!!