Discrepancy between sizes of different versions of NSL-KDD

InitRoot / NSLKDD-Dataset

NSLKDD Dataset for WEKA

MIT License


Discrepancy between sizes of different versions of NSL-KDD #3

Closed. ghisaac closed this issue 4 years ago

ghisaac commented 4 years ago

Hello!

I would like to inquire about the discrepancy in the number of records between the version of the dataset found here and the one found at . It seems that the UNB version has significantly more records, and I would like to figure out the reason why that is.

Thanks in advance!

InitRoot commented 4 years ago

Hi,

Please provide more information. There are different versions of the dataset and many subsets, so some details, such as part of the stats, would help me assist you.

ghisaac commented 4 years ago

Certainly. The dataset I am referring to is the one found at https://www.unb.ca/cic/datasets/nsl.html. The number of records in the full training set retrieved from that link is significantly higher than in the full training set (found under the "full -d" folder) in this repository. I suspect the reason is that you are using a 20% subset of the actual full NSL-KDD training set. If that is the case, could you explain why you chose to do so?

Regards

InitRoot commented 4 years ago

Ah yes, let me provide some context. The "full -d" folder doesn't represent the full dataset; it represents the full attack class for the 20% subset. The 20% subset was used for the research, and each attack class was split into its own subset.

The training for the research was done on the 20% subsets. The research papers, and the papers that followed, have more information on this, as well as further references on why. I'd prefer not to discuss this in a git issue, as it's debatable and there has been a lot of research on the topic.
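
For readers who want to check the subset relationship themselves, the following is a minimal sketch, not something taken from this repository. It counts the records in two files and tallies them by class label, so that a ratio near 0.2 would be consistent with the 20% explanation above. The function names, the `label_index` parameter, and the assumption that the label is the last comma-separated field are all illustrative. In some distributions the label is not the last column, so adjust `label_index` as needed.

```python
from collections import Counter
from pathlib import Path


def read_records(path):
    """Return the data rows of a comma-separated file.

    Assumption: files ending in .arff follow the WEKA ARFF layout, where
    the data rows come after an "@data" line and "%" starts a comment.
    Any other file is treated as plain comma-separated rows.
    """
    rows = []
    in_data = not str(path).lower().endswith(".arff")
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if not in_data:
            # Still in the ARFF header: wait for the @data marker.
            if line.lower().startswith("@data"):
                in_data = True
            continue
        rows.append(line.split(","))
    return rows


def summarize(path, label_index=-1):
    """Return (record_count, Counter of labels) for one file.

    label_index is a hypothetical parameter: the position of the class
    label within each row (last column by default).
    """
    rows = read_records(path)
    return len(rows), Counter(row[label_index] for row in rows)


def size_ratio(subset_path, full_path):
    """Return the subset's record count divided by the full file's count."""
    n_subset, _ = summarize(subset_path)
    n_full, _ = summarize(full_path)
    return n_subset / n_full
```

Calling `size_ratio(subset_path, full_path)` with the path of a file from this repository and the path of the UNB full training file (both placeholders here) gives the fraction of records present in the smaller file. The per-class counts from `summarize` show how the records are distributed across labels.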

ghisaac commented 4 years ago

Thank you for your help! I have the clarity to move further with my work.

All the best