adamjanovsky / AndroidMalwareCrypto

The analysis of cryptography in Android malicious applications
MIT License

Host latest dataset and create API to download it #20

Open adamjanovsky opened 2 years ago

adamjanovsky commented 2 years ago

We should find a suitable service to host the merged and cleaned records.json, together with the feature CSV files. We should then write a simple API that fetches the latest dataset from that service, something like get_dataset_from_web_latest().
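A minimal sketch of what such a function might look like, using only the Python standard library. The URL is a hypothetical placeholder (no hosting service has been chosen at this point), and the signature and default file name are assumptions for illustration, not an agreed interface.

```python
import shutil
import urllib.request
from pathlib import Path

# Hypothetical placeholder: the real download link is not known yet.
DATASET_URL = "https://example.org/path/to/dataset.csv"


def get_dataset_from_web_latest(url: str = DATASET_URL, dest: str = "dataset.csv") -> Path:
    """Download the dataset at `url` to `dest` and return the local path."""
    dest_path = Path(dest)
    # Create the target directory if it does not exist yet.
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk so large files are not held in memory.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path
```

Keeping the URL as a parameter would let the same function serve whichever host is eventually picked.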

I previously commented that publishing the feature vectors and labels (CSV files) would suffice, but I have reconsidered. Having records.json may be valuable as well.

adamjanovsky commented 2 years ago

@dmacko232 Sorry to confuse things again, but we probably won't be able to publish records.json directly. Apparently, some of the information in the file (lines of code, names of classes) is considered confidential, and releasing it would violate both our NDA with Avast and the Androzoo terms of service.

Sharing cleaned features is fine though :).

adamjanovsky commented 2 years ago

@dmacko232, I have been evaluating viable options for sharing the dataset. Some criteria that I've taken into account:

Kaggle looked good, but it requires anyone fetching the dataset to have an account, so I settled on https://data.mendeley.com/. A quick summary:

I believe (and sorry to complicate things again) that instead of hd5 we should provide a single CSV file with all features and labels. Splitting into train/test and features/labels should then happen as part of the training pipeline. The motivation is that anyone could then fetch the full dataset and use it independently of our tool. In the current state, one has to go through our tool somehow.
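To illustrate, the split could be done on the consumer side roughly as follows. This is only a sketch: the label column name "label", the 80/20 split, and the function name are assumptions, and the actual pipeline may well use pandas or scikit-learn instead of the standard library.

```python
import csv
import random


def load_and_split(csv_path, label_column="label", test_fraction=0.2, seed=0):
    """Read one CSV of features plus a label column and split it into train/test."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))

    # Shuffle reproducibly before splitting.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    test_rows, train_rows = rows[:n_test], rows[n_test:]

    def separate(subset):
        # Features are every column except the label; values stay as strings.
        features = [{k: v for k, v in r.items() if k != label_column} for r in subset]
        labels = [r[label_column] for r in subset]
        return features, labels

    X_train, y_train = separate(train_rows)
    X_test, y_test = separate(test_rows)
    return X_train, X_test, y_train, y_test
```

With a single file, anyone who does not want our split can simply read the CSV and partition it their own way.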

I created a draft dataset and published some dummy files there. See the result: https://data.mendeley.com/v1/datasets/b6hb7dk467/draft?a=3f2b36eb-fc38-411b-ba26-8aa51f6c15ea

So, if we agree on the service and the data format, I'll upload the actual datasets once we have them. I can also write the API for downloading the datasets.