I think it is a good plan for task 2 option 2. The predict function of the Ensemble class (in athena.py) can give you the matrix of raw predictions directly (something you want for the first part).

e.g., to get raw predictions:
ensemble = Ensemble(classifiers=wds, strategy=ENSEMBLE_STRATEGY.AVEP.value)
raw_prediction = ensemble.predict(x, raw=True)

If you need the final prediction (a vector of probabilities for each class, e.g. [0.1, 0.9, 0, 0, 0, 0, 0, 0, 0, 0]):
ensemble = Ensemble(classifiers=wds, strategy=ENSEMBLE_STRATEGY.AVEP.value)
prediction = ensemble.predict(x)
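For intuition, the relationship between the raw prediction matrix and the final AVEP prediction can be sketched in plain NumPy. This assumes, as the strategy's name suggests, that AVEP averages the per-class probabilities across weak defenses; the Ensemble API above is the authoritative way to compute it.

```python
import numpy as np

# Hypothetical raw predictions for one input from 3 weak defenses (WDs);
# each row is a probability vector over the 10 MNIST classes.
raw_prediction = np.array([
    [0.1, 0.7, 0.2, 0, 0, 0, 0, 0, 0, 0],
    [0.0, 0.9, 0.1, 0, 0, 0, 0, 0, 0, 0],
    [0.2, 0.6, 0.2, 0, 0, 0, 0, 0, 0, 0],
])

# Averaging over the WD axis collapses the matrix into a single
# probability vector; argmax then gives the predicted class.
prediction = raw_prediction.mean(axis=0)
print(prediction.argmax())  # predicted class: 1
```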
@cjshearer very good summary, I loved the details. Our thought was to learn a weight associated with each WD using an approach such as AMC-SSDA. Where does this fit in your plan?
One thing I would like to mention here, you need to train your model on the training data and test it on the test data. Do not train and test the model on the same dataset.
If you want to train on a data set that mixes both benign samples and adversarial examples, I would suggest you separate the AEs into 2 portions (say, 80% for training and 20% for testing).
The subsampling function in utils.data can help you separate the dataset into 2 independent and identically distributed (i.i.d.) portions (i.e., the two subsets have identical distributions), with minor updates (modifications in utils.data.subsampling):
# shuffle the selected ids
random.shuffle(sample_ids)
# get sampled data and labels
subsamples = np.asarray([data[i] for i in sample_ids])
sublabels = np.asarray([labels[i] for i in sample_ids])
# insert the following statements:
# store the remaining (not selected) samples and labels in separate arrays
notselected_samples = np.asarray([data[i] for i in range(pool_size) if i not in sample_ids])
notselected_labels = np.asarray([labels[i] for i in range(pool_size) if i not in sample_ids])
# save all the subsets
# ...
Then, you can get your training & testing data by subsampling from a 10K dataset with a ratio of 0.2 (subsamples is the testing data and notselected_samples is the training data) or 0.8 (subsamples is the training data and notselected_samples is the testing data).
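As a standalone sketch of that idea (not the project's actual subsampling signature, which may differ), an 80/20 split built from complementary index sets looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(100).reshape(100, 1)  # toy stand-in for a 10K dataset
labels = np.arange(100) % 10
ratio = 0.2

# Draw a random 20% of indices without replacement, then use a boolean
# mask so the two subsets are exact complements of each other.
pool_size = len(data)
sample_ids = rng.choice(pool_size, size=int(ratio * pool_size), replace=False)
mask = np.zeros(pool_size, dtype=bool)
mask[sample_ids] = True

subsamples, sublabels = data[mask], labels[mask]                      # 20% -> testing
notselected_samples, notselected_labels = data[~mask], labels[~mask]  # 80% -> training
print(len(subsamples), len(notselected_samples))  # 20 80
```

Because both subsets are drawn uniformly at random from the same pool, they follow the same distribution, which is the i.i.d. property mentioned above.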
Check the tutorial Task2_LearningBasedStrategy
for more information.
Thanks for the advice and instruction @MENG2010. We will be sure to properly separate the data into a test and training set.
@pooyanjamshidi for now, we are just learning a simple, 3-layer model (see below). The training/testing data needed for 16 models x 19 transformations of the MNIST set x 35MB per 10k predictions = 10.64GB. Compression brought it down to 126MB, but storing/sharing larger datasets, as would be required for the AMC-SSDA approach, would take far more space than would be reasonable to store on GH (without activating git-lfs, which is disabled for public forks). Other solutions I've found are either paid or would take too long to set up.
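For reference, the kind of compression described above can be reproduced with np.savez_compressed, which deflates each stored array; the file name and sizes here are purely illustrative, not the ones from our repo.

```python
import os
import tempfile
import numpy as np

# Hypothetical prediction matrix: 10k samples x 10 class probabilities.
# One-hot-like rows are highly redundant, so they compress well.
preds = np.zeros((10_000, 10), dtype=np.float32)
preds[np.arange(10_000), np.arange(10_000) % 10] = 1.0

path = os.path.join(tempfile.mkdtemp(), "raw_predictions.npz")
np.savez_compressed(path, predictions=preds)

raw_mb = preds.nbytes / 1e6
compressed_mb = os.path.getsize(path) / 1e6
print(f"{raw_mb:.2f} MB raw -> {compressed_mb:.2f} MB compressed")

# Round-trip check: the stored array is bit-identical.
restored = np.load(path)["predictions"]
```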
model = keras.models.Sequential([
keras.layers.InputLayer(input_shape=(wds._nb_classifiers, 10), name='WD_layer'),
keras.layers.Flatten(),
keras.layers.Dense(units=100, activation='relu', name='D1'),
keras.layers.Dense(10, name='output_layer', activation='softmax')
])
Task 2 has been submitted.
I spent a few hours this past Thursday coming up with a general breakdown of how I think we (Team Ares) should approach task 2, but I thought I should share it with others to get some feedback and maybe provide some direction for anyone feeling lost. Note that this was originally written with my team in mind and has some specific direction for them, hopefully that is not too distracting (or perhaps it may even be helpful). @MENG2010 If you have the chance, I would specifically like your feedback on whether the scope of this plan is appropriate for this task, although any feedback is appreciated.
Informal Breakdown of Our Approach
I've broken task 2 down into three steps. The first is to create some training data, from which we can learn an ensemble strategy. The second is to select a learning model, then train and test variations of that model. The third is to summarize our results and approach in the report. Throughout the entire process and for each step you complete, write down a summary of what you did and what you learned (where necessary). This will save us a lot of time when writing the report. Also, keep a high-level list of contributions you make to the task; it is your responsibility to make sure you get credit for your work.
Once we are done with this first part, we should have data that looks something like this table, where: