Here's @0mwh's and my solution so far. We plan to attempt all 3 UnitaryHack challenges. We've made several changes and look forward to any questions and feedback.
Notebooks
QRNG_ Classification_Main_UnitaryHack windowed.ipynb for models with more training data from data/QRNG_ Classification_Main_UnitaryHack windowed_preprocessed_df_1717557318.csv.zst (generated in same notebook)
QRNG_ Classification_Main_UnitaryHack.ipynb for more models, preprocessing, and exploratory data analysis.
process_logical_reduction.ipynb for distribution analysis and statistical testing
Changes so far
[x] Use sliding window of 100 bits to generate more training data
Any subsequence of 100 bits is also generated by the same quantum computer
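Since any 100-bit subsequence is itself output of the same device, windowing the raw bit sequence multiplies the training data. A minimal sketch of the idea (the function name and step size are our own illustration, not lifted from the notebooks):

```python
import numpy as np

def sliding_windows(bits, width=100, step=1):
    # Every contiguous run of `width` bits is also a sample from the
    # same generator, so each long sequence yields many training rows.
    return [bits[i:i + width] for i in range(0, len(bits) - width + 1, step)]

# hypothetical example: a 1000-bit sequence yields 901 overlapping 100-bit windows
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=1000)
windows = sliding_windows(bits, width=100)
print(len(windows))  # 901
```

With `step=1` adjacent windows overlap heavily, so care is needed that near-duplicate windows don't leak between train and test splits.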
[x] Compare results against classical PRNGs
Impossible to classify classical PRNGs unless noise is added
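A hedged sketch of how a classical-PRNG control class with injected noise could be generated (the helper name and the 5% flip probability are assumptions for illustration, not values from our notebooks):

```python
import numpy as np

def prng_bits(seed, n_bits=100, flip_prob=0.0):
    # Classical PRNG bitstring; optionally flip each bit with
    # probability flip_prob to mimic hardware noise (hypothetical noise model).
    rng = np.random.default_rng(seed)
    bits = rng.integers(0, 2, size=n_bits)
    flips = rng.random(n_bits) < flip_prob
    return bits ^ flips

clean = prng_bits(0)                  # noiseless classical control
noisy = prng_bits(0, flip_prob=0.05)  # same stream with ~5% of bits flipped
```

Without the noise injection, a well-designed PRNG's output is statistically indistinguishable from uniform bits, which matches our observation that the clean control cannot be classified.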
[x] Exploratory data analysis
[x] Check frequencies of bitstrings
Each label has a set of unique bitstrings, so it should be possible to tell them apart
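The uniqueness check above amounts to a set difference over the bitstrings seen under each label. A toy sketch (the labels and bitstrings are made up for illustration):

```python
# hypothetical labeled samples: label -> list of observed bitstrings
samples = {
    "machine_1": ["0101", "0101", "0110"],
    "machine_2": ["1100", "1100", "1001"],
}

# bitstrings seen only under one label can separate the classes
per_label = {label: set(bits) for label, bits in samples.items()}
unique_to_1 = per_label["machine_1"] - per_label["machine_2"]
print(sorted(unique_to_1))  # ['0101', '0110']
```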
[x] Mann-Whitney U test to tell distributions apart
We can tell quantum computer 4 apart from the rest, but 1, 2, and 3 are quite similar
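The test compares a per-sample statistic between two devices without assuming normality. A sketch on stand-in data (the ones-count feature and the bias values are illustrative assumptions, not measurements from the challenge data):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)
# hypothetical per-window features (e.g. ones-count per 100-bit window)
machine_a = rng.binomial(n=100, p=0.50, size=500)  # balanced device
machine_b = rng.binomial(n=100, p=0.55, size=500)  # slightly biased device

stat, p = mannwhitneyu(machine_a, machine_b, alternative="two-sided")
# a small p-value indicates the two feature distributions differ
```

When the distributions are as close as machines 1–3 appear to be, the p-value stays large and the test cannot separate them.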
[x] Use PCA, tSNE, UMAP to determine clustering of bitstrings and features
The bitstrings themselves are not informative
Need some computed features
Features become more informative with larger bitstrings
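A sketch of the raw-bits vs. computed-features comparison on stand-in data (the two features shown, ones-fraction and run count, are illustrative assumptions; our notebooks use a larger feature set):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# stand-in data: 200 windows of 100 raw bits each
X_bits = rng.integers(0, 2, size=(200, 100)).astype(float)

# computed features: fraction of ones and number of bit-flips per window
ones_frac = X_bits.mean(axis=1, keepdims=True)
runs = (np.diff(X_bits, axis=1) != 0).sum(axis=1, keepdims=True)
X_feat = np.hstack([ones_frac, runs])

# project raw bits and computed features to 2D and compare
pca_bits = PCA(n_components=2).fit(X_bits)
pca_feat = PCA(n_components=2).fit(X_feat)
# for raw random bits the top 2 components explain little variance,
# which is why the raw bitstrings cluster poorly
```

t-SNE and UMAP follow the same pattern, just swapping the projector (`sklearn.manifold.TSNE`, `umap.UMAP`).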
[x] Add more features
Import errors in a few files need to be fixed by hand
Best performance so far
We got 67% accuracy with one of our models, but we caution against further interpretation until we implement more robust model testing. A limitation is that we don't have a held-out test set for a fair comparison against other project submissions.
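One way to address this would be to carve out a stratified hold-out split once, before any tuning, and only score it at the end. A minimal sketch on stand-in data (the shapes, 4-class labels, and 20% fraction are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 100))  # hypothetical windowed samples
y = rng.integers(0, 4, size=400)         # labels for 4 quantum computers

# hold out 20% once, stratified by label; never touch it during tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(X_test.shape[0])  # 80 held-out samples
```

With overlapping sliding windows, the split should additionally be done by source sequence rather than by window, so near-duplicate windows never straddle the train/test boundary.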
Next steps