JayDew opened this issue 4 years ago
Progress meeting:
Related work reading: https://jhui.github.io/2017/01/15/Machine-learning-recommendation-and-ranking/ (specifically collaborative filtering). What application are we targeting? Still real DNA data, or our own "MusicDAO" app on the Google Play Store? (See the prior brainstorm with the 900 GB dataset, etc.)
3 weeks timeline:
(song1-song5 really similar, song1-song2 kinda similar, song4-song5 really not similar)
- further improvements:
Progress: it now works for a 5 by 5 matrix!
translated collaborative filtering into a linear-regression derivative
it does not need to scale; the number of artistic items is limited
still stores the sparse matrix inefficiently
custom code, not a heavy & general framework like https://github.com/eclipse/deeplearning4j
Dataset exists
get it to work for a bigger matrix; a better data structure?
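The "collaborative filtering as a linear-regression derivative" idea above can be sketched as plain gradient descent on two latent-factor matrices. This is a generic illustration in Python (the app itself is Kotlin); the 5x5 matrix, learning rate, and factor count are made up for the example and are not the project's actual values:

```python
import numpy as np

def factorize(R, mask, k=2, lr=0.01, steps=5000, seed=0):
    """Fit R ~= U @ V.T by gradient descent on observed entries only.

    mask[i, j] == 1 marks an observed rating; the update rules are the
    derivatives of the squared error, i.e. a linear regression per entry.
    """
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = rng.normal(scale=0.1, size=(n, k))  # user factors
    V = rng.normal(scale=0.1, size=(m, k))  # item factors
    for _ in range(steps):
        err = mask * (R - U @ V.T)   # error on observed cells only
        U = U + lr * (err @ V)       # derivative step w.r.t. U
        V = V + lr * (err.T @ U)     # derivative step w.r.t. V
    return U, V

# Toy 5x5 user-item rating matrix; 0 marks a missing rating.
R = np.array([[5, 4, 0, 1, 1],
              [4, 5, 1, 0, 1],
              [1, 1, 0, 5, 4],
              [0, 1, 4, 4, 5],
              [1, 0, 5, 4, 4]], dtype=float)
mask = (R > 0).astype(float)
U, V = factorize(R, mask)
pred = U @ V.T  # predicted ratings, including the missing cells
```

The zero cells in `pred` end up filled in by the learned factors, which is the recommendation output.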
3 week sprint
not much progress
Launch of your recommender AI and MusicDAO scheduled for 7-11 December 2020, https://dicg2020.github.io/ Please try to have the first skeleton operational in November.
todo:
:laughing: Been years since I found a code bug myself... http://www.cs.bilkent.edu.tr/~guvenir/courses/CS101/op_precedence.html
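For reference, a tiny illustration of the kind of precedence pitfall that table helps catch (a generic example, not the actual bug from this project):

```python
# Exponentiation binds tighter than unary minus,
# so -2 ** 2 parses as -(2 ** 2), not (-2) ** 2.
surprising = -2 ** 2    # evaluates to -4
intended = (-2) ** 2    # evaluates to 4
```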
Please make a running example with 10 x 10 matrix or other example you can manually calculate for full correctness.
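A hand-checkable example could look like this. The sketch assumes cosine similarity over item play-count vectors, which may differ from the project's actual metric; the vectors are chosen so the answers are obvious by hand:

```python
import math

def cosine(a, b):
    """Cosine similarity between two item vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Item columns of a tiny play-count matrix, easy to verify on paper:
song1 = [1, 0, 2]
song2 = [2, 0, 4]   # same direction as song1 -> similarity ~1.0
song3 = [0, 3, 0]   # orthogonal to song1     -> similarity 0.0
```

Asserting the code's output against such manually computed values gives the "full correctness" check before scaling up.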
12h spent on honor project. Manual calculation. Excellent start for a scientific paper as the final deliverable. Target Superapp integration first, then perfect this algorithm in a realistic setting. ToDo: download the latest code, contact @xoriole about the seedbox with many songs, and get familiar with the code. https://github.com/Tribler/trustchain-superapp/issues/45 Note the continuous integration: https://github.com/Tribler/trustchain-superapp/actions Goal: truly distributed AI, integrated for music recommendation, pushed to the Google Play Store.
ToDo: full custom bachelor thesis, explore, Monday 19 April (week 4.1), team of 4. Explore an isolated microeconomy with "existential freedom" that serves as a training ground for alternatives to capitalism by employing large-scale collaboration between individuals. No fantasy project: real governance, running code with Blockchain, AI, and a democratic voting mechanism which does away with the winner-takes-all systemic bias in capitalism.
X-mas vacation, spent 12h on honor project. Cleaning of own code. ToDo1: check months left of honor project. ToDo2: compile the superapp from sources, using the knowledge of Tim. Gossip is operational using IPv8 with other phones. Superapp is going strong:
OK, then simplify the technical engineering side. We can drop all the Android complexity. This allows simpler PC-only development: cmdline running, text input/output, no GUI stuff anymore, and ease of usage. Standard Kotlin, read files, output performance graphs. Yes, you re-discovered the ground-truth problem! It's hard to establish what a good recommendation is. All this stuff is subjective and about artistic appreciation. Similarity can be a math construct, so it can be validated manually with small datasets. Test with disjoint taste profiles? Next sprint goal: performance analysis. Benchmark running time, scalability (10, 100, ..., 1M) when using small/large datasets, average similarity of recommendations, etc. Read about the 80/20 method: use 80% of your dataset for training, 20% for testing. Demo in 2 weeks to the other honor students.
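The 80/20 method mentioned above is just a shuffled split of the observed ratings. A minimal sketch; the `(user, item, rating)` triple format is an assumption for illustration, not the project's actual data layout:

```python
import random

def split_80_20(ratings, seed=42):
    """Shuffle observed ratings, then cut 80% train / 20% test."""
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical (user, item, rating) triples:
ratings = [(u, i, 1.0) for u in range(10) for i in range(10)]
train, test = split_80_20(ratings)
```

Train only on `train`, then score how well the model predicts the held-out `test` entries.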
2 days
did demo to other honor students
scalability/performance graphs for the manually validated dataset
scaling total iterations with the total number of machines
initializing the similarity matrix to random values so that there is no privacy loss in the first steps
since I didn't yet test on a dataset with an 80/20 split, it is still unclear whether to increase similarities for false-false occurrences
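The random-initialization idea from the notes above, sketched for a symmetric item-item similarity matrix (the function name and the [-1, 1] value range are illustrative choices, not the project's):

```python
import random

def init_similarity(n_items, seed=None):
    """Random symmetric item-item similarity matrix, so the first
    gossip rounds reveal nothing about anyone's real taste data."""
    rng = random.Random(seed)
    sim = [[0.0] * n_items for _ in range(n_items)]
    for i in range(n_items):
        sim[i][i] = 1.0                    # an item matches itself
        for j in range(i + 1, n_items):
            v = rng.uniform(-1.0, 1.0)     # random placeholder value
            sim[i][j] = sim[j][i] = v      # keep the matrix symmetric
    return sim

S = init_similarity(5, seed=1)
```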
Progress running quite low.
low progress; unexpectedly busy with BEP, and the coming weeks will only be worse. Possibilities of finishing this? The result is not great, but is it good enough for a paper?
Doing majority of honour project in June/July. Tough, but doable! http://millionsongdataset.com/sites/default/files/AdditionalFiles/unique_tracks.txt
downsampling: read the first 10000 lines, randomly selected 1000 lines from them (to get rid of local correlation?)
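That downsampling step can be sketched as follows; the in-memory stand-in list replaces reading the real unique_tracks.txt:

```python
import random

def downsample(lines, head=10000, keep=1000, seed=0):
    """Take the first `head` lines, then randomly keep `keep` of them,
    which breaks up any local ordering correlation in the file."""
    pool = lines[:head]
    rng = random.Random(seed)
    return rng.sample(pool, min(keep, len(pool)))

# In-memory stand-in for unique_tracks.txt:
lines = [f"track_{i}" for i in range(20000)]
sample = downsample(lines)
```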
Please push latest code (also as backup).
Infrastructure idea: 1 week quick hack. Create a single cmdline Python script: downloads the track file, uses Pony ORM and SQLite, inserts into the DB, does processing, and shows stuff in browsing mode. Try to create a big datafile using 48h of processing about item-to-item correlation from this dataset. Get quantitative data on overlap and a feeling for how to use it best; understand the level of pollution (near-duplicates?). For instance, for each item list: number of occurrences in playlists, top-10 most similar items (Pearson correlation please), etc. Recommend to keep this separate first; you can use it as a starting point to move your distributed machine learning into item-to-item recommendation.
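The per-item Pearson statistics suggested here could be computed like this; the playlist-occurrence vectors are toy data, and `top_k_similar` is a hypothetical helper, not part of the proposed script:

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length occurrence vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb) if sa and sb else 0.0

def top_k_similar(item, items, k=10):
    """Hypothetical helper: top-k items most correlated with `item`."""
    scored = [(name, pearson(items[item], vec))
              for name, vec in items.items() if name != item]
    return sorted(scored, key=lambda t: -t[1])[:k]

# Toy per-item playlist-occurrence vectors (one entry per playlist):
items = {
    "a": [1, 1, 0, 0],
    "b": [1, 1, 0, 0],   # same pattern as "a"     -> correlation ~1.0
    "c": [0, 0, 1, 1],   # opposite pattern to "a" -> correlation ~-1.0
}
```

Running this per item over the 48h-processed datafile would give both the occurrence counts and the top-10 lists mentioned above.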
(btw BEP is analysing current algorithms for non-RSA cryptography; based on hardness of multivariate equations)
Status: BSc completed :clap: :partying_face: Wrapping up the Honors article in 7-10 days sounds unrealistic. Would recommend to enjoy the summer, now that Corona is low.
Peer to peer protocol for communication
The issue is done when the following features are implemented:
Action plan