JayDew opened this issue 4 years ago
Progress meeting:
Related work reading: https://jhui.github.io/2017/01/15/Machine-learning-recommendation-and-ranking/ (specifically collaborative filtering). What application are we targeting? Still real DNA data, or our own "MusicDAO" app on the Google Play Store? (See the prior brainstorm with the 900 GB dataset, etc.)
3 weeks timeline:
(song1-song5 really similar, song1-song2 kinda similar, song4-song5 really not similar)
- further improvements:
Progress: it now works for a 5 by 5 matrix!
translated collaborative filtering into a linear-regression derivative
it does not need to scale; the number of artistic items is limited
still stores the sparse matrix inefficiently
custom code, not a heavy & general framework like https://github.com/eclipse/deeplearning4j
Dataset exists
get it to work for a bigger matrix; a better data structure?
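The "collaborative filtering as a linear-regression derivative" idea above can be sketched as plain gradient descent on two latent-factor matrices. This is a generic illustration in Python (the app itself is Kotlin); the 5x5 matrix, learning rate, and factor count are made up for the example and are not the project's actual values:

```python
import numpy as np

def factorize(R, mask, k=2, lr=0.01, steps=5000, seed=0):
    """Fit R ~= U @ V.T by gradient descent on observed entries only.

    mask[i, j] == 1 marks an observed rating; the update rules are the
    derivatives of the squared error, i.e. a linear regression per entry.
    """
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = rng.normal(scale=0.1, size=(n, k))  # user factors
    V = rng.normal(scale=0.1, size=(m, k))  # item factors
    for _ in range(steps):
        err = mask * (R - U @ V.T)   # error on observed cells only
        U = U + lr * (err @ V)       # derivative step w.r.t. U
        V = V + lr * (err.T @ U)     # derivative step w.r.t. V
    return U, V

# Toy 5x5 user-item rating matrix; 0 marks a missing rating.
R = np.array([[5, 4, 0, 1, 1],
              [4, 5, 1, 0, 1],
              [1, 1, 0, 5, 4],
              [0, 1, 4, 4, 5],
              [1, 0, 5, 4, 4]], dtype=float)
mask = (R > 0).astype(float)
U, V = factorize(R, mask)
pred = U @ V.T  # predicted ratings, including the missing cells
```

The zero cells in `pred` end up filled in by the learned factors, which is the recommendation output.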
3 week sprint
not much progress
Launch of your recommender AI and MusicDAO scheduled for 7-11 December 2020, https://dicg2020.github.io/ Please try to have the first skeleton operational in November.
todo:
:laughing: Been years since I found a code bug myself... http://www.cs.bilkent.edu.tr/~guvenir/courses/CS101/op_precedence.html
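For reference, a tiny illustration of the kind of precedence pitfall that table helps catch (a generic example, not the actual bug from this project):

```python
# Exponentiation binds tighter than unary minus,
# so -2 ** 2 parses as -(2 ** 2), not (-2) ** 2.
surprising = -2 ** 2    # evaluates to -4
intended = (-2) ** 2    # evaluates to 4
```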
Please make a running example with 10 x 10 matrix or other example you can manually calculate for full correctness.
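A hand-checkable example could look like this. The sketch assumes cosine similarity over item play-count vectors, which may differ from the project's actual metric; the vectors are chosen so the answers are obvious by hand:

```python
import math

def cosine(a, b):
    """Cosine similarity between two item vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Item columns of a tiny play-count matrix, easy to verify on paper:
song1 = [1, 0, 2]
song2 = [2, 0, 4]   # same direction as song1 -> similarity ~1.0
song3 = [0, 3, 0]   # orthogonal to song1     -> similarity 0.0
```

Asserting the code's output against such manually computed values gives the "full correctness" check before scaling up.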
12h spent on honor project. Manual calculation. Excellent start for a scientific paper as the final deliverable. Target Superapp integration first, then perfect this algorithm in a realistic setting. ToDo: download the latest code, contact @xoriole about the seedbox with many songs, and get familiar with the code. https://github.com/Tribler/trustchain-superapp/issues/45 Note the continuous integration: https://github.com/Tribler/trustchain-superapp/actions Goal: truly distributed AI, integrated for music recommendation, pushed to the Google Play Store.
ToDo: full custom bachelor thesis, explore, Monday 19 April (week 4.1), team of 4. Explore an isolated microeconomy with "existential freedom" that serves as a training ground for alternatives to capitalism by employing large-scale collaboration between individuals. No fantasy project: real governance, running code with Blockchain, AI, and a democratic voting mechanism which does away with the winner-takes-all systemic bias in capitalism.
X-mas vacation, spent 12h on honor project. Cleaning of own code. ToDo1: check months left of honor project. ToDo2: compile the superapp from sources, using the knowledge of Tim. Gossip is operational using IPv8 with other phones. Superapp is going strong:
OK, then simplify the technical engineering side. We can drop all the Android complexity. This allows simpler PC-only development: cmdline running, text input/output, no GUI stuff anymore, and ease of usage. Standard Kotlin, read files, output performance graphs. Yes, you re-discovered the ground-truth problem! It's hard to establish what a good recommendation is. All this stuff is subjective and about artistic appreciation. Similarity can be a math construct, so it can be validated manually with small datasets. Test with disjoint taste profiles? Next sprint goal: performance analysis. Benchmark running time, scalability (10, 100, ..., 1M) when using small/large datasets, average similarity of recommendations, etc. Read about the 80/20 method: use 80% of your dataset for training, 20% for testing. Demo in 2 weeks to the other honor students.
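The 80/20 method mentioned above is just a shuffled split of the observed ratings. A minimal sketch; the `(user, item, rating)` triple format is an assumption for illustration, not the project's actual data layout:

```python
import random

def split_80_20(ratings, seed=42):
    """Shuffle observed ratings, then cut 80% train / 20% test."""
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical (user, item, rating) triples:
ratings = [(u, i, 1.0) for u in range(10) for i in range(10)]
train, test = split_80_20(ratings)
```

Train only on `train`, then score how well the model predicts the held-out `test` entries.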
2 days
did demo to other honor students
scalability/performance graphs for the manually validated dataset
scaling total iterations with the total number of machines
initializing the similarity matrix to random values so that there is no privacy loss in the first steps
since I didn't yet test on a dataset with an 80/20 split, it is still unclear whether to increase similarities for false-false occurrences
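The random-initialization idea from the notes above, sketched for a symmetric item-item similarity matrix (the function name and the [-1, 1] value range are illustrative choices, not the project's):

```python
import random

def init_similarity(n_items, seed=None):
    """Random symmetric item-item similarity matrix, so the first
    gossip rounds reveal nothing about anyone's real taste data."""
    rng = random.Random(seed)
    sim = [[0.0] * n_items for _ in range(n_items)]
    for i in range(n_items):
        sim[i][i] = 1.0                    # an item matches itself
        for j in range(i + 1, n_items):
            v = rng.uniform(-1.0, 1.0)     # random placeholder value
            sim[i][j] = sim[j][i] = v      # keep the matrix symmetric
    return sim

S = init_similarity(5, seed=1)
```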
Progress running quite low.
low progress; unexpectedly busy with BEP, and the coming weeks will only be worse. Possibilities of finishing this? The result is not great, but is it good enough for a paper?
Doing majority of honour project in June/July. Tough, but doable! http://millionsongdataset.com/sites/default/files/AdditionalFiles/unique_tracks.txt
downsampling: read the first 10000 lines, randomly selected 1000 lines from them (to get rid of local correlation?)
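That downsampling step can be sketched as follows; the in-memory stand-in list replaces reading the real unique_tracks.txt:

```python
import random

def downsample(lines, head=10000, keep=1000, seed=0):
    """Take the first `head` lines, then randomly keep `keep` of them,
    which breaks up any local ordering correlation in the file."""
    pool = lines[:head]
    rng = random.Random(seed)
    return rng.sample(pool, min(keep, len(pool)))

# In-memory stand-in for unique_tracks.txt:
lines = [f"track_{i}" for i in range(20000)]
sample = downsample(lines)
```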
Please push latest code (also as backup).
Infrastructure idea: 1 week quick hack. Create a single cmdline Python script: downloads the track file, uses Pony ORM and SQLite, inserts into the DB, does processing, and shows stuff in browsing mode. Try to create a big datafile using 48h of processing about item-to-item correlation from this dataset. Get quantitative data on overlap and a feeling for how to use it best; understand the level of pollution (near-duplicates?). For instance, for each item list: number of occurrences in playlists, top-10 most similar items (Pearson correlation please), etc. Recommend to keep this separate first; you can use it as a starting point to move your distributed machine learning into item-to-item recommendation.
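The per-item Pearson statistics suggested here could be computed like this; the playlist-occurrence vectors are toy data, and `top_k_similar` is a hypothetical helper, not part of the proposed script:

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length occurrence vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb) if sa and sb else 0.0

def top_k_similar(item, items, k=10):
    """Hypothetical helper: top-k items most correlated with `item`."""
    scored = [(name, pearson(items[item], vec))
              for name, vec in items.items() if name != item]
    return sorted(scored, key=lambda t: -t[1])[:k]

# Toy per-item playlist-occurrence vectors (one entry per playlist):
items = {
    "a": [1, 1, 0, 0],
    "b": [1, 1, 0, 0],   # same pattern as "a"     -> correlation ~1.0
    "c": [0, 0, 1, 1],   # opposite pattern to "a" -> correlation ~-1.0
}
```

Running this per item over the 48h-processed datafile would give both the occurrence counts and the top-10 lists mentioned above.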
(btw BEP is analysing current algorithms for non-RSA cryptography; based on hardness of multivariate equations)
Status: BSc completed :clap: :partying_face: Wrapping up the Honors article in 7-10 days sounds unrealistic. Would recommend to enjoy the summer, now that Corona is low.
Peer to peer protocol for communication
The issue is done when the following features are implemented:
Action plan