Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0

Blockchain Engineering - class of 2022 - Team Import Science #6783

Closed synctext closed 2 years ago

synctext commented 2 years ago

Your task is to gather scientific publications and engineer machine reading of scientific knowledge in BrainDAO. Thousands of scientific articles are available under a Creative Commons license in simple PDF format. Get thousands of such files on each device and start processing. Use a lightweight library for Natural Language Processing. Use the BitTorrent engine inside Superapp for efficient file sharing. Use an IPv8 community to gossip new content. What does this have to do with our "Blockchain Engineering" course? True, this is adding lots of data and processing on top of our blockchain-based BrainDAO. Reading: https://doi.org/10.3389/frma.2019.00002

13 years ago: material from Leonardo (a thesis): https://bitbucket.org/ldalonzo/p2p-search-scientific-pubs/src/master/ A few lessons I learnt (COPIED):

Other related work is the MusicDAO: feel free to re-use all that code. First steps:

Please keep it simple; this will all fail if you try to get something as ambitious as a knowledge graph operational on Android with a blockchain. Key points for grading: a merged pull request on Superapp and an architecture that works; performance and usability are secondary.

synctext commented 2 years ago
sisko444 commented 2 years ago

This week's progress

Our plan for next week

Questions for the meeting

To be altered during the meeting with Johan on Monday

synctext commented 2 years ago
sisko444 commented 2 years ago

We have switched from a bottom-up to a top-down approach: instead of starting from a stub, we will implement separate functionalities and later consolidate them into one app.

sisko444 commented 2 years ago

Keyword extraction

A word list of 20k words was found at http://corpus.leeds.ac.uk/list.html under a Creative Commons license. The alternatives considered were a larger dataset, a lemmatized dataset, and a paid dataset. The larger dataset was clearly too large, as 5 MB in memory just for this purpose seems to be overkill: https://www.kaggle.com/rtatman/english-word-frequency @synctext never mind, we asked a question but have already solved it. It looks like a healthy amount of 60k word stems together with a 0.7 MB lightweight Java stemming library will yield the best outcome for this.
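A minimal Python sketch of how a stem frequency list can drive keyword extraction, assuming keywords are the stems that occur far more often in a document than in general English. The `naive_stem` function and the `BACKGROUND_FREQ` table are hypothetical stand-ins for the Java stemming library and the 60k stem list, and the scoring formula is an illustrative assumption, not necessarily what the team implemented.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for the 60k-stem frequency list:
# stem -> relative frequency in general English (made-up values).
BACKGROUND_FREQ = {
    "the": 0.06, "of": 0.03, "and": 0.028, "in": 0.02,
    "is": 0.01, "use": 0.002, "network": 0.0001,
}
# Assumed frequency for stems missing from the list (treated as rare).
UNSEEN_FREQ = 1e-6


def naive_stem(word: str) -> str:
    """Toy suffix stripper; a real stemmer would replace this."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_keywords(text: str, top_n: int = 5) -> list:
    """Rank stems by how over-represented they are versus general English."""
    stems = [naive_stem(w) for w in re.findall(r"[a-z]+", text.lower())]
    if not stems:
        return []
    counts = Counter(stems)
    total = len(stems)
    scored = []
    for stem, count in counts.items():
        doc_freq = count / total
        background = BACKGROUND_FREQ.get(stem, UNSEEN_FREQ)
        # Weight the in-document frequency by its log-ratio to the background.
        scored.append((stem, doc_freq * math.log(doc_freq / background)))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_n]
```

Calling `extract_keywords(pdf_text)` on the text of a parsed PDF returns `(stem, score)` pairs, so common words such as "the" sink to the bottom while domain terms rise to the top.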

marko-matusovic-personal commented 2 years ago

Progress notes

Try our apk

https://github.com/keonchennl/trustchain-superapp/blob/db41c2887a37d458e055f1b538d3d9c552bf10da/app-debug.apk

Screenshots from the UI

### Work done

- board with issues: https://github.com/keonchennl/trustchain-superapp/projects/2#card-78758433
- created UI
- backend for PDF parsing
- backend for keyword extraction

### Discussion

- contact for the musicDAO developer
- verify query forwarding idea

synctext commented 2 years ago
synctext commented 2 years ago

Creative Commons PDFs: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/d8/1d/ (many GBytes, many directories)

sisko444 commented 2 years ago

This week the query handler was written, which includes the document ranking methods. We now also have over 300 PDFs for testing. Besides that, we can now pass peer-to-peer messages. The PDF rating is shown in this screenshot: [image]
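As a rough illustration of what a query handler with document ranking can look like, here is a Python sketch. It assumes each parsed PDF is stored as a keyword-to-weight dictionary and that a document's score is the sum of the weights of the query terms it contains. The function name, the `INDEX` data, and the scoring rule are hypothetical; the thread does not specify the ranking methods actually used.

```python
def rank_documents(query: str, index: dict) -> list:
    """Return (document, score) pairs for matching documents, best first.

    `index` maps a document name to its {keyword: weight} dictionary.
    """
    terms = set(query.lower().split())
    ranked = []
    for doc, keywords in index.items():
        # Sum the weights of all keywords that appear in the query.
        score = sum(weight for kw, weight in keywords.items() if kw in terms)
        if score > 0:
            ranked.append((doc, score))
    # Highest score first; ties broken by document name.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


# Hypothetical index of three locally parsed PDFs.
INDEX = {
    "dao.pdf": {"blockchain": 3.0, "governance": 2.0},
    "p2p.pdf": {"gossip": 2.5, "blockchain": 1.0},
    "nlp.pdf": {"stemming": 2.0, "keyword": 1.5},
}
```

With this data, `rank_documents("blockchain gossip", INDEX)` ranks `p2p.pdf` above `dao.pdf`, because the former matches both query terms.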

synctext commented 2 years ago
sisko444 commented 2 years ago

Continuous scoring of locally parsed documents as the user types in the search bar: [image]
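One way to score documents while the query is still being typed is to treat the last token as a prefix. The Python sketch below assumes that approach and a keyword-to-weight dictionary per document; both function names and the prefix rule are illustrative assumptions rather than the app's actual logic.

```python
def score_partial_query(partial: str, keywords: dict) -> float:
    """Score one document against a query that is still being typed.

    Complete tokens must match a keyword exactly; the last token is treated
    as a prefix because the user may not have finished typing it.
    """
    tokens = partial.lower().split()
    if not tokens:
        return 0.0
    *complete, last = tokens
    score = sum(keywords.get(tok, 0.0) for tok in complete)
    score += sum(w for kw, w in keywords.items() if kw.startswith(last))
    return score


def live_results(partial: str, index: dict) -> list:
    """Re-rank all local documents; meant to be called on every keystroke."""
    scores = {doc: score_partial_query(partial, kws) for doc, kws in index.items()}
    return sorted((d for d, s in scores.items() if s > 0),
                  key=lambda d: (-scores[d], d))
```

A search bar's text-changed callback would call `live_results` with the current input and redraw the result list from the returned document names.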

synctext commented 2 years ago
sisko444 commented 2 years ago

@TODO

synctext commented 2 years ago
sisko444 commented 2 years ago

The day of reckoning has come and we have to make our final pull request to the actual mother repo. To tie the current state back to the previous feedback:

Some GIFs of the app functioning: local_upload, peers, search_in_keywords, url_upload

sisko444 commented 2 years ago

The APK file download: [link]. This link expires in one week.

synctext commented 2 years ago

A much more polished level:

keonchennl commented 2 years ago

Some changes that have been merged into master involve an invalid library (info.blockchain.api 1.1.4), which breaks the master pipeline. See https://github.com/Tribler/trustchain-superapp/pull/113#issuecomment-1118951508 and https://github.com/Tribler/trustchain-superapp/pull/113#issuecomment-1119337108

devos50 commented 2 years ago

This work has been completed, closing the issue 👍

synctext commented 2 years ago

Is the LiteratureDAO source code here? https://github.com/keonchennl/trustchain-superapp/tree/lit-dao/literaturedao Related work: a novel public review model, great idea. Use pre-print services and a public review process. No more rejects or accepts.

Great dataset: [170,919 Creative Commons articles in the arXiv for biology](http://api.biorxiv.org/reports/content_summary)
synctext commented 1 year ago