Blockchain Engineering - class of 2022 - Team Import Science

synctext commented 2 years ago

Your task is to gather scientific publications and engineer machine reading of scientific knowledge in BrainDAO. Thousands of scientific articles are available with Creative Commons copyright license in simple .PDF format. Get thousands of such files on each device and start processing. Use a light library for Natural Language Processing. Use the BitTorrent engine inside Superapp for efficient file sharing. Use IPv8 community to gossip new content. What does this have to do with our "Blockchain Engineering" course? True, this is adding lots of data and processing on top of our blockchain-based BrainDAO. Reading: https://doi.org/10.3389/frma.2019.00002

13 years ago: material from Leonardo: https://bitbucket.org/ldalonzo/p2p-search-scientific-pubs/src/master/ . a thesis. A few lessons I learnt (COPIED):

Extracting text from PDFs is (was) not a trivial exercise. At that time I used https://linux.die.net/man/1/pdftohtml. These days there are much better options.
Parsing citations is a tricky exercise. I used this tool https://github.com/knmnyn/ParsCit. I saw they perfected it using deep learning.
I wrote code to manually build an inverted index to support full-text search. There's probably something off the shelf that can be reused and I could have better spent the time elsewhere.
I wrote code to manually cluster documents using Latent Semantic Analysis. Again, there's probably some library out there that does the same and I could have better spent the time performing measurements on how clustering works on very large collections.
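
For readers unfamiliar with the inverted index mentioned in the lessons above, here is a minimal sketch in Python. It is not Leonardo's code; the function names and the three sample documents are made up for illustration. It maps each word to the set of documents containing it and answers queries with AND semantics.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query word (AND semantics)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Made-up sample documents, standing in for text extracted from PDFs.
docs = {
    "a.pdf": "Methane production on Mars using in-situ resources",
    "b.pdf": "Peer-to-peer search for scientific publications",
    "c.pdf": "Mars rover search strategies",
}
index = build_inverted_index(docs)
print(sorted(search(index, "mars search")))
```

Only `c.pdf` contains both query words, so it is the only hit. An off-the-shelf full-text engine would add ranking and persistence on top of the same idea.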

Other related work is the MusicDAO: feel free to re-use all that code. First steps:

[ ] compiling the superapp from the source
[ ] select a library and try to get this Natural Language Lib compiling for Android
[ ] read the pointer on this ticket + read the IPv8 documentation https://py-ipv8.readthedocs.io/ + Trustchain https://trustchain.readthedocs.io/en/latest/trustchain.html.
[ ] (Manually) create a directory of .PDF files to parse. Creative Commons. At least 25 articles for the next meeting. (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/)
[ ] Week 4 goal: automatically parse a .PDF file, extract good possible keywords for possible user search, build 1 index.
[ ] Week 6 goal: distributed search

Please keep it simple, this will all fail if you try to get something as ambitious as a knowledge graph operational on Android with a blockchain. Key points for grading: merged pull request on Superapp and architecture that works; performance and usability are secondary.

synctext commented 2 years ago

please self-organise efficiently and avoid overlap
- Just stick with pdf-to-html for now and get that working first, one week with 2 people?
- Spend 1 day to get the first dataset
- get 10 examples PDFs, manual pdf-to-html cmdline
- get the citation parser going
- Or is there no Kotlin/Java parser available? (male first names dataset)
- Read: http://ceur-ws.org/Vol-2563/aics_25.pdf "GIANT is a large dataset with 991,411,100 XML labeled reference strings" If parsing of citations is unsolved, focus on Libtorrent sharing, relevant keyword ranking, IPv8 remote search, and easy injection of new PDFs.
- integrate pdf converter plus parser for this week sprint. No superapp yet?
All have the Superapp compiling
No parsing of .PDF yet on Kotlin
Found a URL, no collection of PDF files yet. Note Creative Commons. https://api.semanticscholar.org/corpus/download/
Sisko: found various libraries of .PDF conversion and citations parsing.
- https://github.com/knmnyn/ParsCit Requires Perl :stop_sign: :no_entry:
- https://github.com/WING-NUS/Neural-ParsCit (library from hell, "Python 2.7 (works in Python 3 but not fully supported), with Numpy, Theano and Gensim installed. scikit-learn is needed for model evaluation if you are training a new model.")
- https://github.com/allenai/s2search
- https://github.com/itext/itext7/graphs/contributors (library from hell, XML parser, bar codes, full blown SVG, etc.)
- https://github.com/topics/pdf-to-html (entry point)
- https://www.google.com/search?q=citation+parsing+site%3Agithub.com (google entry point)
Key focus: get the whole chain of parse pdf, extract keywords, local search, and distributed search. efficiency is secondary

sisko444 commented 2 years ago

This week's progress

We got a PDF parser with a wrapper, for compatibility with the Apache license: https://github.com/TomRoush/PdfBox-Android
Visual: https://imgur.com/a/Y1GBZYN
MusicDAO code was copied and used as a basis for a stub
Reading into the Android basics of project structure and fragments
Made clearer tasks to be picked up
10 PDFs next to the src folder

Our plan for next week

First make a more proper and compiling stub (Sisko rush Monday)
Fragments for: catalog, reader, document, dialogue for upload, search, search result
The uploading and saving of PDF into trustchain
Maybe some NLP, keyword extraction or extra citation parsing endeavours
Make repo public: https://github.com/keonchennl/trustchain-superapp

Questions for the meeting

No images parsing
What kinds of keyword extraction / NLP will be next after the MVP

To be altered during the meeting with Johan on Monday

synctext commented 2 years ago

[ ] please have some .APK, otherwise you're really behind schedule with this course in WK6.
[X] select local .pdf file on Android. Parse PDF to HTML (now to text for MvP) (not yet done for 10 pdf dataset) Show top-10 words
- [ ] extract 10 most used keywords from this article (naive approach, no natural language parsing)
- [ ] normalise with average word frequency
- [ ] search for top-10 keywords in local files only
- [ ] everything in main memory, no sqlite details
[ ] ~~https://github.com/knmnyn/ParsCit Requires Perl :stop_sign: :no_entry:~~ No cloud-free citation parser for Java, Android
[ ] integrate the above MvP inside Superapp
[ ] download from others (use Libtorrent seeding, share magnet links inside IPv8 overlay) (note new PR for MusicDAO which is build around magnet sharing)
[ ] search for articles using simple keyword matching (use local files and remote search example)
- [ ] https://github.com/rads/sqlite-okapi-bm25 (dont try full text search, use top-N keywords)
[ ] external reader (open .PDF to read)
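
The naive MvP steps in the checklist above (top-10 words, keyword search over local files, everything in main memory) could look roughly like this Python sketch. The stop-word list, function names, and sample texts are assumptions for illustration; the team's actual implementation is in Kotlin and may differ.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real list would be longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}

def top_keywords(text, n=10):
    """Return the n most frequent non-stop-words, most frequent first."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    return [word for word, _ in Counter(words).most_common(n)]

def search_local(keyword_table, query):
    """Rank documents by how many query words appear in their top keywords."""
    query_words = set(re.findall(r"[a-z]+", query.lower()))
    hits = []
    for name, keywords in keyword_table.items():
        overlap = len(query_words & set(keywords))
        if overlap:
            hits.append((overlap, name))
    return [name for overlap, name in sorted(hits, key=lambda h: (-h[0], h[1]))]

# Made-up texts standing in for parsed PDFs; everything stays in memory.
texts = {
    "mars.pdf": "Methane on Mars. Methane production for Mars missions.",
    "p2p.pdf": "Gossip in peer networks. Gossip spreads content to peer nodes.",
}
keyword_table = {name: top_keywords(text) for name, text in texts.items()}
print(search_local(keyword_table, "mars methane"))
```

Keeping only the top-N keywords per document makes the table small enough to hold in memory, at the cost of missing words that occur rarely in a document.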

sisko444 commented 2 years ago

We have switched from a bottom-up to a top-down approach, meaning no stub; rather, we will implement separate functionalities and later consolidate them into one app.

sisko444 commented 2 years ago

Keyword extraction

A word list of 20k words was found at http://corpus.leeds.ac.uk/list.html under a Creative Commons license. The alternatives considered were a larger dataset, a lemmatized dataset, and a paid dataset. The larger dataset was clearly too large, as 5 MB in memory just for this purpose seems to be overkill: https://www.kaggle.com/rtatman/english-word-frequency @synctext never mind, we asked a question but we solved it already. It looks like a healthy amount of 60k word stems together with a 0.7 MB lightweight Java stemming library will yield the best outcome for this.
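
To illustrate how a stem frequency list can be used for keyword extraction, here is a hedged Python sketch. It scores each stem by its frequency in the document relative to its frequency in general English. The reference values are made up, and `naive_stem` is a crude stand-in for a real stemming library; neither reflects the actual word list or the Java stemmer mentioned above.

```python
import re
from collections import Counter

# Hypothetical reference frequencies (occurrences per million words of general
# English). The values are invented for illustration, not taken from any list.
REFERENCE_PER_MILLION = {
    "the": 60000.0, "of": 30000.0, "on": 7000.0,
    "result": 400.0, "study": 350.0, "mars": 5.0, "methane": 2.0,
}
DEFAULT_PER_MILLION = 1.0  # assumed frequency for stems missing from the list

def naive_stem(word):
    """Crude stand-in for a real stemmer: strip a plural 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 4 else word

def relative_keywords(text, n=10):
    """Rank stems by document frequency divided by reference frequency."""
    stems = [naive_stem(w) for w in re.findall(r"[a-z]+", text.lower())]
    counts = Counter(stems)
    total = sum(counts.values())
    scores = {}
    for stem, count in counts.items():
        doc_per_million = 1_000_000 * count / total
        reference = REFERENCE_PER_MILLION.get(stem, DEFAULT_PER_MILLION)
        scores[stem] = doc_per_million / reference
    return sorted(scores, key=lambda s: (-scores[s], s))[:n]

text = "The study results: methane on Mars. Methane results of the study of methane."
print(relative_keywords(text, 3))
```

Common words such as "the" occur often in the document but even more often in the reference, so they sink to the bottom without needing a separate stop-word list. Stems absent from the reference list get a low assumed frequency and therefore rank high, which is a design choice worth checking against real data.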

marko-matusovic-personal commented 2 years ago

Progress notes

Try our apk

https://github.com/keonchennl/trustchain-superapp/blob/db41c2887a37d458e055f1b538d3d9c552bf10da/app-debug.apk

Screenshots from the UI


### Work done
- board with issues https://github.com/keonchennl/trustchain-superapp/projects/2#card-78758433
- created UI
- backend for PDF parsing
- backend for keyword extraction

### Discussion
- contact for the musicDAO developer
- verify query forwarding idea

synctext commented 2 years ago

Solid progress in Week 6 (60% done of course, if nominal and linear) :confetti_ball:
Please have a well tested prototype APK for next sprint meeting
Few day task, 1 person responsible for getting more than 11 .PDF test cases
Scientific grounding: https://scholar.google.co.uk/scholar?q=relative+word+frequency+information+retrieval
About sharing with others feature.
- Include a magic: collect 1 new random .PDF from the network per 60 second (user config)
- magnet link based
- Gossip, spread, and query: 15-years ago work by the Tribler lab
- no query broadcasting (not incentive compatible)
- collect, parse and also conduct a remote search query of direct neighbors
- Assume random strangers on the Internet can be trusted
- RemoteQuery example: from 1 phone to 10 neighbors LiteratureDAO Query:mars isru methane
hopefully close the loop next sprint: parse, query, gossip
Remember, you need to have an accepted PR on the superapp as a requirement for this course. (wrap up 13,14 April?)
- do PR of finished parser only part?
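
The remote search described above (query direct neighbors only, no query broadcasting) can be sketched as follows. This is an in-memory Python simulation with hypothetical names; it involves no IPv8 networking and is not the team's code. Each peer answers from its own files and never forwards the query.

```python
class Peer:
    """In-memory stand-in for one phone; no real networking is involved."""

    def __init__(self, name, keyword_table):
        self.name = name
        self.keyword_table = keyword_table  # file name -> set of keywords
        self.neighbors = []

    def local_search(self, query_words):
        """Return (matches, file name) pairs for local files, best first."""
        hits = []
        for file_name, keywords in self.keyword_table.items():
            matches = len(query_words & keywords)
            if matches:
                hits.append((matches, file_name))
        return sorted(hits, key=lambda h: (-h[0], h[1]))

    def handle_remote_query(self, query_words):
        """Answer from local files only; the query is never forwarded."""
        return self.local_search(query_words)

    def remote_search(self, query):
        """Ask each direct neighbor and merge their answers."""
        query_words = set(query.lower().split())
        results = []
        for neighbor in self.neighbors:
            for matches, file_name in neighbor.handle_remote_query(query_words):
                results.append((matches, file_name, neighbor.name))
        return sorted(results, key=lambda r: (-r[0], r[1], r[2]))

# Made-up topology: alice -> bob -> carol.
alice = Peer("alice", {})
bob = Peer("bob", {"isru.pdf": {"mars", "isru", "oxygen"}})
carol = Peer("carol", {"methane.pdf": {"mars", "methane"}})
alice.neighbors = [bob]
bob.neighbors = [carol]
print(alice.remote_search("mars isru methane"))
```

Alice only sees Bob's file, because Bob does not pass the query on to Carol. Files spread through the network by gossip instead, so results improve as neighbors collect more content.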

synctext commented 2 years ago

PDFs Creative commons: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/d8/1d/ (many GBytes, many directories)

sisko444 commented 2 years ago

This week the query handler was written, which includes the document ranking methods. We now also have over 300 PDFs for testing. Besides that, we can now pass peer-to-peer messages. Below we can see the PDF rating.
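
The thread does not say which ranking methods the query handler uses. As one possibility, following the Okapi BM25 pointer earlier in this ticket, a document ranking over keyword lists could be sketched in Python like this. The parameter defaults `k1=1.2` and `b=0.75` are common choices, and the sample documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a non-empty list of terms) with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / n_docs
    # Number of documents each term appears in.
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))
    scores = {}
    for name, terms in docs.items():
        term_counts = Counter(terms)
        score = 0.0
        for term in query_terms:
            tf = term_counts[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Length normalisation dampens the advantage of long documents.
            norm = tf + k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[name] = score
    return scores

# Made-up keyword lists standing in for parsed PDFs.
docs = {
    "a.pdf": ["mars", "methane", "isru", "mars"],
    "b.pdf": ["mars", "rover", "wheel", "terrain"],
    "c.pdf": ["gossip", "peer", "network", "overlay"],
}
scores = bm25_scores(["mars", "methane"], docs)
print(sorted(scores, key=scores.get, reverse=True))
```

Here `a.pdf` ranks first because it matches both query terms, `b.pdf` second with one match, and `c.pdf` scores zero.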

synctext commented 2 years ago

Not much visible progress
4 people know and understand IPv8, non-visible learnings
Concerned about the integration of 6 people and branches.
Very concerning, unable to install an APK which is compiled from sources on Android (only emulator).
Recommend doing weekly fixed meeting. The Wednesday morning slot should be open for everybody.
Assessment, before 15 April. Or move into week 4.1
Have a tested .APK for next meeting (otherwise you're behind schedule)

sisko444 commented 2 years ago

This week I implemented the parsing of PDFs as a coroutine to make it a non-blocking process. However, it still blocks (intuitively I think that is because of the nature of the task, not the thread executing it). I also worked on storing and loading the metadata of PDF files for search queries and made sure we can run the app on a physical mobile device.
Peter and Rahul worked on passing the search string from the GUI to the backend and implementing a PDF import button in the GUI. Throwing the intent and getting the results back in logging is still a work in progress.
Keon has been working on seeding for downloading the PDFs from peers. Keon and I settled on an architecture to handle queries. We think it's best to save the keywords locally and transmit only the results of a query after doing a local comparison.
Quinten worked on the UI layout and also worked on asynchronous tasking. He and I will look at this together more to make the PDF parsing non-blocking.
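
A general point that may be relevant to the blocking seen above: wrapping CPU-bound work in a coroutine does not by itself move it off the thread that runs the coroutine. The Python `asyncio` sketch below shows the same effect by analogy. It is not the app's Kotlin code, and whether this is the actual cause in the app is a guess. `parse_pdf_blocking` is a hypothetical stand-in that does no real parsing.

```python
import asyncio
import threading

def parse_pdf_blocking(path):
    """Stand-in for slow, CPU-bound PDF parsing (no real parsing happens)."""
    checksum = sum(i * i for i in range(200_000))  # simulated work
    return {"path": path, "thread": threading.get_ident(), "checksum": checksum}

async def parse_inline(path):
    """Anti-pattern: the blocking call runs on the event-loop thread itself."""
    return parse_pdf_blocking(path)

async def parse_in_background(path):
    """Hand the blocking call to a worker thread so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, parse_pdf_blocking, path)

main_thread = threading.get_ident()
inline = asyncio.run(parse_inline("paper.pdf"))
background = asyncio.run(parse_in_background("paper.pdf"))
print("inline ran on main thread:", inline["thread"] == main_thread)
print("background ran on main thread:", background["thread"] == main_thread)
```

In Kotlin the likely counterpart is to run the parsing inside `withContext(Dispatchers.Default)` (or `Dispatchers.IO` for file access) rather than on the main dispatcher.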

Continuous scoring of locally parsed documents as a user types in the search bar:

The tested working APK can be downloaded through WeTransfer, it is zipped: https://we.tl/t-ZnaUbho1X0

synctext commented 2 years ago

installs, but fails to run completely; crash of your app. :astonished: (end of week 8, so getting nervous/tight)
No idea what the GUI will show????
Please get the .PDF parsing stable as background task. Standard dispatcher of Android.
Storage of .PDF files, local app context only app-restricted storage, and import new .PDF from any URL.
Feel free to copy this approach; their libtorrent and EVA protocol fallback; plus EVA protocol fix.
Just to spread the files: copy the above approach, gossip magnet link to neighbors, try downloading for 30 seconds or so, fallback to EVA protocol
By default download all .PDF files that you hear about, security must be ignored, that is for class of 2023.
Whole network thus gets to hear about all .PDF files eventually
Remote search: ask neighbors to check their local stored files
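
The spreading strategy above (gossip magnet links to neighbors, try the torrent for about 30 seconds, fall back to EVA) can be sketched as an in-memory Python simulation. The class, function names, and the example magnet link are hypothetical; the two transfer callables stand in for libtorrent and the EVA protocol and do no real networking.

```python
class Node:
    """In-memory stand-in for one device taking part in the gossip."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []
        self.known_links = set()

    def hear(self, magnet_link):
        """Record a link and pass it on once; repeats are ignored."""
        if magnet_link in self.known_links:
            return
        self.known_links.add(magnet_link)
        for neighbor in self.neighbors:
            neighbor.hear(magnet_link)

def fetch_with_fallback(magnet_link, try_torrent, try_eva, torrent_timeout=30.0):
    """Try the torrent download first; fall back to a direct transfer."""
    data = try_torrent(magnet_link, torrent_timeout)
    if data is not None:
        return data, "torrent"
    data = try_eva(magnet_link)
    if data is not None:
        return data, "eva"
    return None, "failed"

# Hypothetical transfer functions; a real app would call libtorrent and EVA.
def torrent_never_finishes(magnet_link, timeout_seconds):
    return None  # simulate a torrent that finds no seeders in time

def eva_transfer(magnet_link):
    return b"%PDF-1.4 example bytes"

# Made-up ring topology: a -> b -> c -> a.
a, b, c = Node("a"), Node("b"), Node("c")
a.neighbors, b.neighbors, c.neighbors = [b], [c], [a]
link = "magnet:?xt=urn:btih:EXAMPLE"  # placeholder, not a real info hash
a.hear(link)
data, method = fetch_with_fallback(link, torrent_never_finishes, eva_transfer)
print(method, [node.name for node in (a, b, c) if link in node.known_links])
```

The `known_links` check stops a link from circulating forever in a network with cycles, while still letting every node hear about every file eventually.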

sisko444 commented 2 years ago

PDF selection from internal storage is implemented.
The local file storage is now implemented to work with persistence.
The parsing of PDFs is now mostly run in a coroutine; the part that isn't still can't be, for unknown reasons. (It doesn't block anymore in the simulator and shows about 10 seconds of black screen in a test on a phone.)
EVA was implemented: once a file is parsed, its torrent is generated and broadcast to all peers every 10 seconds.
Every 20 seconds the client attempts to download a piece of literature from a received torrent. (This is not working optimally yet; a later commit made it work less well.)

@TODO

The remote search back-end is finished, the front end is still in development.
The display of parsed documents and displaying of EVA operations is also still in development.
Make it so the entire PDF parsing is in a coroutine.
Repair whatever is hindering the performance of the downloading using the torrents. The link to the APK: https://we.tl/t-rLd6GS7Pdt

synctext commented 2 years ago

app works !!
- blocking main thread upon parsing .PDF
- no showing yet of .PDF metadata to replace "Lorem ipsum".
lots of stuff happening in background and things are coming together {hopefully} soon.
"select file to freely share and torrent around the world", functionality of the 'select file' button
exact click on magnification glass required to get inside keyword entry
Tip: replace 2 "Lorem ipsum" boxes with something useful. (progress bar when parsing) Small text "no files found yet. Please add some". Example scientific .PDF. Behaviour of MusicDAO: fills screens 2 seconds after start, 300 items after 20 seconds-ish.
Final course Pull Request: Expect to need 1+ week to get feedback, process feedback, {repeat} and get it polished.
Bonus: restrict to local files by the Superapp only. Import .PDF through typing a URL. No global files system read permission. Give user the choice between invasive permissions. They are hidden behind a button. "access files (warning requires broad permissions)".

sisko444 commented 2 years ago

The day of reckoning has come and we have to make our final pull request to the actual mother repo. To tie the current state back to the previous feedback:

App still works! No more blocking the main thread and we show the .pdf files.
Things are very much coming more together.
The upload button now shows a large warning: THIS WILL BE DISTRIBUTED above it.
The whole search bar is now clickable.
This tip is not necessary, the boxes are populated with the actual PDFs.
Sadly we were still working on it until now and there are still test cases to be solved; because of that, there have been no pull requests yet.
We now ask for permissions when the app launches for the first time, and there is the option to import PDFs through URLs as well.

Some gifs of the app functioning: local_upload peers search_in_keywords url_upload