Open kipkurui opened 6 years ago
Acknowledged. Will make another presentable diagram and upload.
@Marysteph @Silviane-m Can you handle this, please? See above; I'll provide what you need.
I have added the Roadmap for you guys. Let us focus on addressing each milestone one at a time. You can create a branch to focus on a milestone of interest. All the best.
Thanks Caleb! Will review and revert during the week. Sorry for weird back and forth.
@kipkurui I think I can use some relevance-ranking code, normally used for recommender systems, to figure out collaboration trends with machine learning. It is based on term frequency-inverse document frequency (TF-IDF), where term frequency is the count of a term divided by the length of the document. Here's some preliminary work:
`In [10]: docs = ['Ahlberg S Grace D Kiarie G Kirino Y Lindahl J']
In [11]: docs.append('Haldeman S Johnson CD Chou R Nordin M Côté P Hurwitz EL Green BN Cedraschi C Acaroğlu E Kopansky-Giles D Ameis A Adjei-Kwayisi A Ayhan S Blyth F Borenstein D Brady O Brooks P Camilleri C Castellote JM Clay MB Davatchi F Dunn R Goertz C Griffith EA Hondras M Kane EJ Lemeunier N Mayer J Mmopelwa T Modic M Moss J Mullerpatan R Muteti E Mwaniki L Ngandeu-Singwe M Outerbridge G Randhawa K Shearer H Sönmez E Torres C Torres P Verville L Vlok A Watters W 3rd Wong CC Yu H ')
In [12]: docs.append('Haldeman S Nordin M Chou R Côté P Hurwitz EL Johnson CD Randhawa K Green BN Kopansky-Giles D Acaroğlu E Ameis A Cedraschi C Aartun E Adjei-Kwayisi A Ayhan S Aziz A Bas T Blyth F Borenstein D Brady O Brooks P Camilleri C Castellote JM Clay MB Davatchi F Dudler J Dunn R Eberspaecher S Emmerich J Farcy JP Fisher-Jeffes N Goertz C Grevitt M Griffith EA Hajjaj-Hassouni N Hartvigsen J Hondras M Kane EJ Laplante J Lemeunier N Mayer J Mior S Mmopelwa T Modic M Moss J Mullerpatan R Muteti E Mwaniki L Ngandeu-Singwe M Outerbridge G Rajasekaran S Shearer H Smuck M Sönmez E Tavares P Taylor-Vaisey A Torres C Torres P van der Horst A Verville L Vialle E Kumar GV Vlok A Watters W 3rd Wong CC Wong JJ Yu H Yüksel S ')
In [13]: print(docs)
['Ahlberg S Grace D Kiarie G Kirino Y Lindahl J', 'Haldeman S Johnson CD Chou R Nordin M Côté P Hurwitz EL Green BN Cedraschi C Acaroğlu E Kopansky-Giles D Ameis A Adjei-Kwayisi A Ayhan S Blyth F Borenstein D Brady O Brooks P Camilleri C Castellote JM Clay MB Davatchi F Dunn R Goertz C Griffith EA Hondras M Kane EJ Lemeunier N Mayer J Mmopelwa T Modic M Moss J Mullerpatan R Muteti E Mwaniki L Ngandeu-Singwe M Outerbridge G Randhawa K Shearer H Sönmez E Torres C Torres P Verville L Vlok A Watters W 3rd Wong CC Yu H ', 'Haldeman S Nordin M Chou R Côté P Hurwitz EL Johnson CD Randhawa K Green BN Kopansky-Giles D Acaroğlu E Ameis A Cedraschi C Aartun E Adjei-Kwayisi A Ayhan S Aziz A Bas T Blyth F Borenstein D Brady O Brooks P Camilleri C Castellote JM Clay MB Davatchi F Dudler J Dunn R Eberspaecher S Emmerich J Farcy JP Fisher-Jeffes N Goertz C Grevitt M Griffith EA Hajjaj-Hassouni N Hartvigsen J Hondras M Kane EJ Laplante J Lemeunier N Mayer J Mior S Mmopelwa T Modic M Moss J Mullerpatan R Muteti E Mwaniki L Ngandeu-Singwe M Outerbridge G Rajasekaran S Shearer H Smuck M Sönmez E Tavares P Taylor-Vaisey A Torres C Torres P van der Horst A Verville L Vialle E Kumar GV Vlok A Watters W 3rd Wong CC Wong JJ Yu H Yüksel S ']
In [14]: from sklearn.feature_extraction.text import TfidfVectorizer
In [15]: corpus = docs
In [16]: vectorizer = TfidfVectorizer(min_df=1)
In [17]: model = vectorizer.fit_transform(corpus)
In [18]: print(model.todense().round(2))
[[0. 0. 0. 0. 0.45 0. 0. 0. 0. 0. 0. 0. 0. 0.
{'ahlberg': 3, 'grace': 26, 'kiarie': 35, 'kirino': 36, 'lindahl': 40, 'haldeman': 29, 'johnson': 33, 'cd': 14, 'chou': 16, 'nordin': 50, 'côté': 18, 'hurwitz': 31, 'el': 23, 'green': 27, 'bn': 7, 'cedraschi': 15, 'acaroğlu': 1, 'kopansky': 37, 'giles': 24, 'ameis': 4, 'adjei': 2, 'kwayisi': 38, 'ayhan': 5, 'blyth': 6, 'borenstein': 8, 'brady': 9, 'brooks': 10, 'camilleri': 11, 'castellote': 12, 'jm': 32, 'clay': 17, 'mb': 42, 'davatchi': 19, 'dunn': 20, 'goertz': 25, 'griffith': 28, 'ea': 21, 'hondras': 30, 'kane': 34, 'ej': 22, 'lemeunier': 39, 'mayer': 41, 'mmopelwa': 43, 'modic': 44, 'moss': 45, 'mullerpatan': 46, 'muteti': 47, 'mwaniki': 48, 'ngandeu': 49, 'singwe': 54, 'outerbridge': 51, 'randhawa': 52, 'shearer': 53, 'sönmez': 55, 'torres': 56, 'verville': 57, 'vlok': 58, 'watters': 59, '3rd': 0, 'wong': 60, 'cc': 13, 'yu': 61}
`
TF-IDF weights each name by how often it appears in an author list, scaled down by how many lists contain it, so 0.45 is the weight of one name in the first list rather than a probability. I'll write better code that lets you filter the matrix and see which name each weight belongs to. How's that?
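For reference, a minimal sketch of how the weights could be tied back to names, assuming scikit-learn's `TfidfVectorizer` with its default tokenizer. The helper name `top_names` and the shortened author lists are made up for illustration; they are not the code in the repo.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def top_names(docs, n=5):
    """Return, for each author list, its n highest-weighted tokens and weights."""
    vectorizer = TfidfVectorizer(min_df=1)
    matrix = vectorizer.fit_transform(docs)
    # vocabulary_ maps token -> column index; invert it to label the columns.
    index_to_token = {idx: tok for tok, idx in vectorizer.vocabulary_.items()}
    results = []
    for weights in matrix.toarray():
        # Keep only tokens present in this list, highest weight first
        # (ties broken alphabetically so the order is deterministic).
        ranked = sorted(
            ((index_to_token[i], round(float(w), 2))
             for i, w in enumerate(weights) if w > 0),
            key=lambda pair: (-pair[1], pair[0]),
        )
        results.append(ranked[:n])
    return results


# Shortened, illustrative author lists (not the full lists from the session).
docs = [
    "Ahlberg S Grace D Kiarie G Kirino Y Lindahl J",
    "Haldeman S Johnson CD Chou R Nordin M",
    "Haldeman S Nordin M Chou R Aartun E",
]

for doc_index, ranked in enumerate(top_names(docs)):
    print(doc_index, ranked)
```

Two things to keep in mind when reading the output. The default tokenizer drops single-letter initials and keeps two-letter ones such as `cd` as separate tokens, so surnames and initials are scored independently. And a higher weight marks a token that is rarer across the lists: each of the five names in the first list gets 0.45 because none of them appears in any other list.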
This sounds like a good idea. What is the input? Authors of the paper? In terms of collaboration trends, we are more interested in inter-institutional collaborations.
Link. Then take the author list column. I'll refine this further, though it will take some digging to find out the authors' institutions and hence the inter-institutional collaborations...
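Once institutions are known, counting inter-institutional collaborations could look roughly like the sketch below. Everything here is an assumption: the author-to-institution mapping does not exist yet, and the authors, institutions and papers are placeholders rather than real data.

```python
from collections import Counter
from itertools import combinations


def institution_pairs(papers, author_institution):
    """Count how many papers each pair of distinct institutions shares."""
    counts = Counter()
    for authors in papers:
        # Distinct institutions on this paper; authors with no known
        # institution are skipped.
        institutions = {
            author_institution[a] for a in authors if a in author_institution
        }
        # Sort so that (A, B) and (B, A) are counted as the same pair.
        for pair in combinations(sorted(institutions), 2):
            counts[pair] += 1
    return counts


# Placeholder data, purely for illustration.
author_institution = {
    "Author 1": "Institution A",
    "Author 2": "Institution B",
    "Author 3": "Institution A",
    "Author 4": "Institution C",
}
papers = [
    ["Author 1", "Author 2", "Author 3"],
    ["Author 4"],
    ["Author 2", "Author 4", "Author 5"],
]

print(institution_pairs(papers, author_institution).most_common())
```

A paper whose authors all share one institution adds nothing to the counts, which matches the interest in collaborations between institutions rather than within them.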
Actually, I have another method I want to try. Will give feedback soon.
Great, I look forward to it.
Sorry, I forgot about this. I just rewrote it. Preliminary results already reveal collaboration trends, especially for the author Obonyo M. I need help going through the first 20 TF-IDF weight matrices and checking each against the name it originally belonged to.
Hi @Shuyib, great work, thanks. I was looking at this but find it hard to follow since there is so much data. Could you show just a few rows? We also need to find a way of visualising the output.
Also @Shuyib, please send a pull request of any branch you feel is ready to combine with the master. Let's remove feature branches and harmonize the repo.
OK, thanks. Actually, it's a one-liner to visualize it. I'll hopefully update master on Friday or Saturday and show a few rows, as you've suggested.
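One possible way to show only a few rows, assuming the weights are available as a plain matrix plus a list of column labels. This is a sketch with pandas, not the notebook's actual code; `few_rows` and the toy values are hypothetical.

```python
import pandas as pd


def few_rows(weights, tokens, k=5):
    """Flatten a weight matrix to (doc, token, weight) rows and keep the top k."""
    wide = pd.DataFrame(weights, columns=tokens)
    long = (
        wide.rename_axis("doc")
        .reset_index()
        .melt(id_vars="doc", var_name="token", value_name="weight")
    )
    # Drop zero weights, which are just absent names, then keep the largest.
    long = long[long["weight"] > 0]
    return (
        long.sort_values("weight", ascending=False)
        .head(k)
        .reset_index(drop=True)
    )


# Toy values for illustration only.
tokens = ["a", "b", "c"]
weights = [[0.0, 0.9, 0.4], [0.7, 0.0, 0.0]]

print(few_rows(weights, tokens, k=2))
```

The long format keeps one weight per row, so the same table can be passed to a plotting call without first reshaping it.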
@kipkurui There's an HTML file I provided, called output_author+institution.html, with the plots and part of the dataframe. Did you have a look at it?
@kipkurui I have sampled the rows and cleaned up the notebook. For some reason I thought my branch was absent and made another one. Please review the branch collab-graphs.
Thanks Ben, let me have a look
On Thu, May 7, 2020 at 9:36 AM Ben Mainye notifications@github.com wrote:
@kipkurui https://github.com/kipkurui I have sampled the rows and cleaned up the notebook. For some reason I thought my branch was absent and made another one. Please review the branch collab-graphs.
Status of Open Science in Kenya: Data mining
The purpose of this project is to explore the adoption of open science practices: open access, open data and open source. For this project, we download all the papers published by Kenyan authors and figure out collaboration trends and whether the papers are openly accessible.
Milestones reached
Target Milestones