BioinfoNet / Data-mining

Data mining to discover trends in Open Science in Kenya

Roadmap #11

Open kipkurui opened 5 years ago

kipkurui commented 5 years ago

Status of Open Science in Kenya: Data mining

The purpose of this project is to explore the adoption of open science practices: open access, open data and open source. For this project, we download all the papers published by Kenyan authors and examine collaboration trends and whether the papers are openly accessible.

Milestones reached

Target Milestones

  1. Answer the following hackathon questions:
    • [ ] Collaboration Trends: Are Kenyan Authors collaborating within or without?
    • [ ] Open data and code: Are Kenyan authors making the data and code available and accessible? Are they adhering to the FAIR principles?
    • [x] Open access trends: How has the adoption of open access practices changed over time? Are Kenyan authors publishing open access?
    • [x] Preprints: Are Kenyan authors using preprints? If so, who is driving their adoption? Is it the local or foreign collaborators?
  2. Collaborative Paper
    • [ ] Include the results of this project in the collaborative hack paper

Shuyib commented 5 years ago

Acknowledged. Will make another presentable diagram and upload.

Shuyib commented 5 years ago

@Marysteph @Silviane-m Can you handle this, please? See above; I'll provide what you need.

kipkurui commented 5 years ago

I have added the Roadmap for you guys. Let us focus on addressing each milestone one at a time. You can create a branch to focus on a milestone of interest. All the best.

Shuyib commented 5 years ago

Thanks Caleb! Will review and revert during the week. Sorry for the weird back and forth.

Shuyib commented 5 years ago

@kipkurui I think I can use some relevance-ranking code, normally used for recommender systems, to figure out collaboration trends with machine learning. It relies on term frequency-inverse document frequency (TF-IDF), where the term frequency is a term's count divided by the length of the document. Here's some preliminary work:

`In [10]: docs = ['Ahlberg S Grace D Kiarie G Kirino Y Lindahl J']

In [11]: docs.append('Haldeman S Johnson CD Chou R Nordin M Côté P Hurwitz EL Green BN Cedraschi C Acaroğlu E Kopansky-Giles D Ameis A Adjei-Kwayisi A Ayhan S Blyth F Borenstein D Brady O Brooks P Camilleri C Castellote JM Clay MB Davatchi F Dunn R Goertz C Griffith EA Hondras M Kane EJ Lemeunier N Mayer J Mmopelwa T Modic M Moss J Mullerpatan R Muteti E Mwaniki L Ngandeu-Singwe M Outerbridge G Randhawa K Shearer H Sönmez E Torres C Torres P Verville L Vlok A Watters W 3rd Wong CC Yu H ')

In [12]: docs.append('Haldeman S Nordin M Chou R Côté P Hurwitz EL Johnson CD Randhawa K Green BN Kopansky-Giles D Acaroğlu E Ameis A Cedraschi C Aartun E Adjei-Kwayisi A Ayhan S Aziz A Bas T Blyth F Borenstein D Brady O Brooks P Camilleri C Castellote JM Clay MB Davatchi F Dudler J Dunn R Eberspaecher S Emmerich J Farcy JP Fisher-Jeffes N Goertz C Grevitt M Griffith EA Hajjaj-Hassouni N Hartvigsen J Hondras M Kane EJ Laplante J Lemeunier N Mayer J Mior S Mmopelwa T Modic M Moss J Mullerpatan R Muteti E Mwaniki L Ngandeu-Singwe M Outerbridge G Rajasekaran S Shearer H Smuck M Sönmez E Tavares P Taylor-Vaisey A Torres C Torres P van der Horst A Verville L Vialle E Kumar GV Vlok A Watters W 3rd Wong CC Wong JJ Yu H Yüksel S ')

In [14]: from sklearn.feature_extraction.text import TfidfVectorizer

In [15]: corpus = docs

In [16]: vectorizer = TfidfVectorizer(min_df=1)

In [17]: model = vectorizer.fit_transform(corpus)

In [18]: print(model.todense().round(2))
[[0.   0.   0.   0.   0.45 0.   ...]
 [0.13 0.   0.13 0.13 0.   0.13 ...]
 [0.09 0.12 0.09 0.09 0.   0.09 ...]]

{'ahlberg': 3, 'grace': 26, 'kiarie': 35, 'kirino': 36, 'lindahl': 40, 'haldeman': 29, 'johnson': 33, 'cd': 14, 'chou': 16, 'nordin': 50, 'côté': 18, 'hurwitz': 31, 'el': 23, 'green': 27, 'bn': 7, 'cedraschi': 15, 'acaroğlu': 1, 'kopansky': 37, 'giles': 24, 'ameis': 4, 'adjei': 2, 'kwayisi': 38, 'ayhan': 5, 'blyth': 6, 'borenstein': 8, 'brady': 9, 'brooks': 10, 'camilleri': 11, 'castellote': 12, 'jm': 32, 'clay': 17, 'mb': 42, 'davatchi': 19, 'dunn': 20, 'goertz': 25, 'griffith': 28, 'ea': 21, 'hondras': 30, 'kane': 34, 'ej': 22, 'lemeunier': 39, 'mayer': 41, 'mmopelwa': 43, 'modic': 44, 'moss': 45, 'mullerpatan': 46, 'muteti': 47, 'mwaniki': 48, 'ngandeu': 49, 'singwe': 54, 'outerbridge': 51, 'randhawa': 52, 'shearer': 53, 'sönmez': 55, 'torres': 56, 'verville': 57, 'vlok': 58, 'watters': 59, '3rd': 0, 'wong': 60, 'cc': 13, 'yu': 61}

`

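The weights measure how distinctive a name is to one author list rather than a probability: a name scores higher the more often it occurs in a list and lower the more lists it occurs in. The 0.45 values in the first row belong to the five surnames of the first paper, none of which appear in the other two lists; names shared between lists get lower weights (0.09 rather than 0.12 in the third row). I'll write better code that lets you filter the results and see which name each weight belongs to. How's that?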
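To make the weighting concrete, here is a minimal sketch of the textbook TF-IDF calculation in plain Python, assuming each paper is a list of full author names. The `tf_idf` helper and the shortened author lists are illustrative only. `TfidfVectorizer` additionally smooths the IDF term and L2-normalises each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter


def tf_idf(author_lists):
    """Return one {author: weight} dict per paper (textbook TF-IDF)."""
    n_papers = len(author_lists)
    # Document frequency: the number of papers each author appears on.
    df = Counter(a for authors in author_lists for a in set(authors))
    weights = []
    for authors in author_lists:
        counts = Counter(authors)
        weights.append({
            # Term frequency (count / list length) times log(N / df).
            author: (count / len(authors)) * math.log(n_papers / df[author])
            for author, count in counts.items()
        })
    return weights


# Shortened author lists, for illustration only.
papers = [
    ["Ahlberg S", "Grace D", "Kiarie G", "Kirino Y", "Lindahl J"],
    ["Haldeman S", "Chou R"],
    ["Haldeman S", "Chou R", "Aartun E"],
]
weights = tf_idf(papers)
for row in weights:
    print({name: round(w, 2) for name, w in row.items()})
```

In the third list, "Aartun E" gets a higher weight than "Haldeman S" because Haldeman also appears on the second paper. A recurring collaborator therefore shows up as a lower weight, not a higher one.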

kipkurui commented 5 years ago

This sounds like a good idea. What is the input? Authors of the paper? In terms of collaboration trends, we are more interested in inter-institutional collaborations.

Shuyib commented 5 years ago

Link. Then take the author list column. I'll refine this further, though it will require some digging to find out the authors' institutions and, from those, the inter-institutional collaborations...
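Once affiliations are known, inter-institutional collaboration can be counted directly. The following is a minimal sketch, assuming each paper record carries a list of author affiliations. The record layout, the `institution_pairs` helper, and the institution names are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: each paper lists the affiliation of every author.
papers = [
    {"title": "Paper A", "affiliations": ["Institution X", "Institution Y"]},
    {"title": "Paper B", "affiliations": ["Institution X", "Institution X"]},
    {"title": "Paper C",
     "affiliations": ["Institution Y", "Institution Z", "Institution X"]},
]


def institution_pairs(papers):
    """Count how many papers each pair of distinct institutions shares."""
    pair_counts = Counter()
    for paper in papers:
        # Deduplicate and sort so each pair is counted once per paper
        # and always appears in the same order.
        institutions = sorted(set(paper["affiliations"]))
        pair_counts.update(combinations(institutions, 2))
    return pair_counts


pair_counts = institution_pairs(papers)
for pair, count in pair_counts.most_common():
    print(pair, count)
```

A paper whose authors all share one institution (Paper B here) contributes no pairs, which separates within-institution work from inter-institutional collaboration.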

Shuyib commented 5 years ago

Actually, I have another method I want to try. I will give feedback soon.

kipkurui commented 5 years ago

Great, I look forward to it.

Shuyib commented 4 years ago

Sorry, I forgot about this. I just rewrote it. The preliminary results are already revealing collaboration trends, especially for the author Obonyo M. I need help checking the first 20 TF-IDF weight matrices against the names they originally belonged to.

kipkurui commented 4 years ago

Hi @Shuyib, great work, thanks. I was looking at this, but I find it hard to follow since there is so much data. Could you show just a few rows? We also need to find a way of visualising the output.
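One way to show only a few rows is to keep the top-weighted names per paper. A minimal sketch, assuming the weights are available as one name-to-weight mapping per paper; `top_names`, the paper IDs, and the values are illustrative and not taken from the notebook:

```python
def top_names(weights, k=3):
    """Return the k highest-weighted (name, weight) pairs, ties broken by name."""
    return sorted(weights.items(), key=lambda item: (-item[1], item[0]))[:k]


# Illustrative weights for two papers.
rows = {
    "paper_1": {"ahlberg": 0.45, "grace": 0.45, "kiarie": 0.45},
    "paper_3": {"haldeman": 0.09, "aartun": 0.12, "torres": 0.19, "chou": 0.09},
}

for paper_id, weights in rows.items():
    summary = ", ".join(
        f"{name}={weight:.2f}" for name, weight in top_names(weights, k=2)
    )
    print(f"{paper_id}: {summary}")
```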

kipkurui commented 4 years ago

Also @Shuyib, please send a pull request of any branch you feel is ready to combine with the master. Let's remove feature branches and harmonize the repo.

Shuyib commented 4 years ago

Ok, thanks. Actually, it's a one-liner to visualize it. I'll hopefully update master on Friday or Saturday and show a few rows, as you suggested.

Shuyib commented 4 years ago

@kipkurui There's an HTML file I provided, output_author+institution.html, with the plots and part of the dataframe. Did you have a look at it?

Shuyib commented 4 years ago

@kipkurui I have sampled the rows and cleaned up the notebook. For some reason I thought my branch was absent and made another one. Please review the branch collab-graphs.

kipkurui commented 4 years ago

Thanks Ben, let me have a look

On Thu, May 7, 2020 at 9:36 AM Ben Mainye wrote:

> @kipkurui I have sampled the rows and cleaned up the notebook. For some reason I thought my branch was absent and made another one. Please review the branch collab-graphs.
