ContentMine / workshop-resources

This repository contains material helping you to set up a ContentMine workshop. It also includes tutorials for learning the ContentMine tools on your own.

Contentmine pipeline #63

Open alexmaina opened 7 years ago

alexmaina commented 7 years ago

I have a database with a list of PMIDs. I want to mine the text of all open-access articles in this list and get the most frequently used terms, keywords, or subjects.
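For illustration, the term-counting step of such a pipeline could be sketched as follows. This is a minimal sketch in plain Python, not a ContentMine tool: the function name `top_terms` and the tiny `STOPWORDS` set are hypothetical, and it assumes the article text has already been extracted into strings.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real run would need a fuller one.
STOPWORDS = {"the", "and", "of", "in", "to", "a", "is", "for", "with", "that"}


def top_terms(texts, n=10):
    """Return the n most frequent terms across an iterable of text strings."""
    counts = Counter()
    for text in texts:
        # Lowercase the text and keep runs of ASCII letters only.
        words = re.findall(r"[a-z]+", text.lower())
        # Skip stopwords and words of one or two letters.
        counts.update(w for w in words if len(w) > 2 and w not in STOPWORDS)
    return counts.most_common(n)


if __name__ == "__main__":
    sample = [
        "Malaria parasites infect cells",
        "malaria vaccine trials",
        "the malaria and the vaccine",
    ]
    print(top_terms(sample, 2))
```

Counting single words like this ignores multi-word terms and word variants, so it is only a starting point for finding keywords or subjects.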

I have tested getpapers and seen how powerful and efficient it is at getting papers. I then moved on to quickscrape and tried downloading PDFs based on the URL list in the eupmc_fulltext_html_urls.txt file that getpapers outputs.
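As a rough sketch of how a PMID list might be turned into a single getpapers call, the helper below builds the command as an argument list without running it. The helper name `build_getpapers_command` is hypothetical, and the query assumes Europe PMC's `EXT_ID:` and `SRC:MED` search fields; both the field syntax and the getpapers options should be checked against the current documentation before use.

```python
def build_getpapers_command(pmids, outdir, want_pdf=False):
    """Build (but do not run) a getpapers command for a list of PMIDs.

    Assumption: Europe PMC accepts EXT_ID:<pmid> terms joined with OR,
    restricted to PubMed/MEDLINE records with SRC:MED.
    """
    ids = [str(p).strip() for p in pmids if str(p).strip()]
    if not ids:
        raise ValueError("no PMIDs given")
    query = "(" + " OR ".join(f"EXT_ID:{i}" for i in ids) + ") AND SRC:MED"
    # --xml requests full-text XML; --pdf additionally requests PDFs.
    cmd = ["getpapers", "--query", query, "--outdir", outdir, "--xml"]
    if want_pdf:
        cmd.append("--pdf")
    return cmd


if __name__ == "__main__":
    # Placeholder IDs for illustration only.
    print(build_getpapers_command([123, 456], "pmid_papers"))
```

The returned list can be passed to `subprocess.run` once the query has been checked by hand. A long PMID list would probably need to be split into batches, since a single query string cannot grow without limit.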

Since I can use the -p option in a getpapers query to download PDFs, why should I use quickscrape?

Also, in this video, from the 1:29 mark, Peter-Murray is able to skim through PDFs quite easily. How does he do that? I am using an Ubuntu 14.04 LTS box; how can I skim through PDFs like that on Ubuntu?

Still on the video, at the 2:23 mark, Peter-Murray writes what seems like Java code to filter the files for sequences and key terms. Which tool is he using to do that? Is it part of the ContentMine API?

I am not sure whether what I have written above qualifies as an issue, but I am really keen to understand ContentMine and how best I can use it for my project.

Thanks

AM