This repository contains material helping you to set up a ContentMine workshop. It also includes tutorials for learning the ContentMine tools on your own.
Feedback from 2015-12-10-lifesciences workshop #52
General software functionality
Lots of people wanted to use the tools with their existing PDF collections; this could be a specific module/tutorial. Need @petermr's input so that people know how to try it but are aware that results could be patchy.
Just a proposal: all the tools could be integrated into a container platform like Docker. That way users would not have to run a virtual machine or install individual tools.
Not sure if already noted: powering off the VM (e.g. after lunch) or opening an editor (e.g. Geany) will reset the keyboard input to non-US.
QS/Getpapers
Lots of Qs about APIs/scrapers; the relevant links are not obvious without reading through all of the software tutorial text. Could have a 'related links' section at the top of the repo as well as the bottom, or 'Are you looking for...?' links.
Trainer note: be clear that while participants are welcome to try their own material, they should follow all of the tutorials in order so that their downloaded papers can be used in the norma/AMI tutorials. Alternatively, provide bundles of input and output files from each step in case a step fails.
The software tutorials should flag that participants using the VM don't need to install the software, and that they can use `npm` rather than `sudo npm`.
People wanted to work out how many papers they'd found before downloading them: `getpapers --noexecute --query 'dinosaurs'` etc.
If you get too many papers, try narrowing the date range until the count is manageable (see the getpapers and EPMC query syntax). In our experience downloads run at roughly 300 papers/minute.
To kill getpapers, press Ctrl-C (possibly twice).
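The two tips above can be combined into a count-first workflow. A hedged sketch; the `PUB_YEAR` field is EPMC query syntax, which may change between API versions:

```shell
# Count hits without downloading anything (--noexecute only queries the API)
getpapers --noexecute --query 'dinosaurs'

# Narrow by date using EPMC field syntax until the count is manageable
getpapers --noexecute --query 'dinosaurs AND PUB_YEAR:[2014 TO 2015]'

# When the count looks right, drop --noexecute, name an output
# directory and fetch the XML full text
getpapers --query 'dinosaurs AND PUB_YEAR:[2014 TO 2015]' --outdir dinosaurs --xml
```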
Qs
What exactly is the results .json file?
Can we restrict the number of items to download?
Can we use Boolean queries with multiple search terms to narrow results at the getpapers stage?
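Hedged sketches toward the last two questions, assuming current getpapers behaviour (the `--limit` cap and EPMC's Boolean operators; check `getpapers --help` for your installed version):

```shell
# Cap the number of papers downloaded (flag name assumed; see getpapers --help)
getpapers --query 'dinosaurs' --limit 50 --outdir dinosaurs --xml

# EPMC accepts Boolean operators, quoted phrases and parentheses in the query
getpapers --noexecute --query '(dinosaurs OR "fossil reptiles") AND teeth'
```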
Norma
Some people found that norma works when running getpapers WITHOUT quickscrape, but that running quickscrape between getpapers and norma caused issues.
Qs
Where can we find a list of all the `--transform` arguments?
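A partial, hedged answer: `norma --help` prints the accepted options, and the transform used throughout the tutorials is `nlm2html`, e.g.:

```shell
# List norma's options, including the accepted --transform values
norma --help

# The tutorials' standard step: EPMC fulltext.xml -> scholarly.html
norma --project dinosaurs --input fulltext.xml --output scholarly.html --transform nlm2html
```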
AMI
Find XML results files with data, grouped by ami command: `ls -lht */*/*/*/*/*.xml`
Lots of participants very quickly wanted to summarise results.xml; could include the work by Matthew Thomas after the Wellcome workshop (link).
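Until such a summary exists, a minimal sketch, assuming the standard CTree layout `<project>/<id>/results/<plugin>/<type>/results.xml` (here with a project directory named `dinosaurs`):

```shell
#!/bin/sh
# Tally result elements per results.xml across a project (layout assumed).
# grep counts lines containing a <result> or <result .../> element.
for f in dinosaurs/*/results/*/*/results.xml; do
  [ -e "$f" ] || continue          # skip if the glob matched nothing
  printf '%6d  %s\n' "$(grep -c '<result[ />]' "$f")" "$f"
done
```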
RegEx tutorial: could you please give example results? I can run up to the second-to-last step and get the regex and results folders, but all the files I opened manually are empty. How many results should there be for 'food'?
Documentation
Some specific comments that formatting all commands as code would be useful (I don't know how much code is not formatted as such; it might only have been a few lines).
One participant commented 'Generally got a bit lost navigating between materials on GitHub pages'
Overall feedback (personal, subjective): I got lost often, because I do not have a conceptual map of which step takes what input and creates what output once the very simple steps (getpapers, scrape) were done. Vaguely getting there, but the pace was too fast. I was also often uncertain which directory I needed to be in and how to access output. Maybe provide a hard copy of the pipeline for the steps in the workshop and a rough 'directory' map of where stuff ends up (assuming the user records steps and doesn't change locations).
One issue: if the tutorial was carried out on your own search (i.e. not on dinosaurs), you can no longer carry out later tutorials. E.g. I ran norma on a different topic, and now I cannot follow ami2-species as I have no species in my text. Could you provide intermediate output folders, so that if you have to skip forward in the tutorials you can still follow (in future courses)? I have scholarly.html, but that doesn't contain species info.
Overall, the instructions are not very clear to me. Some command-line commands are shown as images (e.g. in the tree example), others in plain text. The steps are sequential, so you cannot follow later stages if you get stuck anywhere (not so great).
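Toward the requested pipeline map, a rough sketch of what each step reads and writes (hedged; exact paths and flags may differ by version):

```shell
# ContentMine pipeline map (run everything from the directory holding <project>)
#
#   getpapers --query ... --outdir <project> --xml
#       writes  <project>/<id>/fulltext.xml       (plus a results .json file)
#   norma --project <project> --input fulltext.xml --output scholarly.html --transform nlm2html
#       writes  <project>/<id>/scholarly.html
#   ami2-* plugins (species, regex, ...) run on the project
#       write   <project>/<id>/results/<plugin>/<type>/results.xml
```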
See repo and pad