tarrow opened this issue 8 years ago
An example output folder after scraping two random papers, one open access and one not:
output
├── http_ijs.microbiologyresearch.org_content_journal_ijsem_10.1099_ijsem.0.001085
│   └── results.json
└── https_elifesciences.org_content_5_e10647v3
    ├── fulltext.pdf
    ├── fulltext.xml
    └── results.json
An example output folder showing two papers from EuPMC. Notice that in this case the results file eupmc_results.json is one big blob for the whole project rather than one file in each folder as in quickscrape. I'm not sure this is a good thing, because it means the per-paper folder structure is missing a chunk of information: the blob has to be kept with the right CProject so that this data isn't lost. It could all be split up and placed in a per-paper results.json, like quickscrape does. It also means that the data is stored in a file whose name changes depending on which API it was obtained from.
output
├── eupmc_fulltext_html_urls.txt
├── eupmc_results.json
├── PMC4683095
│   └── fulltext.xml
└── PMC4690148
    └── fulltext.xml
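For illustration, the splitting could be done along these lines -- a minimal sketch assuming eupmc_results.json is a JSON array of per-article records, each carrying a pmcid field matching the folder name (both assumptions, not guaranteed by getpapers):

```python
import json
from pathlib import Path

def split_eupmc_results(project_dir):
    """Split the project-wide eupmc_results.json into one results.json
    per paper folder, keyed on the (assumed) pmcid field."""
    project = Path(project_dir)
    records = json.loads((project / "eupmc_results.json").read_text())
    for record in records:
        pmcid = record.get("pmcid")  # assumed field name
        if pmcid is None:
            continue  # skip records we cannot place in a folder
        folder = project / pmcid
        if folder.is_dir():
            (folder / "results.json").write_text(json.dumps(record, indent=2))

split_eupmc_results("output")
```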
Agreed this is important. The key thing is to find unique ids if possible, which is not always easy. Quickscrape uses URLs, and these clearly won't guarantee uniqueness.
Getpapers is normally pointed at a repository, but I'm not sure whether, say, arXiv has unique ids (remember DOIs don't work for all documents).
Quickscrape can now also use incrementing integers as the directory names.
arXiv does have unique ids, e.g. in http://arxiv.org/pdf/1601.00900v1.pdf the id is 1601.00900v1 (and the v1 part is the version).
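A quick sketch of pulling that id apart (new-style arXiv ids only; the regex is mine, not part of any of our tools):

```python
import re

# Match new-style arXiv ids in PDF URLs: full id, base id, and version.
ARXIV_PDF = re.compile(r"arxiv\.org/pdf/((\d{4}\.\d{4,5})(v\d+))")

m = ARXIV_PDF.search("http://arxiv.org/pdf/1601.00900v1.pdf")
if m:
    full_id, base_id, version = m.groups()
    print(full_id)   # 1601.00900v1
    print(base_id)   # 1601.00900
    print(version)   # v1
```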
Norma can read input from files ending in the following extensions:
html, hocr.html, svg, pdf, xml, xhtml
I think it can currently read from whatever file is specified with -i, but it may fail if the file doesn't meet a complex criterion called a 'reserved name'. This means the file must either be named one of the following:
abstract.html, empty.xml, fulltext.docx, fulltext.html, fulltext.pdf, fulltext.pdf.txt, fulltext.tex, fulltext.tex.html, fulltext.txt, fulltext.txt.html, fulltext.xhtml, fulltext.xml, log.xml, results.json, results.xml, results.html, scholarly.html
or it must be a file of any name within a reserved directory with a name from this list:
image, results, pdf, supplementary, svg
I think this means that a command like norma -i image/myamazingpaper.pdf -o fulltext.pdf.txt --transform pdf2txt would be valid, but norma -i myamazingpaper.pdf -o fulltext.pdf.txt --transform pdf2txt would fail.
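Here is that rule as I read it, sketched in Python (the lists come from above; the logic is my interpretation, not norma's actual code):

```python
from pathlib import Path

RESERVED_FILES = {
    "abstract.html", "empty.xml", "fulltext.docx", "fulltext.html",
    "fulltext.pdf", "fulltext.pdf.txt", "fulltext.tex", "fulltext.tex.html",
    "fulltext.txt", "fulltext.txt.html", "fulltext.xhtml", "fulltext.xml",
    "log.xml", "results.json", "results.xml", "results.html", "scholarly.html",
}
RESERVED_DIRS = {"image", "results", "pdf", "supplementary", "svg"}

def is_reserved(path):
    """My reading of the 'reserved name' rule: either the file name itself
    is reserved, or the file sits somewhere inside a reserved directory."""
    p = Path(path)
    return p.name in RESERVED_FILES or bool(RESERVED_DIRS & set(p.parts[:-1]))

print(is_reserved("image/myamazingpaper.pdf"))  # True: reserved directory
print(is_reserved("myamazingpaper.pdf"))        # False: would fail
```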
In terms of nlm2html, Norma will happily output to wherever you ask it to, with whatever extension you desire. For example, fulltextbloog.hftml is perfectly acceptable.
Just using the new cmine command, we have an expected input file called scholarly.html. This gives us many things as output into the CProject's CTrees. For example:
├── PMC4831192
│   ├── fulltext.xml
│   ├── gene.human.count.xml
│   ├── gene.human.snippets.xml
│   ├── results
│   │   ├── gene
│   │   │   └── human
│   │   │       └── empty.xml
│   │   ├── sequence
│   │   │   └── dnaprimer
│   │   │       └── empty.xml
│   │   ├── species
│   │   │   ├── binomial
│   │   │   │   └── results.xml
│   │   │   └── genus
│   │   │       └── empty.xml
│   │   └── word
│   │       └── frequencies
│   │           ├── results.html
│   │           └── results.xml
│   ├── scholarly.html
│   ├── sequence.dnaprimer.count.xml
│   ├── sequence.dnaprimer.snippets.xml
│   ├── species.binomial.count.xml
│   ├── species.binomial.snippets.xml
│   ├── species.genus.count.xml
│   ├── species.genus.snippets.xml
│   ├── word.frequencies.count.xml
│   └── word.frequencies.snippets.xml
├── sequence.dnaprimer.count.xml
├── sequence.dnaprimer.documents.xml
├── sequence.dnaprimer.snippets.xml
├── species.binomial.count.xml
├── species.binomial.documents.xml
├── species.binomial.snippets.xml
├── species.genus.count.xml
├── species.genus.documents.xml
├── species.genus.snippets.xml
├── word.frequencies.count.xml
├── word.frequencies.documents.xml
└── word.frequencies.snippets.xml
We can use the 'old style' ami commands, for example ami2-gene --g.gene --g.type human --project malaria, to perform an analysis which results in a tree like:
└── PMC4831192
    ├── fulltext.xml
    ├── results
    │   └── gene
    │       └── human
    │           └── empty.xml OR results.xml
    └── scholarly.html
We don't normally get snippets or summaries running in this way. The way of generating summaries mentioned in the workshop notes no longer works (--analyze is an unknown argument). However, we can generate a snippets file using a command of the form: ami2-sequence --filter file\(\*\*/results.xml\) --project malaria2 -o summaryfile.xml
However, it looks like we typically make a plugin.option.snippets.xml when this kind of code is called from cmine or from some of the other argProcessors. This is then processed into summary files, with the snippets files as the input, typically producing a plugin.option.count.xml and a plugin.option.documents.xml. I'm not actually sure exactly what these are; I think they are the total repeats of a given match and the number of documents that contained one or more of those matches.
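If that guess is right, the two files could be derived from the snippets roughly like this (a sketch of the idea, not cmine's implementation; the input shape is assumed):

```python
from collections import Counter

def summarise(matches_by_document):
    """matches_by_document maps a CTree id to the list of matched terms
    found in that document (e.g. gene names)."""
    count = Counter()      # total repeats of each match across the project
    documents = Counter()  # number of documents containing >= 1 match
    for doc_id, matches in matches_by_document.items():
        count.update(matches)
        documents.update(set(matches))  # each document counted once
    return count, documents

count, documents = summarise({
    "PMC4831192": ["BRCA1", "BRCA1", "TP53"],
    "PMC4683095": ["BRCA1"],
})
print(count["BRCA1"])      # 3 total occurrences
print(documents["BRCA1"])  # found in 2 documents
```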
This simply looks for the following files: PaperFolder/scholarly.html and PaperFolder/results/plugin/option/results.xml; however, it uses the terminology 'type' rather than 'option'.
It seems to look for any XML file whose path contains either 'frequency' or 'binomial' anywhere.
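In other words, something like this (my paraphrase of the behaviour, not the actual source):

```python
from pathlib import Path

def candidate_files(project_dir):
    """Yield every XML file whose path contains 'frequency' or 'binomial',
    mirroring the selection rule described above."""
    for path in Path(project_dir).rglob("*.xml"):
        if "frequency" in str(path) or "binomial" in str(path):
            yield path
```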
It writes to these files (to quote):
The program will write three JSON files containing nodes and edge lists:
- words.json -- the top X most frequent words and the articles in which they appear
- words_tdidf.json -- same as above but calculated using term frequency-inverse document frequency (TF-IDF)
- species.json -- occurrences of binomial species names and the articles in which they appear
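Purely for illustration, such a nodes-and-edges file might be shaped like this (a guess at the structure, not the tool's documented schema):

```python
import json

# Invented example shape: word and article nodes plus a weighted edge list.
words = {
    "nodes": [
        {"id": "malaria", "type": "word"},
        {"id": "PMC4831192", "type": "article"},
    ],
    "edges": [
        {"source": "malaria", "target": "PMC4831192", "weight": 17},
    ],
}
print(json.dumps(words, indent=2))
```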
I think throughout this we have found three different classes of files in the CProject/CTree as people currently use it.
We should keep information that is tied to just one paper, and depends only on that one paper, in a single paper folder as much as possible. If possible we should include the bibliographic metadata that we get from quickscrape and getpapers in the paper folder rather than globally. The python CProject library parses the scholarly HTML to extract this bibliographic metadata and could also save it there if wanted.
Information that depends on more than one paper should be kept outside of these paper folders (obviously). But it still needs to be ordered in some way. Peter suggests in #5 that we have a summary folder in the root. I think this is a good idea: the more we put into the root folder, the more we have to figure out what is and isn't a paper folder.
IMHO the logical rule is to have ami, or other tools that are (and only are) extracting facts from papers, place data into project/paper/results/plugin[/option]/results.xml. To be honest I don't really see that we need to restrict it to results.xml (for example, bag-of-words makes HTML). It could be JSON, plain text, whatever you want.
We should then have all CProject-wide summaries written to a summary folder in the root of the CProject. For example /project/summary/summarisername/projectWordCloud.html, /project/summary/summarisername2/full.dataTables.html and so on.
All that matters, I think, is that we don't allow people to have plugins with the same name (or they can trample on each other's data). Similarly we don't want a summarisername to be duplicated, for the same reason.
We can then allow people to claim plugin names and state (to whatever level of precision they like) what the plugin will read and what it will output. This perhaps doesn't need to be done programmatically, but there should be somewhere we say what can do what.
The same can be true of summarisers: they can only summarise the output of certain plugin(s). Currently some bits of ami do summaries, but I think that they should not, and this ought to be pulled out into another program.
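One lightweight way to let people claim names would be a shared registry; a hypothetical sketch (the structure, fields, and entries are all invented for illustration):

```python
# Hypothetical registry mapping a claimed plugin name to what it reads
# and what it writes; checked before a new plugin name is accepted.
PLUGIN_REGISTRY = {
    "species": {
        "reads": "scholarly.html",
        "writes": "results/species/<option>/results.xml",
    },
    "bag-of-words": {
        "reads": "scholarly.html",
        "writes": "results/word/frequencies/results.html",
    },
}

def claim(name, reads, writes):
    """Refuse duplicate plugin names so plugins can't trample each other."""
    if name in PLUGIN_REGISTRY:
        raise ValueError(f"plugin name {name!r} is already claimed")
    PLUGIN_REGISTRY[name] = {"reads": reads, "writes": writes}
```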
Also, I'm not sure if this will link him in, but I noticed that @robintw seems to have been writing some Python code to read CProjects. I will try to get in contact to find out what structure he currently relies on and whether he has suggestions.
Peter and I had a chat today and we proposed something like this as a layout for the CProject:
exampleCProject
├── PMC1234
│   ├── metadata
│   │   ├── article
│   │   │   └── metadata.xml
│   │   ├── audit
│   │   │   └── metadata.xml
│   │   ├── derived
│   │   │   └── metadata.xml
│   │   └── source
│   │       └── metadata.xml
│   └── results
│       └── plugin
│           └── option
│               └── results.xml
└── summary
    └── summariser
        ├── full.dataTables.html
        └── plugin
            └── option
                └── snippets.xml
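A sketch that scaffolds this proposed per-paper layout (directory names are the placeholders from the tree above):

```python
from pathlib import Path

def scaffold_ctree(project, paper_id):
    """Create the skeleton of the layout proposed above for one paper."""
    paper = Path(project) / paper_id
    for kind in ("article", "audit", "derived", "source"):
        (paper / "metadata" / kind).mkdir(parents=True, exist_ok=True)
    (paper / "results").mkdir(exist_ok=True)
    (Path(project) / "summary").mkdir(parents=True, exist_ok=True)

scaffold_ctree("exampleCProject", "PMC1234")
```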
Thanks for looping me in on this.
At the moment, my Python code just depends on the output folder structure produced by quickscrape. That is (copied from above):
output
├── http_ijs.microbiologyresearch.org_content_journal_ijsem_10.1099_ijsem.0.001085
│   └── results.json
└── https_elifesciences.org_content_5_e10647v3
    ├── fulltext.pdf
    ├── fulltext.xml
    └── results.json
It assumes that all folders beneath output refer to a paper, and looks for results.json and then fulltext.xml, fulltext.pdf or scholarly.html inside each folder. If it can't find any of these then it just skips that folder.
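In code, that discovery logic is roughly the following (my paraphrase of the behaviour described, not the actual pyCProject source):

```python
from pathlib import Path

FULLTEXT_NAMES = ("fulltext.xml", "fulltext.pdf", "scholarly.html")

def find_papers(output_dir):
    """Yield (folder, results.json, fulltext paths) for each folder beneath
    `output` that looks like a paper; skip anything that doesn't."""
    for folder in Path(output_dir).iterdir():
        if not folder.is_dir():
            continue
        results = folder / "results.json"
        if not results.exists():
            continue  # nothing recognisable here, skip the folder
        fulltexts = [folder / n for n in FULLTEXT_NAMES if (folder / n).exists()]
        yield folder, results, fulltexts
```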
I'm a little confused by @tarrow's most recent post - I can't quite see how that links to the quickscrape output structure. I assume PMC1234 is a paper, but if so, where would the fulltext.* files go? And also, is there a particular reason for switching from JSON to XML for the metadata? I personally find JSON a lot nicer...
Typical console output from getpapers is:
localhost:projects pm286$ getpapers -q "sula bassana" -o sula -x -k 100
info: Searching using eupmc API
info: Found 4 open access results
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
info: Got XML URLs for 4 out of 4 results
info: Downloading fulltext XML files
Downloading files [==============================] 100% (4/4) [0.2s elapsed, eta 0.0]
info: All XML downloads succeeded!
Proposal: we should capture much of this in audit.
@robintw Thanks. I think we'll probably use metadata.json. We might also include metadata.xml as a syntactic variant.
@robintw I think @tarrow missed out the fulltext.* children of PMC1234 :-)
I think we're close to something workable. The naming of things needs some attention - it needs to be clear what's what if these files are sitting in the normal filesystem. We should also minimise the verbosity and the depth of the directory structure.
Using the example you gave above, @tarrow, and for now just focusing on 'metadata', the original is:
exampleCProject
├── PMC1234
│   ├── metadata
│   │   ├── article
│   │   │   └── metadata.xml
│   │   ├── audit
│   │   │   └── metadata.xml
│   │   ├── derived
│   │   │   └── metadata.xml
│   │   └── source
│   │       └── metadata.xml
First thing, I think we should eliminate the redundant directories - just name the files:
exampleCProject
├── PMC1234
│   ├── metadata
│   │   ├── article.json
│   │   ├── audit.json
│   │   ├── derived.json
│   │   └── source.json
Secondly, what is supposed to be in these files? The names don't suggest anything meaningful to me. It seems to me that we only need to capture the bibliographic metadata and a record of the operations performed on the item. In both cases there may be multiple files, so how about just:
exampleCProject
├── PMC1234
│   ├── metadata
│   │   ├── bib.json
│   │   └── citeproc.json
│   └── .logs
│       ├── 2016-04-26T15:01:18.133Z.json
│       ├── 2016-04-26T15:02:17.432Z.json
│       └── 2016-04-26T15:14:54.774Z.json
Where:
- metadata is the bibliographic metadata, in whatever formats (e.g. bibJSON or citeprocJSON)
- .logs is a hidden directory containing JSON files, each one a log output by any program that operates on the item, with ISO 8601 timestamps as the filenames. The object in any given file can contain details including the program/library and version that performed the operation, the command used (if applicable), the individual log events, whether any errors occurred, etc.

I like the most recent suggestion from @blahah, particularly the specific names for the JSON files and the .logs directory.
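For concreteness, writing one such timestamped log entry might look like this (the field names are my guesses at what @blahah describes, and the version string is a placeholder):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def write_log(paper_dir, program, version, command, events, errors=None):
    """Write one JSON log file into the hidden .logs directory, named
    with an ISO 8601 timestamp as proposed above."""
    logs = Path(paper_dir) / ".logs"
    logs.mkdir(exist_ok=True)
    stamp = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
    entry = {
        "program": program,
        "version": version,
        "command": command,
        "events": events,
        "errors": errors or [],
    }
    (logs / f"{stamp}.json").write_text(json.dumps(entry, indent=2))

write_log("exampleCProject/PMC1234", "getpapers", "x.y.z",
          'getpapers -q "sula bassana" -o sula -x -k 100',
          ["Found 4 open access results", "All XML downloads succeeded!"])
```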
Explanation of metadata:
- audit is an audit trail of the operations used to create information: which program was run on which data, with which parameters, against which target (URL). Or was it scraped from user disk? This is currently lost; we don't know where the cproject came from. Discuss: do we ever wish to rerun a getpapers? If so, do we replace or overwrite?
- source is the metadata emitted by the repo (if any). This is eupmc_results.json for EPMC. It is what the REPO thinks we need to know. It overlaps slightly with audit and significantly overlaps with (but possibly contradicts) article (or perhaps better document).
- article (I now prefer document) is the metadata extracted from the document. It might be the <head> in the HTML file or <front> in JATS. It is what the author and publisher want us to know.
- derived is metadata calculated by norma and ami. It might be the number of figures (which may not be in source or document). It could be the list of supplemental files, or possibly even describe scientific content such as maths or chemistry.

I suggested directories rather than names as we may wish to group information. I would certainly want both metadata.xml and metadata.json, to support stylesheets and XPath; the two would be semantically equivalent. I shan't fight for directories but I will fight for XML.
OK, so I think audit should be .logs - it's easier to understand what it does, and it's hidden because almost all users won't want or need to see it.
source, article and derived can all go in a single directory. I don't think it matters where the metadata comes from; it just needs to be resolved into a single record per article. Things like figure count can all go in the bibJSON (or whatever) along with other metadata.
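Resolving to a single record could be a simple precedence merge; a sketch, where the precedence order (derived over article over source) is my assumption:

```python
def resolve_metadata(source, article, derived):
    """Merge the three metadata dicts into one record per article.
    Later dicts win: derived values override article (publisher) values,
    which override the repository's source values."""
    merged = {}
    for record in (source, article, derived):
        merged.update(record)
    return merged

print(resolve_metadata(
    {"title": "Example title from the repo", "figures": None},
    {"title": "Example title from the document", "doi": "10.0000/example"},
    {"figures": 4},
))
```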
agree about .logs.
agree about BibJSON.
probably agree about the rest. I am mainly concerned about what we might get from repos other than EPMC and don't want to limit.
Regarding the pyCProject naming: I took type as this was the convention used when calling ami at that time, e.g. --sq.sequence --sq.type rna, but as this changes or we agree on a new convention I can adapt the code.
There is also the question of multiple files of the same type - the commonest are images (or figures) and tables. There is also supplemental data. I have introduced in the dev branch:
- table/, which contains whatever the paper contains, e.g. table1.csv, tab2.html, data3.xls. Note that tables may be transformed between formats and this directory structure allows this (e.g. table1.csv can be converted to table1.html).
- image/ - or figure might be better. Similar to table.
- there should also be supplemental, etc. (I think there are rudimentary stubs for this)
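The table1.csv to table1.html transform, for example, can be sketched with the standard library alone (assuming a plain CSV with simple cells; this is illustrative, not norma's converter):

```python
import csv
from html import escape
from pathlib import Path

def csv_table_to_html(csv_path):
    """Convert e.g. table/table1.csv into table/table1.html next to it."""
    path = Path(csv_path)
    with path.open(newline="") as f:
        rows = list(csv.reader(f))
    body = "\n".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    path.with_suffix(".html").write_text(f"<table>\n{body}\n</table>\n")

csv_table_to_html("PMC1234/table/table1.csv")
```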
There is a problem with empty directories created by quickscrape when downloads don't retrieve anything. See https://github.com/ContentMine/cmine/issues/14
I'm making this issue to try and formalise what should and shouldn't be in a CProject. Since the main interface between parts of the software is the filesystem tree of a CProject, there needs to be a standard so that people can write other programs to interface with it.
This is also an exercise in trying to keep our development decisions more open; if we want to attract outside contributors, the decision process needs to be as transparent as possible, as well as soliciting input from anyone interested.
In the following comments I'm setting out what I think the current position is for the various bits of software: specifically quickscrape, getpapers, norma, ami and the python CProject library. Then I'll make some suggestions as to how I think it should be laid out. Please make any suggestions you think are important. It's much better that we get this design right now than do it quickly.