ContentMine / cproject

ArgProcessor and files for basic CMDirectories. Often subclassed. Needs to be separate from euclid and norma
Apache License 2.0

Formalise the Structure of a CProject #10

Open tarrow opened 8 years ago

tarrow commented 8 years ago

I'm making this issue to try and formalise what should and shouldn't be in a CProject. Since the main interface between parts of the software is the filesystem tree of a CProject there needs to be a standard so that people can write other programs to interface with it.

This is also an exercise in trying to keep our development decisions more open; if we want to attract outside contributors the decisions process needs to be as transparent as possible as well as soliciting input from anyone interested.

In the following comments I'm setting out what I think the current position is from the various bits of software, specifically, quickscrape, getpapers, norma, ami and the python CProject library. Then I'll make some suggestions as to how I think it should be laid out. Please make any suggestions you think are important. It's much better that we get this design right now rather than do it quickly.

tarrow commented 8 years ago

Quickscrape

An example output folder after scraping two random papers, one open access and one not.

output
├── http_ijs.microbiologyresearch.org_content_journal_ijsem_10.1099_ijsem.0.001085
│   └── results.json
└── https_elifesciences.org_content_5_e10647v3
    ├── fulltext.pdf
    ├── fulltext.xml
    └── results.json
tarrow commented 8 years ago

Getpapers

An example output folder showing two papers from EuPMC. Notice that in this case the results file eupmc_results.json is a single blob for the whole project, rather than one per folder as in quickscrape. I'm not sure this is a good thing, because it means the per-paper folder structure is missing a chunk of information; the blob has to be kept with the right CProject so that this data isn't lost. It could all be split up and placed in a results.json per folder, like quickscrape. It also means that the data is stored in a file whose name changes depending on which API it was obtained from.

output
├── eupmc_fulltext_html_urls.txt
├── eupmc_results.json
├── PMC4683095
│   └── fulltext.xml
└── PMC4690148
    └── fulltext.xml
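
If the blob were split per folder as suggested, a minimal sketch might look like this (assuming each EuPMC record carries a pmcid field matching its folder name, which is a guess at the schema):

```python
import json
from pathlib import Path

def split_results(project_dir):
    """Split a project-wide eupmc_results.json into one results.json
    per paper folder (assumes each record has a 'pmcid' field)."""
    project = Path(project_dir)
    records = json.loads((project / "eupmc_results.json").read_text())
    for record in records:
        paper_dir = project / record["pmcid"]
        if paper_dir.is_dir():
            (paper_dir / "results.json").write_text(json.dumps(record, indent=2))
```

This would make the getpapers layout consistent with quickscrape's per-folder results.json.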
petermr commented 8 years ago

Agreed this is important. The key thing is to find unique IDs if possible. This is not always easy. Quickscrape uses URLs, and these clearly won't serve as unique identifiers.

Getpapers is normally pointed at a repository, but I'm not sure whether, say, arXiv has unique IDs (remember DOIs don't work for all documents).

blahah commented 8 years ago

Quickscrape can now also use incrementing integers as the directory names.

Arxiv does have unique ids, e.g. in http://arxiv.org/pdf/1601.00900v1.pdf the id is 1601.00900v1 (and the v1 part is the version)
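
A sketch of pulling the id and version out of such a URL (the regex is an assumption based only on the example above, covering modern-style arXiv identifiers):

```python
import re

def arxiv_id(url):
    """Extract (id, version) from an arXiv PDF URL such as
    http://arxiv.org/pdf/1601.00900v1.pdf -> ('1601.00900', 'v1').
    Returns None if no id-like pattern is found; version may be None."""
    match = re.search(r"(\d{4}\.\d{4,5})(v\d+)?", url)
    if not match:
        return None
    return match.group(1), match.group(2)
```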

tarrow commented 8 years ago

Norma Input

Norma can read input from files ending in the following extensions:

html,hocr.html,svg,pdf,xml,xhtml

I think it can currently read from whatever file is specified with -i, but it may fail if it doesn't meet a complex criterion called a 'reserved name'. This means that the file must either be named one of the following:

abstract.html,empty.xml,fulltext.docx,fulltext.html,fulltext.pdf,fulltext.pdf.txt,fulltext.tex,fulltext.tex.html,fulltext.txt,fulltext.txt.html,fulltext.xhtml,fulltext.xml,log.xml,results.json,results.xml,results.html,scholarly.html

or it must be a file of any name within a reserved directory with a name from this list:

image, results, pdf, supplementary, svg

I think this means that one could pass a command like norma -i image/myamazingpaper.pdf -o fulltext.pdf.txt --transform pdf2txt and it would be valid but norma -i myamazingpaper.pdf -o fulltext.pdf.txt --transform pdf2txt would fail.
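
The reserved-name rule as I understand it could be sketched like this (my reading of the behaviour described above, not norma's actual code):

```python
from pathlib import Path

# Reserved names copied from the lists above.
RESERVED_FILES = {
    "abstract.html", "empty.xml", "fulltext.docx", "fulltext.html",
    "fulltext.pdf", "fulltext.pdf.txt", "fulltext.tex", "fulltext.tex.html",
    "fulltext.txt", "fulltext.txt.html", "fulltext.xhtml", "fulltext.xml",
    "log.xml", "results.json", "results.xml", "results.html", "scholarly.html",
}
RESERVED_DIRS = {"image", "results", "pdf", "supplementary", "svg"}

def is_valid_input(path):
    """A file is accepted if its name is reserved, or if any of its
    parent directories has a reserved directory name."""
    p = Path(path)
    if p.name in RESERVED_FILES:
        return True
    return any(part in RESERVED_DIRS for part in p.parent.parts)
```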

tarrow commented 8 years ago

Norma Output

In terms of nlm2html, Norma will happily output to wherever you ask it to, with whatever extension you desire. For example, fulltextbloog.hftml is perfectly acceptable.

tarrow commented 8 years ago

Ami

Just using the new cmine command, we have an expected input file called scholarly.html.

This gives us many things as output in the CProject's CTrees. For example:

├── PMC4831192
│   ├── fulltext.xml
│   ├── gene.human.count.xml
│   ├── gene.human.snippets.xml
│   ├── results
│   │   ├── gene
│   │   │   └── human
│   │   │       └── empty.xml
│   │   ├── sequence
│   │   │   └── dnaprimer
│   │   │       └── empty.xml
│   │   ├── species
│   │   │   ├── binomial
│   │   │   │   └── results.xml
│   │   │   └── genus
│   │   │       └── empty.xml
│   │   └── word
│   │       └── frequencies
│   │           ├── results.html
│   │           └── results.xml
│   ├── scholarly.html
│   ├── sequence.dnaprimer.count.xml
│   ├── sequence.dnaprimer.snippets.xml
│   ├── species.binomial.count.xml
│   ├── species.binomial.snippets.xml
│   ├── species.genus.count.xml
│   ├── species.genus.snippets.xml
│   ├── word.frequencies.count.xml
│   └── word.frequencies.snippets.xml
├── sequence.dnaprimer.count.xml
├── sequence.dnaprimer.documents.xml
├── sequence.dnaprimer.snippets.xml
├── species.binomial.count.xml
├── species.binomial.documents.xml
├── species.binomial.snippets.xml
├── species.genus.count.xml
├── species.genus.documents.xml
├── species.genus.snippets.xml
├── word.frequencies.count.xml
├── word.frequencies.documents.xml
└── word.frequencies.snippets.xml

We can use the 'old style' ami commands, for example ami2-gene --g.gene --g.type human --project malaria, to perform an analysis which results in a tree like:

└── PMC4831192
    ├── fulltext.xml
    ├── results
    │   └── gene
    │       └── human
    │           └── empty.xml OR results.xml
    └── scholarly.html

We don't normally get snippets or summaries running this way. The way of generating summaries mentioned in the workshop notes no longer works (--analyze is an unknown argument). However, we can generate a snippets file using a command of the form: ami2-sequence --filter file\(\*\*/results.xml\) --project malaria2 -o summaryfile.xml. It looks like we typically make a plugin.option.snippets.xml when this kind of code is called from cmine or from some of the other argProcessors.

This is then processed into summary files, with the snippets files as the input, typically producing a plugin.option.count.xml and a plugin.option.documents.xml. I'm not actually sure exactly what these are; I think they are the total repeats of a given match, and the number of documents that contained one or more of those matches.
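
If that reading is right, the count/documents distinction could be computed like this sketch (the per-document data shape here is hypothetical, not the snippets XML format):

```python
from collections import Counter

def summarise(matches_per_document):
    """Given {doc_id: [matched terms]}, return (count, documents):
    count     - total occurrences of each term across the project
    documents - number of documents containing the term at least once."""
    count = Counter()
    documents = Counter()
    for terms in matches_per_document.values():
        count.update(terms)
        documents.update(set(terms))  # each doc contributes at most 1
    return count, documents
```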

tarrow commented 8 years ago

CProject python library

This simply looks for the following files: PaperFolder/scholarly.html

and PaperFolder/results/plugin/option/results.xml

however it uses the terminology 'type' rather than 'option'.

tarrow commented 8 years ago

Ami summarise (https://github.com/matthewgthomas/ami-summarise)

Seems to look for any XML file which contains 'frequency' or 'binomial' anywhere in its path.

It writes to these files (to quote):

The program will write three JSON files containing nodes and edge lists:

words.json -- the top X most frequent words and the articles in which they appear

words_tdidf.json -- same as above but calculated using term frequency-inverse document frequency (TF-IDF)

species.json -- occurrences of binomial species names and the articles in which they appear
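
That file search could be sketched as follows (a guess at the behaviour, not the actual implementation):

```python
from pathlib import Path

def find_summary_inputs(project_dir):
    """Find any .xml file whose path contains 'frequency' or 'binomial',
    mimicking the search described above."""
    return [p for p in Path(project_dir).rglob("*.xml")
            if "frequency" in str(p) or "binomial" in str(p)]
```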

tarrow commented 8 years ago

Tom's Summary

I think throughout this we have found three different classes of files in the CProject/CTree as people currently use it. We have:

  1. The text, images, raw content of the paper in various forms. (pdf, txt, scholarly html)
  2. Data about the papers (bibliographic metadata, facts - 'this species was here, surrounded by this', 'this image had something that looks like a diagram of methane in it', etc.)
  3. Summaries of this data (across the whole project)

We should keep information that is tied to just one paper, and only depends on that one paper, in a single paper folder as much as possible. I think, if possible, we should include the bibliographic metadata that we get from quickscrape and getpapers in the paper folder rather than globally. The Python CProject library parses the scholarly HTML to extract this bibliographic metadata and could also save it here if wanted.

Information that is dependent on more than one paper should be kept outside of these paper folders (obviously). But it still needs to be ordered in some way. Peter suggests in #5 that we have a summary folder in the root. I think this is a good idea. The more we put into the root folder the more we have to figure out what is and isn't a paper, or paper folder.

IMHO the logical rule is to have ami, or other tools that are (and only are) extracting facts from papers, place data into project/paper/results/plugin[/option]/results.xml. To be honest, I don't really see why we need to restrict it to results.xml (for example, bag-of-words makes HTML). It could be JSON, plain text, whatever you want.

We should then have all CProject wide summaries written to a summary folder in the root of the CProject. For example /project/summary/summarisername/projectWordCloud.html, /project/summary/summarisername2/full.dataTables.html and so on.

All that matters, I think, is that we don't allow people to have plugins with the same name (or they can trample on each other's data). Similarly, we don't want a summarisername to be duplicated, for the same reason.

We can then allow people to claim plugin names and state (to whatever level of precision they like) what the plugin will read and what it will output. This perhaps doesn't need to be done programmatically, but there should be somewhere we say what can do what.

The same can be true of summarisers: they can only summarise the output of certain plugin(s). Currently some bits of ami do summaries, but I think they should not, and this ought to be pulled out into another program.

tarrow commented 8 years ago

Also, I'm not sure if this will notify him, but I noticed that @robintw seems to have been writing some Python code to read CProjects. I will try to get in contact to find out what structure he currently relies on, and whether he has suggestions.

tarrow commented 8 years ago

Peter and I had a chat today and we proposed something like this as a layout for the CProject

exampleCProject
├── PMC1234
│   ├── metadata
│   │   ├── article
│   │   │   └── metadata.xml
│   │   ├── audit
│   │   │   └── metadata.xml
│   │   ├── derived
│   │   │   └── metadata.xml
│   │   └── source
│   │       └── metadata.xml
│   └── results
│       └── plugin
│           └── option
│               └── results.xml
└── summary
    └── summariser
        ├── full.dataTables.html
        └── plugin
            └── option
                └── snippets.xml
robintw commented 8 years ago

Thanks for looping me in on this.

At the moment, my Python code just depends on the output folder structure produced by quickscrape. That is (copied from above):

output
├── http_ijs.microbiologyresearch.org_content_journal_ijsem_10.1099_ijsem.0.001085
│   └── results.json
└── https_elifesciences.org_content_5_e10647v3
    ├── fulltext.pdf
    ├── fulltext.xml
    └── results.json

It assumes that all folders beneath output refer to a paper, and looks for results.json and then fulltext.xml, fulltext.pdf or scholarly.html inside each folder. If it can't find any of these then it just skips that folder.
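
For reference, that skip logic amounts to something like this sketch:

```python
from pathlib import Path

CONTENT_FILES = ("fulltext.xml", "fulltext.pdf", "scholarly.html")

def scan_output(output_dir):
    """Yield (paper_dir, content_file) for every folder under the
    quickscrape output that has a results.json and at least one
    content file; folders missing these are skipped."""
    for paper in sorted(Path(output_dir).iterdir()):
        if not paper.is_dir() or not (paper / "results.json").exists():
            continue
        for name in CONTENT_FILES:
            if (paper / name).exists():
                yield paper, name
                break
```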

I'm a little confused by @tarrow's most recent post - I can't quite see how that links to the quickscrape output structure. I assume PMC1234 is a paper, but if so, where would the fulltext.* files go? And also, is there a particular reason for switching from JSON to XML for the metadata? I personally find JSON a lot nicer...

petermr commented 8 years ago

Typical console output from getpapers is:

localhost:projects pm286$ getpapers -q "sula bassana" -o sula -x -k 100
info: Searching using eupmc API
info: Found 4 open access results
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
info: Got XML URLs for 4 out of 4 results
info: Downloading fulltext XML files
Downloading files [==============================] 100% (4/4) [0.2s elapsed, eta 0.0]
info: All XML downloads succeeded!

Proposal: we should capture much of this in audit

petermr commented 8 years ago

@robintw Thanks I think we'll probably use metadata.json. We might also include metadata.xml as a syntactic variant.

petermr commented 8 years ago

@robintw I think @tarrow missed out the fulltext.* children of PMC1234 :-)

blahah commented 8 years ago

I think we're close to something workable. The naming of things needs some attention - it needs to be clear what's what if these are sitting in the normal filesystem. Also we should minimise the verbosity and the depth of the directory structure.

Using the example you gave above @tarrow, and for now just focusing on 'metadata'. The original is:

exampleCProject
├── PMC1234
│   ├── metadata
│   │   ├── article
│   │   │   └── metadata.xml
│   │   ├── audit
│   │   │   └── metadata.xml
│   │   ├── derived
│   │   │   └── metadata.xml
│   │   └── source
│   │       └── metadata.xml

First thing, I think we should eliminate the redundant directories - just name the files:

exampleCProject
├── PMC1234
│   ├── metadata
│   │   ├── article.json
│   │   ├── audit.json
│   │   ├── derived.json
│   │   └── source.json

Secondly, what is supposed to be in these files? The names don't suggest anything meaningful to me.

It seems to me that we only need to capture:

In both cases there may be multiple files, so how about just:

exampleCProject
├── PMC1234
│   ├── metadata
│   │   ├── bib.json
│   │   └── citeproc.json
│   └── .logs
│       ├── 2016-04-26T15:01:18.133Z.json
│       ├── 2016-04-26T15:02:17.432Z.json
│       └── 2016-04-26T15:14:54.774Z.json
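
Writing one of those timestamped .logs records could look like this sketch (the record contents are an assumption; only the filename convention comes from the layout above):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def write_log(paper_dir, record):
    """Append one run record to the hidden .logs directory, using an
    ISO-8601 UTC timestamp (millisecond precision) as the filename."""
    logs = Path(paper_dir) / ".logs"
    logs.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    stamp = stamp.replace("+00:00", "Z")
    path = logs / (stamp + ".json")
    path.write_text(json.dumps(record, indent=2))
    return path
```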

Where:

robintw commented 8 years ago

I like the most recent suggestion from @blahah, particularly the specific names for the json files and the .logs directory.

petermr commented 8 years ago

Explanation of metadata:

I suggested directories rather than names as we may wish to group information. I would certainly want both metadata.xml and metadata.json, to support stylesheets and XPath; the two would be semantically equivalent. I shan't fight for directories, but I will fight for XML.
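
Keeping the two semantically equivalent could be as simple as deriving one from the other; a minimal sketch for flat metadata (nested structures would need more thought):

```python
import xml.etree.ElementTree as ET

def json_to_xml(data, root_tag="metadata"):
    """Render a flat JSON object as child elements of <metadata>,
    so metadata.xml can be generated as a syntactic variant of
    metadata.json. Only flat string-like values are handled here."""
    root = ET.Element(root_tag)
    for key, value in data.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")
```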

blahah commented 8 years ago

OK, so I think audit should be .logs - it's easier to understand what it does, and hidden because almost all users won't want or need to see it.

source, article and derived can all go in a single directory. I don't think it matters where the metadata comes from; it just needs to be resolved into a single record per article. Things like figure count can all go in the BibJSON (or whatever) along with other metadata.

petermr commented 8 years ago

agree about .logs.

agree about BibJSON.

probably agree about the rest. I am mainly concerned about what we might get from repos other than EPMC and don't want to limit.

chreman commented 8 years ago

regarding the pyCProject naming: I took type as this was the convention used when calling ami at that time, e.g. --sq.sequence --sq.type rna, but as this changes or we agree on a new convention I can adapt the code.

petermr commented 8 years ago

There is also the question of multiple files of the same type - the commonest are images (or figures) and tables. There is also supplemental data. I have introduced in the dev branch:

There should also be supplemental data, etc. (I think there are rudimentary stubs for this.)


petermr commented 8 years ago

There is a problem with empty directories created by quickscrape when downloads don't retrieve anything. See https://github.com/ContentMine/cmine/issues/14
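
A sketch of a cleanup pass that removes such empty directories (whether this belongs in quickscrape itself or in a separate tool is an open question):

```python
from pathlib import Path

def prune_empty_dirs(project_dir):
    """Remove empty paper directories left behind when downloads
    retrieve nothing. Returns the paths that were removed."""
    removed = []
    for child in sorted(Path(project_dir).iterdir()):
        if child.is_dir() and not any(child.iterdir()):
            child.rmdir()
            removed.append(child)
    return removed
```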