ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

empty folders created #45

Open chreman opened 9 years ago

chreman commented 9 years ago

In cases where neither PDF nor XML is found, folders are created anyway. This may be irritating when interpreting results and when working with e.g. norma and other tools.

petermr commented 9 years ago

I will review with the team tomorrow what the best strategies are for creating new ctree/cmdirs. It may require a specific command.

rossmounce commented 9 years ago

I'm not hugely against empty folders tbh. It's a visible reminder of a paper or patent that matches the getpapers search, something more easily/quickly seen than making a human-readable version of the JSON file (btw jq, which is typically used to do this, is not installed in the VM).

Are the empty folders inconsistent? What problems do they cause?
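
Where jq is not available, a readable view of a results file can be produced with Python's standard json module, assuming Python is installed on the VM (the thread does not confirm this). The sketch below is structure-agnostic: it makes no assumption about the fields getpapers writes, and the function name `pretty` is purely illustrative.

```python
import json
from pathlib import Path


def pretty(json_path: Path) -> str:
    """Return the JSON file's contents re-serialised with indentation."""
    data = json.loads(json_path.read_text(encoding="utf-8"))
    # sort_keys makes two result files easier to compare by eye
    return json.dumps(data, indent=2, ensure_ascii=False, sort_keys=True)
```

The same effect is available from the command line with `python -m json.tool <file>`.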

petermr commented 9 years ago

We have not fully described what a CTree directory SHOULD or MUST look like. The current approach is that we have a metadata.json file, but that hasn't been added yet. So I would argue that a CTree MUST have a metadata.json file which acts as (a) a marker that this is a CTree, (b) a log of what has been done and (c) a record of what the contents currently are.
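
The marker-file idea could be sketched roughly as follows. This is only an illustration of the proposal, not existing getpapers or norma behaviour: the function names and the `log` and `contents` keys are hypothetical, since no schema for metadata.json has been agreed in this thread.

```python
import json
from pathlib import Path

MARKER = "metadata.json"  # filename taken from the comment above


def is_ctree(directory: Path) -> bool:
    """(a) A directory counts as a CTree only if the marker file is present."""
    return (directory / MARKER).is_file()


def init_ctree(directory: Path, source: str) -> Path:
    """Create the directory and write a minimal marker file if none exists.

    The record layout here is an assumption made for illustration.
    """
    directory.mkdir(parents=True, exist_ok=True)
    marker = directory / MARKER
    if not marker.exists():
        record = {
            "log": [f"created by {source}"],  # (b) what has been done
            "contents": [],                   # (c) what the directory holds
        }
        marker.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return marker
```

With a marker like this, a folder for a paper with no PDF or XML would no longer be empty, and a downstream tool could distinguish "valid CTree with no fulltext yet" from "not a CTree".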

chreman commented 9 years ago

It causes problems when using norma, which gives e.g.

245 [main] DEBUG org.xmlcml.cmine.args.DefaultArgProcessor - ... No reserved files or directories: dir: dinosaurs-eumpc/PMC3633922

When I remove all empty folders, norma runs. So either norma accepts empty folders (but then we have a problem with the definition of a minimum ctree, because what should norma put into this folder? It could also leave it continuously empty), or getpapers creates no empty folders.
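
The workaround of removing the empty folders before running norma can be sketched as below. This is a minimal illustration, not part of getpapers or norma; the function names are hypothetical, and it only looks at the immediate subdirectories of the output directory.

```python
from pathlib import Path


def empty_subdirs(root: Path) -> list[Path]:
    """List immediate subdirectories of root that contain no entries at all."""
    return sorted(p for p in root.iterdir() if p.is_dir() and not any(p.iterdir()))


def remove_empty_subdirs(root: Path, dry_run: bool = True) -> list[Path]:
    """Report the empty subdirectories and, unless dry_run is set, delete them."""
    found = empty_subdirs(root)
    if not dry_run:
        for p in found:
            p.rmdir()  # rmdir refuses to delete a non-empty directory
    return found
```

For the output directory in the log above, `remove_empty_subdirs(Path("dinosaurs-eumpc"))` would list the candidates first, and passing `dry_run=False` would then delete them.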

petermr commented 9 years ago

Is the Norma message an inconvenience or a This means we need metadata.json or similar as a priority for identifying a ctree. So getpapers should really create this file.

chreman commented 9 years ago

Ah yes, also because quickscrape creates a result.json in each ctree, and getpapers creates an apiname_results.json, which gets overwritten with each search.