ContentMine / cproject

ArgProcessor and files for basic CMDirectories. Often subclassed. Needs to be separate from euclid and norma
Apache License 2.0

CTrees without (reserved) child files #14

Open petermr opened 8 years ago

petermr commented 8 years ago

[See also https://github.com/ContentMine/cmine/issues/10 ]

Until recently, CTrees were generated either locally or through getpapers or quickscrape. The automatically generated files include at least one reserved file, such as fulltext.pdf, and CMine software used this to determine which directories in a CProject are actually CTrees. This was always recognised to be a heuristic, and recently, with bulk download of metadata from Crossref, we see many potential CTrees without reserved files or even without any files. Here's a simple example:

├── PMC4678086
│   ├── eupmc_result.json
│   ├── fulltext.pdf
│   └── fulltext.xml
├── http_dx.doi.org_10.1001_jama.2016.7992
│   └── results.json
└── http_dx.doi.org_10.1007_s13201-016-0429-9

The first directory is retrieved by quickscrape from EPMC, and the heuristics indicate it to be a potential CTree. The other two are retrieved by getpapers on Crossref followed by quickscrape, which creates only metadata; currently they are not flagged as CTrees. The empty directory is created (I think) by quickscrape, which then fails to retrieve anything.
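To make the heuristic concrete, here is a minimal Java sketch of how the three example directories would be classified. It is an illustration only, not the CMine code: the class `ReservedFileHeuristic` and the method `looksLikeCTree` are hypothetical names, and treating `fulltext.xml` as reserved alongside `fulltext.pdf` is an assumption.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the reserved-file heuristic; not the actual CMine implementation.
class ReservedFileHeuristic {

    // fulltext.pdf is named as a reserved file in this issue.
    // Assumption: fulltext.xml is treated as reserved too.
    static final Set<String> RESERVED = Set.of("fulltext.pdf", "fulltext.xml");

    // A directory looks like a CTree if at least one of its child file names is reserved.
    static boolean looksLikeCTree(List<String> childFileNames) {
        for (String name : childFileNames) {
            if (RESERVED.contains(name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // PMC4678086: contains reserved files, so it is flagged.
        System.out.println(looksLikeCTree(
                List.of("eupmc_result.json", "fulltext.pdf", "fulltext.xml"))); // true
        // http_dx.doi.org_10.1001_jama.2016.7992: metadata only, not flagged.
        System.out.println(looksLikeCTree(List.of("results.json")));           // false
        // http_dx.doi.org_10.1007_s13201-016-0429-9: empty, not flagged.
        System.out.println(looksLikeCTree(List.of()));                         // false
    }
}
```

Under these assumptions only PMC4678086 passes, which matches the behaviour described above.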

The original motivation for the heuristics is that we may introduce new reserved directories into a CProject and users might also introduce non-CTree directories. There was also the idea that we have a reserved file (e.g. metadata.json or log.xml) in any CTree directory. At present I favour this, and we should discuss what should go in it.

Currently I have added a switch

        cProject.setTreatAllChildDirectoriesAsCTrees(true);

which allows users to toggle this behaviour. I will also add results.json to the reserved files that flag "CTree-ness".
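As a rough sketch of how the switch and the extra reserved file might interact, consider the standalone class below. Only the setter name `setTreatAllChildDirectoriesAsCTrees` comes from this issue; the class `CTreePolicy`, the methods `addReservedFile` and `isCTree`, and the internal logic are hypothetical and do not describe how CProject is actually implemented.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; only the setter name mirrors the call quoted in this issue.
class CTreePolicy {

    // Start with the one reserved file named in the issue.
    private final Set<String> reservedFiles = new HashSet<>(Set.of("fulltext.pdf"));
    private boolean treatAllChildDirectoriesAsCTrees = false;

    void setTreatAllChildDirectoriesAsCTrees(boolean value) {
        this.treatAllChildDirectoriesAsCTrees = value;
    }

    // Hypothetical helper for extending the reserved set, e.g. with results.json.
    void addReservedFile(String name) {
        reservedFiles.add(name);
    }

    // With the switch on, every child directory counts as a CTree;
    // otherwise fall back to the reserved-file heuristic.
    boolean isCTree(List<String> childFileNames) {
        if (treatAllChildDirectoriesAsCTrees) {
            return true;
        }
        return childFileNames.stream().anyMatch(reservedFiles::contains);
    }

    public static void main(String[] args) {
        CTreePolicy policy = new CTreePolicy();
        policy.addReservedFile("results.json");
        System.out.println(policy.isCTree(List.of("results.json"))); // true
        System.out.println(policy.isCTree(List.of()));               // false
        policy.setTreatAllChildDirectoriesAsCTrees(true);
        System.out.println(policy.isCTree(List.of()));               // true
    }
}
```

In this sketch, adding results.json to the reserved set picks up the metadata-only directory, but an empty directory is only accepted when the switch is on.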