ContentMine / workshop-resources

This repository contains material helping you to set up a ContentMine workshop. It also includes tutorials for learning the ContentMine tools on your own.

Cproject Structure Query #65

Open danmaclean opened 6 years ago

danmaclean commented 6 years ago

Hi,

This document https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/cproject presents itself as the definitive description of the CProject structure, but it appears to be at odds with this document about the output of ami: https://github.com/ContentMine/workshop-resources/blob/master/software-tutorials/ami/README.md#ami2-species. In the CProject definition, the contents of, say, a sequence results directory look much simpler than the results described in the tutorial.

CProject folder structure:

│   ├── results
│   │   ├── sequence
│   │   │   └── dnaprimer
│   │   │       └── empty.xml

ami output tutorial:

│   ├── results
│   │   ├── sequence
│   │   │   └── rna
│   │   │       └── empty.xml
│   │   │   └── dna
│   │   │       └── empty.xml
│   │   │   └── prot
│   │   │       └── empty.xml

I'm trying to write a parser for CProjects. Could you let me know whether the ami tools are going to produce lots of directories (e.g. ami2seq will generate sequence/sequencetype folders) or, as the CProject document suggests, just the sequence/dnaprimer folder? Or is the info in one of these docs out of date?

Thanks for clarification.

ghost commented 6 years ago

@danmaclean also see: https://github.com/ContentMine/cproject/issues/10 .

petermr commented 6 years ago

Dan, first of all, many thanks for working on ContentMine. I'm happy to talk more about your requirements and interests.

The CM data structure is intentionally somewhat fluid because we are reacting to the very wide range of structures and information that people use in scientific communication. The philosophy is perhaps closer to JSON and other lightly typed structures than to the rigidity of XML schemas and DTDs. In the case you give, the names rna, dnaprimer, etc. are determined by the dictionaries or query types used in the query. The output in the ami tutorial will have come from a query for any of sequence(rna, dna, prot), while the one in the CProject document was for sequence(dnaprimer). This means that the names of the directories depend on the query: most are optional and may be set by the user's choice of dictionaries. If I use the dictionaries foo.xml and bar.xml, then the output will be of the form:

│   ├── results
│   │   ├── dict
│   │   │   └── foo
│   │   │       └── empty.xml
│   │   │   └── bar
│   │   │       └── empty.xml
...

This means that a parser will have fewer hard-coded names and more that are determined at runtime.

I think that JSON is a good analogy here (and indeed the output could be transformed into JSON). It makes parsing more challenging than with hard-coded names, and it means that tools such as XPath and JSONPath are often useful.
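To give a flavour of the XPath style, here is a small Python sketch using the limited XPath subset in the standard library's xml.etree.ElementTree. The sample XML is invented purely for illustration and should not be read as the actual ami results schema; the helper names `tag_counts` and `attribute_values` are likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document; element and attribute names are assumptions.
SAMPLE = """<results title="foo">
  <result match="alpha"/>
  <result match="beta"/>
  <note>hypothetical</note>
</results>"""


def tag_counts(xml_text):
    """Count descendant elements by tag without assuming any tag names."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iterfind(".//*"))


def attribute_values(xml_text, attr):
    """Collect the values of `attr` from every descendant element that carries it."""
    root = ET.fromstring(xml_text)
    return [el.get(attr) for el in root.iterfind(f".//*[@{attr}]")]


if __name__ == "__main__":
    print(tag_counts(SAMPLE))
    print(attribute_values(SAMPLE, "match"))
```

The wildcard path `.//*` is what lets the code discover element names instead of assuming them; a fuller XPath engine would allow richer queries, but the idea is the same.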

(The info is probably also out of date in places. Sorry, but that is often the case with evolving projects.)