Open danmaclean opened 6 years ago
@danmaclean also see: https://github.com/ContentMine/cproject/issues/10 .
Dan, first, many thanks for working on ContentMine. I'm happy to talk more about your requirements and interests.
CM data structure is intentionally somewhat fluid because we are reacting to the very wide range of structures and information that people use in scientific communication. The philosophy is perhaps similar to JSON and other lightly typed structures rather than the rigidity of XML schemas and DTDs.
In the case you give, the names rna, dnaprimer, etc. are determined by the dictionaries or query types that are used in the query. The first query will have been for any of sequence(rna, dna, prot) while the second was for sequence(dnaprimer). This means that the names of the directories depend on the query: most are optional and may be set by the user's choice of dictionaries. If I use the dictionaries foo.xml and bar.xml then the output will be of the form:
│   ├── results
│   │   ├── dict
│   │   │   ├── foo
│   │   │   │   └── empty.xml
│   │   │   └── bar
│   │   │       └── empty.xml
...
This means that a parser will have fewer hard coded names and more that are determined at runtime.
I think that JSON is a good analogy here (and indeed the output could be transformed into JSON). It makes parsing more challenging than hardcoded names and means that tools such as XPath and JSONPath are often useful.
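To illustrate what "determined at runtime" could look like in a parser, here is a minimal Python sketch. It assumes only the layout shown above (a results directory containing one directory per plugin or query type, each containing one directory per dictionary or sub-type). The function name discover_results and the parameter article_dir are hypothetical, chosen for this example, and are not part of any ContentMine tool.

```python
import json
from pathlib import Path


def discover_results(article_dir):
    """Return {plugin: {name: [files]}} for everything under <article_dir>/results.

    Only the directory name 'results' is hard-coded; plugin names (e.g. dict,
    sequence) and sub-type names (e.g. foo, dnaprimer) are read from disk.
    'article_dir' is a hypothetical name for the directory holding 'results'.
    """
    results_dir = Path(article_dir) / "results"
    layout = {}
    if not results_dir.is_dir():
        return layout  # nothing has been run on this directory yet
    for plugin_dir in sorted(p for p in results_dir.iterdir() if p.is_dir()):
        names = {}
        for name_dir in sorted(p for p in plugin_dir.iterdir() if p.is_dir()):
            # Record the file names found, whatever they are called.
            names[name_dir.name] = sorted(
                f.name for f in name_dir.iterdir() if f.is_file()
            )
        layout[plugin_dir.name] = names
    return layout


if __name__ == "__main__":
    import sys

    # Usage: python discover.py path/to/article_dir
    print(json.dumps(discover_results(sys.argv[1]), indent=2))
```

Because the result is a plain nested dictionary, it can be dumped to JSON directly, and the parser never needs to know in advance whether a run produced sequence/dnaprimer, dict/foo, or something else.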
(The info is probably also out of date in places, sorry! But that is often the case with evolving projects.)
Hi,
This document https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/cproject seems to claim to be definitive about CProject structure, but seems to be at odds with this document about the output of ami: https://github.com/ContentMine/workshop-resources/blob/master/software-tutorials/ami/README.md#ami2-species. In the CProject definition the extent of, say, a sequence results directory looks to be much simpler than the apparent results described in the tutorial.
CProject folder structure:
ami output tutorial:
I'm trying to write a parser for CProjects. Could you let me know whether the ami tools are going to produce lots of directories (e.g. ami2seq will generate sequence/sequencetype folders) or, as the CProject document suggests, will it generate just the sequence/dnaprimer folder? Or is the info in one of these docs out of date?
Thanks for clarification.