ContentMine / cproject

ArgProcessor and files for basic CMDirectories. Often subclassed. Needs to be separate from euclid and norma
Apache License 2.0

Ensure SnippetsTrees are written with file = unix filename #7

Open tarrow opened 8 years ago

tarrow commented 8 years ago

results.xml contains lines like:

<results>
<result pre="ric multi-attribute utility values " name0="exclude" value0="exclude" post="important domains and non-health outcomes, while p" xpath="/*[local-name()='html'][1]/*[local-name()='body'][1]/*[local-name()='div'][1]/*[local-name()='div'][3]/*[local-name()='div'][1]/*[local-name()='p'][1]"/>
</results>

A snippetsTree is an element that contains one or more results elements. A projectSnippetsTree is an element that contains snippetsTree elements, one for each paper that is addressed.

However, we build snippetsTrees directly from results. Indeed, the current code in SnippetsTree.java relies on the results being saved in a file called exactly results/pluginname/option/results.xml (see line 107). This doesn't make sense, because snippetsTrees are written to files named plugin.option.snippets.xml, which makes it impossible to read a snippetsTree back in from its file and get a valid object.
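To make the mismatch concrete, here is a minimal sketch of the two naming conventions described above. It is not the code in SnippetsTree.java: the class PluginOption and both method names are hypothetical, and the plugin and option names used in main ("word", "frequencies") are only illustrative. It assumes paths already use forward slashes.

```java
import java.util.Optional;

// Hypothetical helper, not part of cproject: holds a plugin/option pair.
final class PluginOption {
    final String plugin;
    final String option;

    PluginOption(String plugin, String option) {
        this.plugin = plugin;
        this.option = option;
    }

    // Expects a path ending in results/<plugin>/<option>/results.xml.
    static Optional<PluginOption> fromResultsPath(String unixPath) {
        String[] parts = unixPath.split("/");
        int n = parts.length;
        if (n >= 4 && parts[n - 1].equals("results.xml") && parts[n - 4].equals("results")) {
            return Optional.of(new PluginOption(parts[n - 3], parts[n - 2]));
        }
        return Optional.empty();
    }

    // Expects a bare file name of the form <plugin>.<option>.snippets.xml.
    static Optional<PluginOption> fromSnippetsName(String fileName) {
        String suffix = ".snippets.xml";
        if (!fileName.endsWith(suffix)) {
            return Optional.empty();
        }
        String stem = fileName.substring(0, fileName.length() - suffix.length());
        int dot = stem.indexOf('.');
        if (dot <= 0 || dot == stem.length() - 1) {
            return Optional.empty();
        }
        return Optional.of(new PluginOption(stem.substring(0, dot), stem.substring(dot + 1)));
    }

    public static void main(String[] args) {
        // A parser written for one convention finds nothing in the other.
        System.out.println(fromResultsPath("word.frequencies.snippets.xml").isPresent());
        System.out.println(fromSnippetsName("word.frequencies.snippets.xml").isPresent());
    }
}
```

A reader that only understands the first convention returns nothing for a file written with the second, which is the round-trip failure described above.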

I think this shows that the ami code has taken on two different functions that should be more strongly decoupled: 1) mining information from papers and 2) formatting it for human reading.

A machine doesn't really need either the snippetsTree or the projectSnippetsTree. We should probably stop making these (including in cases where they contain post-processed mining data such as word counts) and leave that to a tool further down the line.

tarrow commented 8 years ago

I misunderstood this. We don't actually get the filename from the name of the file; it comes from an XML attribute in the snippetsTree. That attribute is obviously written in a platform-dependent format, which I'll now track down and try to fix.
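One possible shape for the fix is to normalise the separators before the value is written to the attribute. This is only a sketch under assumptions: the class UnixFileNames and the method toUnix are hypothetical names, not existing cproject code, and the example path is illustrative.

```java
// Hypothetical helper, not part of cproject.
final class UnixFileNames {
    private UnixFileNames() {
    }

    // Replaces Windows-style backslash separators with forward slashes.
    // Caveat: a backslash is a legal file name character on Unix, so this
    // assumes no file name in the project contains one.
    static String toUnix(String fileName) {
        if (fileName == null) {
            return null;
        }
        return fileName.replace('\\', '/');
    }

    public static void main(String[] args) {
        // prints results/word/frequencies/results.xml
        System.out.println(toUnix("results\\word\\frequencies\\results.xml"));
    }
}
```

Applying this at write time would keep the attribute identical whichever platform produced the file. Normalising at read time as well would tolerate files that have already been written.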

tarrow commented 8 years ago

We probably don't need to do this if all of the logic is migrated into wanda. ProjectSnippetsTrees are probably all the summary output that we will write. They do still contain the file names, but this is probably not necessary: they are only there because without them we don't know which plugin wrote the results file (this should really be stored in the results element itself).