benibela / xidel

Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
http://www.videlibri.de/xidel.html
GNU General Public License v3.0

extracting attributes using ``--xml`` option results in confusing output with whitespace #62

Open goekce opened 3 years ago

goekce commented 3 years ago

When I try to extract the attributes using XPath and --xml, I get empty XML output:

$ wget https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/main/src/ontology/doid.owl

$ xidel -se "//rdfs:label[text()='malignant hyperthermia']/../@rdf:about" doid.owl
http://purl.obolibrary.org/obo/DOID_8545

# extracting a single attribute in XML format does not work.
$ xidel --xml -se "//rdfs:label[text()='malignant hyperthermia']/../@rdf:about" doid.owl
<?xml version="1.0" encoding="UTF-8"?>
<xml>

</xml>

This behavior is confusing when I browse through an XML file. Does it make sense to extract a pseudo tag element which includes the searched attribute?

For example, xmllint outputs attribute="value" when an attribute is addressed.
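To make the two candidate renderings concrete, here is a minimal Python sketch using only the standard library. It is not how Xidel or xmllint work internally. The `SAMPLE` document is a tiny stand-in for doid.owl, and the `<attribute>` wrapper and both helper names are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for doid.owl (assumption: the real file is far larger).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/DOID_8545">
    <rdfs:label>malignant hyperthermia</rdfs:label>
  </owl:Class>
</rdf:RDF>"""

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def about_of_label(xml_text, label):
    """Return the rdf:about value of the element whose rdfs:label equals `label`."""
    root = ET.fromstring(xml_text)
    # ElementTree cannot select attribute nodes, so find the parent element first.
    parent = root.find(f".//*[rdfs:label='{label}']", NS)
    if parent is None:
        return None
    return parent.get(f"{{{NS['rdf']}}}about")


def as_pseudo_element(name, value):
    """Wrap an attribute in a hypothetical <attribute> element so it is valid XML output."""
    elem = ET.Element("attribute", {"name": name})
    elem.text = value
    return ET.tostring(elem, encoding="unicode")


value = about_of_label(SAMPLE, "malignant hyperthermia")
print(f'rdf:about="{value}"')                 # xmllint-style rendering
print(as_pseudo_element("rdf:about", value))  # pseudo-element rendering
```

The first rendering is compact but is not XML on its own. The second stays well-formed at the cost of inventing an element name that is not in the source document.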

benibela commented 3 years ago

I should probably change that. However, I cannot decide whether it should output attribute="value" or raise an error.

I wrote Xidel's own output code first and implemented the standard fn:serialize later; now I plan to merge them.

attribute="value" would be more useful, but serialize raises an error for the xml and text methods:

$ xidel  -se "serialize(//rdfs:label[text()='malignant hyperthermia']/../@rdf:about, {'method':'adaptive'})" doid.owl
rdf:about="http://purl.obolibrary.org/obo/DOID_8545"
$ xidel  -se "serialize(//rdfs:label[text()='malignant hyperthermia']/../@rdf:about, {'method':'xml'})" doid.owl
Error:
err:SENR0001: Cannot serialize attribute
$ xidel  -se "serialize(//rdfs:label[text()='malignant hyperthermia']/../@rdf:about, {'method':'text'})" doid.owl
Error:
err:SENR0001: Cannot serialize attribute
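
The behavior of the three calls above can be mimicked in a few lines of Python. This is only a sketch of the observed outputs, not Xidel's implementation. `serialize_attribute` and `SerializationError` are hypothetical names introduced for illustration.

```python
from xml.sax.saxutils import escape


class SerializationError(Exception):
    """Hypothetical stand-in for the err:SENR0001 error shown above."""


def serialize_attribute(name, value, method):
    """Mimic how a standalone attribute is serialized in the three calls above."""
    if method == "adaptive":
        # The adaptive method renders an attribute as name="value".
        return f'{name}="{escape(value, {chr(34): "&quot;"})}"'
    if method in ("xml", "text"):
        # Both methods fail for a bare attribute in the output above.
        raise SerializationError("err:SENR0001: Cannot serialize attribute")
    raise ValueError(f"unknown method: {method}")


print(serialize_attribute("rdf:about", "http://purl.obolibrary.org/obo/DOID_8545", "adaptive"))
```

Merging the two output paths means choosing, for --xml, between the first branch and the second.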

goekce commented 3 years ago

My impression is that --xml tries to output correct XML, which is why xidel always outputs a correct header even if the body is only useless whitespace. Is that what you mean by Xidel's output?

If --xml should continue to output correct XML, then it should not output <xml>rdf:about="http://purl.obolibrary.org/obo/DOID_8545"</xml>

However, I did not know about fn:serialize! If this function could be used easily via an option (without writing the function name and parentheses around the XPath expression), it would be very convenient for browsing an XML file, in my opinion.

My workflow:

benibela commented 3 years ago

My impression is that --xml tries to output correct XML, which is why xidel always outputs a correct header even if the body is only useless whitespace. Is that what you mean by Xidel's output?

Yes

Xidel also has options that serialize does not have, e.g. converting JSON to XML (which might have been a bad idea due to its very non-standard output):

$ xidel --output-format xml-wrapped -e '{"a":1}' 
**** Processing: data:,<empty/> ****
<?xml version="1.0" encoding="UTF-8"?>
<seq>
<e><object><a>1</a></object></e>
</seq>

If --xml should continue to output correct XML, then it should not output <xml>rdf:about="http://purl.obolibrary.org/obo/DOID_8545"</xml>

However, it is still correct XML even if it is text rather than an attribute
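
The well-formedness claim is easy to check with any XML parser. Below is a minimal sketch using Python's standard library, not Xidel: the parser accepts the document and reports the attribute-looking string as plain text content, with no attribute on the element.

```python
import xml.etree.ElementTree as ET

doc = '<xml>rdf:about="http://purl.obolibrary.org/obo/DOID_8545"</xml>'

# fromstring raises xml.etree.ElementTree.ParseError if the input is not well-formed.
root = ET.fromstring(doc)

print(root.tag)     # element name
print(root.text)    # the attribute-looking string is ordinary character data
print(root.attrib)  # no real attribute is present on <xml>
```

So the output would be valid as XML, but a consumer would have to parse the text to recover the attribute name and value.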

However, I did not know about fn:serialize! If this function could be used easily via an option (without writing the function name and parentheses around the XPath expression), it would be very convenient for browsing an XML file, in my opinion.

There is another standard XQuery way, but it is even worse:

xidel doid.owl -e 'declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization"; declare option output:method "xml"; //rdfs:label/...'

I could predefine the output namespace, but then it is non standard again

goekce commented 3 years ago

--xml should probably continue to output correct XML. If XML can also contain simple text, that would be an option, but it would not be consistent with Xidel's behavior when outputting node elements (where Xidel even additionally appends namespaces to child nodes).

The only idea I have is to introduce another option, such as --excerpt, that outputs only the matching parts of the input file, without the extra processing that --xml applies to its output.