copy/paste of queries at the bottom of the query_SDGX.xml pages into Scopus fails

erikkemperman commented 3 years ago

Describe the bug: Copy/pasting the entire query at the bottom of the pages into Scopus advanced search gives syntax errors.

To Reproduce: Steps to reproduce the behavior:

Go to https://aurora-network-global.github.io/sdg-queries/query_SDG1.xml
Copy the query at the bottom of the page
Paste it into Scopus advanced search
See error

Expected behavior: These queries should work, or the "usage" text should be adapted?

Desktop (please complete the following information):

OS: Linux
Browser: Chromium
Version: 89.0.4389.82

Additional context: The problem seems to be that the "\nOR\n" separator used to concatenate the subqueries is disregarded by the Scopus query editor, so it tries to execute queries containing "ORTITLE-ABS-KEY".
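A minimal sketch of the failure mode and one possible workaround, assuming the sub-queries are available as separate strings. The names `join_subqueries` and `strip_newlines` are hypothetical, and `strip_newlines` only imitates what the Scopus editor appears to do on paste; it is not taken from Scopus itself.

```python
def join_subqueries(subqueries):
    """Join sub-queries with a space-delimited OR instead of newlines."""
    return " OR ".join(q.strip() for q in subqueries)


def strip_newlines(text):
    """ASSUMPTION: imitate an editor that silently drops newlines on paste."""
    return text.replace("\n", "")


subqueries = [
    'TITLE-ABS-KEY("poverty line*")',
    'TITLE-ABS-KEY("poverty indicator*")',
]

# Newline-only separators collapse into "ORTITLE-ABS-KEY" once newlines are dropped.
broken = strip_newlines("\nOR\n".join(subqueries))

# Plain spaces around OR survive the same treatment.
fixed = strip_newlines(join_subqueries(subqueries))

print(broken)
print(fixed)
```

With ordinary spaces around `OR`, the separator no longer depends on how the target editor treats line breaks.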

mosart commented 3 years ago

Hi Erik,

The XML is rendered to HTML by the XSL stylesheet: https://github.com/Aurora-Network-Global/sdg-queries/blob/master/queries.xsl

I tried to fix that by adding a non-breaking space after the OR statement in the XSL template. https://github.com/Aurora-Network-Global/sdg-queries/commit/f160f071607a27e17c6614153b2530a638803be9#diff-34b02004b91728a359a521c831ffef74d95aebdd3b9c03788288b8e9aaa4bcb6

However, when I tried this, the rendering of the HTML broke completely. (I could not test it on my laptop: somehow a modern web browser does not render XML to HTML when the file is opened on localhost, only when it is accessed via https.) So I changed it back.

If you have a tested solution for forcing a space after the OR statement in the XSL, please help me out.

Warm regards, Maurice

erikkemperman commented 3 years ago

Hi Maurice,

Thanks for taking a look at the issue -- I am not sure how to remedy it, I'm afraid, and I have changed my approach since I reported this. I am now just parsing the raw XML files and compositing the Scopus queries in a Python script. This suits me better anyway, since my goal is to transform the queries to work on Postgres.
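Roughly, the approach looks like the sketch below. This is not the actual script: the namespace URI, the `aqd:query` root element, and the function name `compose_scopus_query` are placeholders, and joining the lines with `OR` is an assumption based on how the rendered pages concatenate sub-queries. Only `aqd:query-line` and its `field` attribute come from the Aurora files.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: placeholder namespace URI; the real one is declared in the XML files.
AQD_NS = "http://example.org/aqd"

# ASSUMPTION: the <aqd:query> wrapper is made up for this sketch.
SAMPLE = f"""<aqd:query xmlns:aqd="{AQD_NS}">
  <aqd:query-line field="TITLE-ABS-KEY">
    ("poverty line*")
  </aqd:query-line>
  <aqd:query-line field="TITLE-ABS-KEY">
    ("poverty indicator*")
  </aqd:query-line>
</aqd:query>"""


def compose_scopus_query(xml_text):
    """Collect every query-line and join them with a space-delimited OR."""
    root = ET.fromstring(xml_text)
    parts = []
    for line in root.iter(f"{{{AQD_NS}}}query-line"):
        field = line.get("field")
        # Collapse the indentation and newlines inside the element text.
        body = " ".join((line.text or "").split())
        parts.append(f"{field}({body})")
    return " OR ".join(parts)


print(compose_scopus_query(SAMPLE))
```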

Regards, Erik

mosart commented 3 years ago

Nice! That is why we put it in XML: for automation and human readability. If you want, you can tell me more about your project, and perhaps also share the transformation script (like IDfuse did for the Elasticsearch DSL). You are working at Erasmus University, right?

erikkemperman commented 3 years ago

Yes, I am an RSEC at Erasmus!

Agreed that XML is a nice format for this kind of thing -- although to facilitate translation of the queries to other languages, it might be worthwhile to consider making things a bit finer-grained, and perhaps slightly less Scopus-centric (although I understand those are the origins).

Just as an example,

<aqd:query-line field="TITLE-ABS-KEY">
  ("poverty line*") OR ("poverty indicator*")
</aqd:query-line>

To write a script to translate this to other query languages, I need to parse first the XML and then the Scopus query (for which, to my knowledge, no explicit grammar is publicly available so I've had to cobble something together myself using Antlr).

Suppose, instead, the XML looked something like this:

<aqd:query-line field="TITLE-ABS-KEY">
  <aqd:query-or>
    <aqd:query-parens>
      "poverty line*"
    </aqd:query-parens>
    <aqd:query-parens>
      "poverty indicator*"
    </aqd:query-parens>
  </aqd:query-or>
</aqd:query-line>

That way the tree structure of the query is reflected explicitly in the XML, and it would be much easier to transform to other query languages. Of course, the XSLT to render the Scopus queries would become a bit more complicated. Now that I have an ANTLR grammar that appears to parse the Scopus queries correctly, I suppose it would be pretty easy to use it to convert the current XML into the tree-structured XML automatically, so that wouldn't have to be done manually.
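To illustrate why the explicit tree would be easier to work with, here is a small sketch that renders the proposed structure back into a Scopus string with one recursive function. The element names `query-or` and `query-parens` are only my proposal (written with the `aqd` prefix throughout), the namespace URI is a placeholder, and `render_scopus` is a hypothetical helper, not part of the repository.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: placeholder namespace URI, not the one used by the Aurora files.
AQD_NS = "http://example.org/aqd"

TREE_XML = f"""<aqd:query-line xmlns:aqd="{AQD_NS}" field="TITLE-ABS-KEY">
  <aqd:query-or>
    <aqd:query-parens>
      "poverty line*"
    </aqd:query-parens>
    <aqd:query-parens>
      "poverty indicator*"
    </aqd:query-parens>
  </aqd:query-or>
</aqd:query-line>"""


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def render_scopus(element):
    """Recursively render one node of the proposed tree as Scopus syntax."""
    name = local_name(element)
    children = [render_scopus(child) for child in element]
    text = " ".join((element.text or "").split())
    inner = " ".join(children) if children else text
    if name == "query-or":
        return " OR ".join(children)
    if name == "query-parens":
        return f"({inner})"
    if name == "query-line":
        return f"{element.get('field')}({inner})"
    raise ValueError(f"unsupported element: {name}")


print(render_scopus(ET.fromstring(TREE_XML)))
```

A renderer for another query language would only need a different set of branches in the same function; no Scopus parser is involved.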

As an aside, I'm beginning to regret the choice (not mine) for Postgres. The argument at the time was that it supports something like Scopus' W/N proximity operator. But playing around with this, and reading up on Postgres' <N> operator, it's actually subtly different.

For one thing, the Scopus proximity operator is not directional, i.e. A W/3 B matches the same documents as B W/3 A. This is not true in Postgres, so to get an equivalent query I have to emit extra clauses, e.g. (A <3> B) || (B <3> A). (*)

Another complication is that Scopus' W/3 means "within 3 or fewer words/lexemes", but the Postgres operator is exact. So actually, the equivalent of A W/3 B would be something like (A <1> B) || (B <1> A) || (A <2> B) || (B <2> A) || (A <3> B) || (B <3> A).
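The expansion for a single two-term proximity can be sketched as follows. `expand_proximity` is a hypothetical helper name; the `||` separator simply mirrors the notation used above (inside a single `to_tsquery` string the OR operator is written `|`, which the `joiner` parameter allows for).

```python
def expand_proximity(a, b, n, joiner=" || "):
    """Emulate the undirected, up-to-n 'a W/n b' with exact, directed <d> clauses."""
    clauses = []
    for distance in range(1, n + 1):
        # One clause per direction, because <d> is order-sensitive.
        clauses.append(f"({a} <{distance}> {b})")
        clauses.append(f"({b} <{distance}> {a})")
    return joiner.join(clauses)


print(expand_proximity("A", "B", 3))
```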

Of course, these problems compound very quickly if multiple proximity operators occur in a single query: if I am given A W/3 B W/3 C, I will have to emit clauses for each permutation of A, B, and C (6 of them) combined with the cartesian product of the two ranges 1, 2, 3 (9 combinations), for a total of 54 (!) clauses. And this is a trivial example; you can imagine I am ending up with some gigantic queries for the real thing!
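The blow-up is easy to reproduce with `itertools`. The sketch below follows the counting scheme just described (every ordering of the terms, times every combination of gap distances); `expand_chain` is a hypothetical name, and I have not checked that the result is semantically identical to what Scopus does for chained proximity operators.

```python
from itertools import permutations, product


def expand_chain(terms, n):
    """One clause per ordering of the terms and per combination of gaps 1..n."""
    distances = range(1, n + 1)
    clauses = []
    for ordering in permutations(terms):
        for gaps in product(distances, repeat=len(terms) - 1):
            pieces = [ordering[0]]
            for gap, term in zip(gaps, ordering[1:]):
                pieces.append(f"<{gap}>")
                pieces.append(term)
            clauses.append("(" + " ".join(pieces) + ")")
    return clauses


clauses = expand_chain(["A", "B", "C"], 3)
print(len(clauses))
print(clauses[0])
```

For k terms and a window of n, this scheme yields k! times n to the power (k - 1) clauses, which is where the 6 x 9 = 54 comes from.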

Finally, I end up not using the more advanced features of Postgres text search, and in fact I have to force it to "simple" mode in order to make the Scopus wildcards work. Postgres would like to help me with this, stemming words in the documents and queries for me, ignoring stop words, and leveraging a built-in thesaurus for synonyms.

But the way the Scopus queries are given here defeats this. For example, eradicat* occurs in the Scopus queries, but since that isn't a known word, Postgres doesn't know how to stem it. So unless I force it to simple mode, a document with the word eradicate or eradication will not match this query, because Postgres will have stemmed the valid word in the document but not the term in the query...

I can imagine, although it will be a lot of work, enriching the Aurora XML with a few valid expansions of the wild-carded terms. That way I can use those in my Postgres queries and have it do its magic.
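As a rough idea of how such expansions could be used, assuming they were available as a simple mapping: the sketch below replaces a wild-carded term with an OR group of real words that Postgres can stem. The `EXPANSIONS` table and `expand_wildcard` are hypothetical; the two expansions are just the example words mentioned above, and `|` is the OR operator inside a `to_tsquery` string.

```python
# ASSUMPTION: in practice this mapping would come from the enriched XML.
EXPANSIONS = {"eradicat*": ["eradicate", "eradication"]}


def expand_wildcard(term, expansions=EXPANSIONS):
    """Replace a wild-carded term with an OR group of its known expansions."""
    words = expansions.get(term)
    if words is None:
        # No expansion known: leave the term untouched.
        return term
    return "(" + " | ".join(words) + ")"


print(expand_wildcard("eradicat*"))
print(expand_wildcard("poverty"))
```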

Anyway, I have to get on with the next phase and unfortunately can't linger on these issues. I just thought I'd mention these observations while they are fresh in my mind. If I have a bit more time and you are interested, I might revisit this and try to come up with some more constructive / concrete proposals.

(*) Incidentally, Scopus also has a directed variant, PRE/N, and I wonder if some of the Aurora queries would be more precisely expressed that way.

Aurora-Network-Global / sdg-queries

copy/paste of queries at the bottom of the query_SDGX.xml pages into Scopus fails #6