MathWebSearch / mws

MathWebSearch Implementation
https://search.mathweb.org/
GNU General Public License v3.0
46 stars 12 forks source link

Error in MWS unification? #93

Closed kohlhase closed 9 years ago

kohlhase commented 9 years ago

I have been playing with the new OEIS Search engine and its seems that unification does not work correctly any more: The query ?x+?y=?y+?x yields the result 6^2+7^2=9^2+2^2 (no unique value for ?x and ?y). You can play with this yourself: http://ash.eecs.jacobs-university.de:9999/?query-text=&query-math=%3Fx%2B%3Fy%3D%3Fy%2B%3Fx#1

Raduh commented 9 years ago

I investigated this and it seems to be an error in the parser. Instead of being translated as <cn>7</cn>, numbers are translated as <lit value="7"/>

kohlhase commented 9 years ago

ah, and <lit ... /> = <lit .../>?

So that is probably something that @m-iancu should know about.

kohlhase commented 9 years ago

but it is good to know that I did not find a unification error.

Raduh commented 9 years ago

Yes, because attributes don't influence the value of a token.

m-iancu commented 9 years ago

It has to do with MMT literals. We don't have OMI/OMB/OMSTR/OMF as primitive concepts in MMT anymore but rather as instances of Literals of type integer/byte/string/float. The content mathml serialization is, at the moment, <lit value="val" type="type"/> so to MWS all literal nodes look identical (and hence unify). I guess that part is wrong, I don't even think that is valid MathML. In any case it's straightforward to fix it for the few main cases, but I'm not sure what's the best way to do it generically. Maybe @florian-rabe can correct me/explain more.

kohlhase commented 9 years ago

Ah, I understand. the <lit...> is not proper MathML, so at least in this case, i.e. for MWS which wants MathML we want to postprocess <lit val='foo' type='type'/> to <cn type='type'>foo</cn> during the harvest, if type is one of the number types int, real, float,...

kohlhase commented 9 years ago

here is @dginev's take for the record Greetings from an interested third-party I'll chip in, with just my own opinion.

On 07/27/2015 11:23 AM, Hambasan, Radu wrote:

This was not a unification error. I followed up with Mihnea and apparently MMT now generates instead of 5 for numbers in Content MathML.

<lit> isn't content MathML. It should never reach MWS.

Hard coupling between systems kills projects, obeying open standards makes them a healthy part of a much larger ecosystem.

I think the best way of solving this would be in MMT since MWS should preseve backwards compatibility and the other indexes also use cn.

Thumbs up, when MMT is integrating with a Content MathML search engine, it should be exporting standards-compliant Content MathML for MWS to index.

m-iancu commented 9 years ago

The underlying problem is that MMT is now more expressive than MathML in that regard so its not clear what to produce in some cases. After talking to @florian-rabe we reached the conclusion that for practical purposes we can use cn and use the type attribute to put the literal URI in there. i.e. <cn type="MMT URI"> value </cn>.

kohlhase commented 9 years ago

I do not think that this is the right way to go MathML3 specifies the values of type in strict content MathML as type Attribute Values "integer" | "real" | "double" | "hexdouble".

I think that we should do the followign:

<cbytes encoding='MMT Literal' definitionURL='MMT URI'>base64encoded</cbytes>

that is much closer to the intent of MathML. BUT that being said, <cbytes> should never reach MathWebSearch.

m-iancu commented 9 years ago

That would be correct way but doesn't help if we have a literal for e.g. natural numbers, and we want to get results from integers or reals with the same value in MWS. With this encoding we wouldn't because cbytes and cn are different. We can maybe use the same attribute (definitionURL) for cn instead. I didn't quite understand its semantics from the MathML standard though (its not really explained for cn/cs/... But it does have it in non-strict: http://www.w3.org/TR/MathML3/chapter4.html#contm.cn

kohlhase commented 9 years ago

That would be correct way but doesn't help if we have a literal for e.g. natural numbers, and we want to get results from integers or reals with the same value in MWS.

that is thy I said "coerced into one of the three" so <lit type='nat' value='3'/> would become <cn type='integer'>3</cn> and <lit type='int' value='3'/> the same. And <lit type='float' value='3'/> would become <cn type='real'>3</cn> which I guess MWS would identify with the integer version (attributes do not matter).

Any literal that cannot be coerced into a <cn> should not be unifiable with a <cn>.

kohlhase commented 9 years ago

We can maybe use the same attribute (definitionURL) for cn instead.

that is a very good idea. We should do that.

kohlhase commented 9 years ago

I didn't quite understand its semantics from the MathML standard though (its not really explained for cn/cs/... But it does have it in non-strict: http://www.w3.org/TR/MathML3/chapter4.html#contm.cn

Indeed, maybe <cs> is better than <cbytes>, then we do not have to base64encode. Then we can use the encoding attribute for the type.

m-iancu commented 9 years ago

Ok so should we use ? :

  1. For numbers e.g. <cn type="integer" encoding="mmt-lit" definitionURL="http://cds.omdoc.org/literals/naturals"> 3 </cn>
  2. For strings/default e.g. <cs encoding="mmt-lit" definitionURL="http://cds.omdoc.org/literals/some-data"> foo </cs>
  3. Or maybe cbyte for default case

In the current implementation we can use cn for default since we only have numbers anyway

kohlhase commented 9 years ago

yes, that makes sense.