MathWebSearch / mws

MathWebSearch Implementation
https://search.mathweb.org/
GNU General Public License v3.0

Cleanup json document generation #46

Closed: cprodescu closed this issue 10 years ago

cprodescu commented 10 years ago

JSON document generation should follow this pattern:

{
  "./data/corpus/ph-alp01.html" : {
    "metadata" : {
      "title" : "The principles of foo"
    },
    "mws_ids" : [1, 2],
    "mws_id" : {
      "1" : {
        "m2.1" : {"xpath" : "/1/2/" },
        "m2.3" : {"xpath" : "/2322/232" }
      },
      "2" : {
        "m2.2" : {"xpath" : "/2/2/3"}
      }
    },
    "math" : {
      "m2.1" : "<math id=\"m2.1\" display=\"inline\"><semantics>...",
      "m2.2" : "<math ..."
    },
    "text" : "A relatively small change in #m2.1 would lead to ..."
  }
}
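A minimal sketch of how one per-document entry in this layout could be assembled. It is written in Python for illustration only and is not the mws implementation. The function name `build_document_entry` and the input shape (a list of `(mws_id, math_id, xpath, mathml)` tuples) are assumptions, and the MathML strings are placeholders.

```python
import json
from collections import defaultdict


def build_document_entry(title, text, formulas):
    """Assemble one per-document entry in the proposed layout.

    `formulas` is a hypothetical input shape: an iterable of
    (mws_id, math_id, xpath, mathml) tuples for a single document.
    """
    by_mws_id = defaultdict(dict)
    math = {}
    for mws_id, math_id, xpath, mathml in formulas:
        # Group formula occurrences under the MWS id they belong to.
        by_mws_id[str(mws_id)][math_id] = {"xpath": xpath}
        math[math_id] = mathml
    return {
        "metadata": {"title": title},
        # Ids are collected per document, not per harvest.
        "mws_ids": sorted(int(key) for key in by_mws_id),
        "mws_id": dict(by_mws_id),
        "math": math,
        "text": text,
    }


# Placeholder data mirroring the sample above; the MathML is elided.
formulas = [
    (1, "m2.1", "/1/2/", "<math ..."),
    (1, "m2.3", "/2322/232", "<math ..."),
    (2, "m2.2", "/2/2/3", "<math ..."),
]
entry = build_document_entry(
    "The principles of foo",
    "A relatively small change in #m2.1 would lead to ...",
    formulas,
)
document = {"./data/corpus/ph-alp01.html": entry}
print(json.dumps(document, indent=2))
```

Keying the top-level object by document path keeps `mws_ids`, `mws_id`, `math` and `text` scoped to a single document.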

Currently, it looks more like this: https://gist.github.com/cprodescu/98265e8b6d6f487aef08

Issues: all mws ids are gathered per harvest rather than per document, and mws_id, math and text are not emitted as separate elements.
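The two issues could be caught with a small check on each generated entry, sketched here in Python. The helper name `check_document_entry` is hypothetical. The rule that `mws_ids` must list exactly the keys of `mws_id` is inferred from the sample and is not stated anywhere as a requirement.

```python
REQUIRED_KEYS = {"metadata", "mws_ids", "mws_id", "math", "text"}


def check_document_entry(entry):
    """Return a list of problems found in one per-document entry."""
    missing = REQUIRED_KEYS - set(entry)
    if missing:
        # mws_id, math and text must each be a separate element.
        return ["missing elements: " + ", ".join(sorted(missing))]
    problems = []
    listed = {str(mws_id) for mws_id in entry["mws_ids"]}
    if listed != set(entry["mws_id"]):
        # Assumed rule: ids listed for a document are exactly the ones
        # indexed for it, so harvest-wide ids would show up here.
        problems.append("mws_ids does not match the keys of mws_id")
    return problems


good = {
    "metadata": {"title": "The principles of foo"},
    "mws_ids": [1, 2],
    "mws_id": {"1": {"m2.1": {"xpath": "/1/2/"}}, "2": {"m2.2": {"xpath": "/2/2/3"}}},
    "math": {"m2.1": "<math ...", "m2.2": "<math ..."},
    "text": "A relatively small change in #m2.1 would lead to ...",
}
# Same entry, but with an id that belongs to another document of the harvest.
harvest_wide = dict(good, mws_ids=[1, 2, 3])

print(check_document_entry(good))
print(check_document_entry(harvest_wide))
```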