Closed by cprodescu 10 years ago
JSON document generation should follow this pattern:
    {
      "./data/corpus/ph-alp01.html" : {
        "metadata" : { "title" : "The principles of foo" },
        "mws_ids" : [1, 2],
        "mws_id" : {
          "1" : {
            "m2.1" : { "xpath" : "/1/2/" },
            "m2.3" : { "xpath" : "/2322/232" }
          },
          "2" : {
            "m2.2" : { "xpath" : "/2/2/3" }
          }
        },
        "math" : {
          "m2.1" : "<math id=\"m2.1\" display=\"inline\"><semantics>...",
          "m2.2" : "<math ..."
        },
        "text" : "A relatively small change in #m2.1 would lead to ..."
      }
    }
Currently, the output looks more like this: https://gist.github.com/cprodescu/98265e8b6d6f487aef08
Issues: all mws ids are gathered per harvest rather than per document, and mws_id, math and text are not emitted as separate elements.
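To make the requested layout concrete, here is a minimal Python sketch that groups everything per document rather than per harvest. It is not the project's actual generator: the function name `build_document` and the shape of the `formulas` input records (`mws_id`, `math_id`, `xpath`, `mathml`) are assumptions made up for illustration, and the MathML strings are placeholders.

```python
import json
from collections import defaultdict


def build_document(path, title, text, formulas):
    """Build one per-document entry keyed by the document path.

    `formulas` is a hypothetical list of dicts with the keys
    "mws_id", "math_id", "xpath" and "mathml" (an assumed input shape).
    """
    by_mws_id = defaultdict(dict)  # mws id -> {math id -> {"xpath": ...}}
    math = {}                      # math id -> MathML source
    for f in formulas:
        by_mws_id[str(f["mws_id"])][f["math_id"]] = {"xpath": f["xpath"]}
        math[f["math_id"]] = f["mathml"]
    return {
        path: {
            "metadata": {"title": title},
            # ids are collected for this document only, not the whole harvest
            "mws_ids": sorted({f["mws_id"] for f in formulas}),
            "mws_id": dict(by_mws_id),
            "math": math,
            "text": text,
        }
    }


if __name__ == "__main__":
    doc = build_document(
        "./data/corpus/ph-alp01.html",
        "The principles of foo",
        "A relatively small change in #m2.1 would lead to ...",
        [
            {"mws_id": 1, "math_id": "m2.1", "xpath": "/1/2/", "mathml": "<math ...>"},
            {"mws_id": 1, "math_id": "m2.3", "xpath": "/2322/232", "mathml": "<math ...>"},
            {"mws_id": 2, "math_id": "m2.2", "xpath": "/2/2/3", "mathml": "<math ...>"},
        ],
    )
    print(json.dumps(doc, indent=2))
```

Keeping `mws_id`, `math` and `text` as separate keys under the document path is what distinguishes this from the current output, where they are not separate elements.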