Open BillPascoe opened 3 days ago
@MufengNiu I figure we should address this bug as soon as we can.
The highlighting from TLCMap is working correctly. However, the geoparsing from the text map doesn’t seem to return the correct indexes when the content contains some special characters.
For example, in this text: [screenshot: Example]
The ellipsis (...) is being counted as four characters instead of three by the geoparser. This results in incorrect indexing for the places that follow it.
Below is an example from the geoparsing response.
In it, the offset (start index) for "Australia" is reported as 76 but should be 75, and its sentence_start_index is reported as 58 but should be 57. Because of this issue, the place names are not marked with the correct indexes, leading to alignment problems.
{
  "status": "success",
  "data": {
    "type": "ExtractionResults",
    "place_names": [
      {
        "type": "PlaceName",
        "name": "sydney",
        "text_position": {
          "line": 0,
          "word": 7,
          "offset": 38,
          "sentence_start_index": 0,
          "sentence_end_index": 57
        },
        "context": "and sincerely trust that they will be sydney reprinted..."
      },
      {
        "type": "PlaceName",
        "name": "australia",
        "text_position": {
          "line": 0,
          "word": 11,
          "offset": 76,
          "sentence_start_index": 58,
          "sentence_end_index": 128
        },
        "context": "The aborigines of Australia are fast dying out.sydney sydney australia"
      },
      {
        "type": "PlaceName",
        "name": "sydney",
        "text_position": {
          "line": 0,
          "word": 16,
          "offset": 105,
          "sentence_start_index": 58,
          "sentence_end_index": 128
        },
        "context": "The aborigines of Australia are fast dying out.sydney sydney australia"
      },
      {
        "type": "PlaceName",
        "name": "sydney",
        "text_position": {
          "line": 0,
          "word": 16,
          "offset": 112,
          "sentence_start_index": 58,
          "sentence_end_index": 128
        },
        "context": "The aborigines of Australia are fast dying out.sydney sydney australia"
      },
      {
        "type": "PlaceName",
        "name": "australia",
        "text_position": {
          "line": 0,
          "word": 11,
          "offset": 119,
          "sentence_start_index": 58,
          "sentence_end_index": 128
        },
        "context": "The aborigines of Australia are fast dying out.sydney sydney australia"
      }
    ]
  }
}
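To make the mismatch in the response above easy to reproduce, here is a minimal Python sketch that compares each reported offset with the nearest real occurrence of the place name. The helper name `offset_drift` is hypothetical, and the input text is an assumption: it is reconstructed by joining the two `context` strings with no separator, which is consistent with the expected values of 57 and 75 given above but is not the original upload.

```python
def offset_drift(text, name, reported_offset):
    """Return reported_offset minus the nearest true start of `name` in `text`.

    Matching is case-insensitive because the response lowercases names.
    Returns None if the name does not occur at all.
    """
    haystack = text.lower()
    needle = name.lower()
    starts = []
    pos = haystack.find(needle)
    while pos != -1:
        starts.append(pos)
        pos = haystack.find(needle, pos + 1)
    if not starts:
        return None
    nearest = min(starts, key=lambda s: abs(s - reported_offset))
    return reported_offset - nearest


# ASSUMPTION: text rebuilt from the two "context" fields, joined without a space.
text = (
    "and sincerely trust that they will be sydney reprinted..."
    "The aborigines of Australia are fast dying out.sydney sydney australia"
)

# (name, offset) pairs copied from the geoparsing response.
reported = [
    ("sydney", 38),
    ("australia", 76),
    ("sydney", 105),
    ("sydney", 112),
    ("australia", 119),
]

drifts = [offset_drift(text, name, offset) for name, offset in reported]
print(drifts)
```

A drift of 0 means the offset points at the place name; a positive drift means the reported offset is that many characters too far along. Under the stated assumption, only the place name before the ellipsis has zero drift.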
I uploaded and parsed Petrie's Reminiscences (from Gutenberg) and created a layer. The resulting text map highlights place names incorrectly. The first few occurrences are correct, but the first occurrence of 'Brisbane' does not have the 'B' highlighted. After that, some highlights are out by 4 characters, and lower down in the document they are completely wrong. Each link does point to the correct place on the map, so it appears that the parsing and geolocating are correct and the only problem is with highlighting the text, which suggests an issue with the line and character numbers. The error is cumulative, as if a recurring glitch puts the indexing out by a character each time it happens.
https://test-views.tlcmap.org/dev/textmap.html?load=https%3A%2F%2Ftest-ghap.tlcmap.org%2Flayers%2F1421%2Fjson%3Ftextmap%3Dtrue
PetriesReminiscences.txt
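Since the error looks cumulative, one way to narrow it down is to find the stretches of text where the drift changes. The sketch below is a rough diagnostic, not a fix. `find_drift_jumps` is a hypothetical helper, and it assumes the geoparser reports every occurrence of each place name in document order (if it skips one, the sequential matching goes out of step). The sample text is the same assumed reconstruction from the `context` fields of the response, not the Petrie file.

```python
def find_drift_jumps(text, reported):
    """Locate the text spans where the offset drift changes.

    `reported` is a list of (name, reported_offset) in document order.
    Returns a list of (previous_drift, new_drift, text_between), where
    text_between is the source text between the previous place name and
    the one at which the drift changed.
    """
    lowered = text.lower()
    jumps = []
    cursor = 0        # where to resume searching for the next name
    prev_drift = 0
    prev_end = 0      # end of the previously matched place name
    for name, offset in reported:
        actual = lowered.find(name.lower(), cursor)
        if actual == -1:
            continue  # name not found after the cursor; skip it
        drift = offset - actual
        if drift != prev_drift:
            jumps.append((prev_drift, drift, text[prev_end:actual]))
        prev_drift = drift
        prev_end = actual + len(name)
        cursor = prev_end
    return jumps


# ASSUMPTION: text rebuilt from the two "context" fields, joined without a space.
sample_text = (
    "and sincerely trust that they will be sydney reprinted..."
    "The aborigines of Australia are fast dying out.sydney sydney australia"
)
sample_reported = [
    ("sydney", 38),
    ("australia", 76),
    ("sydney", 105),
    ("sydney", 112),
    ("australia", 119),
]

for before, after, between in find_drift_jumps(sample_text, sample_reported):
    print(f"drift {before} -> {after} somewhere in: {between!r}")
```

Running the same helper over PetriesReminiscences.txt and the layer's reported offsets should list each span in which the indexing slips, which would show whether the trigger is always the same kind of character.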