CDRH / datura

Datura is a Ruby gem that manages data (TEI-XML, CSVs, VRA-XML, etc.) and populates Solr / Elasticsearch instances. Datura also generates HTML for these formats so the contents can be served on the web.

text is squishing together #179

Closed: karindalziel closed this issue 2 years ago

karindalziel commented 3 years ago

Some text dumped in the text field is squishing together like:

this is the titleThis is the body text

We need to make sure that spaces are always added between different XML selections.

related to https://github.com/CDRH/chesnutt/issues/87

jduss4 commented 3 years ago

I thought there was already an issue for this somewhere, but I'm not finding it. Anyhoo, you can temporarily sub in the code for how it's being handled for ~Whitman~ Cather, but long-term we ought to decide between two options: keep dumping in text, knowing that words will be squished together wherever there are no spaces <tag>between</tag><tag>tags</tag>, or switch datura to more intensive processing that adds spaces between each element regardless.

https://github.com/Willa-Cather-Archive/data_cather/blob/4e068902b9b2fb227a44589e8d6d697c01f409d7/scripts/overrides/tei_to_es.rb#L6

It is not elegant, but it's basically going through every node in the tree and asking if it's a text node. If so, the node's text gets added to an array, and the array is joined with spaces at the end.

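def text
  # Start with whatever text_additional returns
  all_text = []
  all_text += text_additional
  text_eles = @xml.xpath(@xpaths["text"])
  text_eles.each do |t|
    # Visit every node under each match and keep only the text nodes
    t.traverse do |node|
      if node.class == Nokogiri::XML::Text
        all_text << CommonXml.normalize_space(node.text)
      end
    end
  end
  # Joining with a space keeps text from adjacent elements from running together
  all_text.join(" ")
end
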
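For anyone who wants to see the difference outside of datura, here is a minimal standalone sketch of the same idea. It is an illustration only, not datura's or Cather's code: it swaps in REXML (shipped with standard Ruby installs) for Nokogiri so it runs without extra gems, `collect_text` is a hypothetical helper name, and the sample XML is made up to match the example at the top of this issue. Unlike the override above, it also drops whitespace-only text nodes.

```ruby
require "rexml/document"

# Hypothetical helper: walk the tree under `node` and gather the
# whitespace-normalized content of every text node, in document order.
def collect_text(node, acc = [])
  node.each do |child|
    if child.is_a?(REXML::Text)
      normalized = child.value.gsub(/\s+/, " ").strip
      acc << normalized unless normalized.empty?
    elsif child.is_a?(REXML::Element)
      collect_text(child, acc)
    end
  end
  acc
end

# Made-up sample with no whitespace between the two elements
xml = "<doc><title>this is the title</title><body>This is the body text</body></doc>"
root = REXML::Document.new(xml).root

puts collect_text(root).join      # this is the titleThis is the body text
puts collect_text(root).join(" ") # this is the title This is the body text
```

The only difference between the two outputs is the separator passed to `join`, which is why the override fixes the squishing.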
karindalziel commented 2 years ago

Traversing back through the datura methods, I wonder if I need to make a change in this method: https://github.com/CDRH/datura/blob/29f79ef29e414637a7c38b2991f56129c63f13d3/lib/datura/to_es/xml_to_es.rb#L73

When I try to add a space using a text method, it changes the class to a String rather than a Nokogiri object, which breaks the methods up the chain. Plus, if I added such a thing here, it would also add spaces to all the keyword fields, which we don't want.

karindalziel commented 2 years ago

also see: https://github.com/CDRH/chesnutt/issues/87

wkdewey commented 2 years ago

I don't see the issue on chesnutt. The text fields on the items here https://cdrhdev1.unl.edu/chesnutt/search?utf8=%E2%9C%93&q=The+Wife+of+His+Youth seem to have normal spacing, both on the site and in the API.

karindalziel commented 2 years ago

Will's solution (#186) works for the majority of use cases we run into, but I am finding some instances where it breaks. I'm just not sure of the solution, except that we may need to overwrite this method for some projects.

The issue appears when a single word is broken up by XML markup:

https://github.com/CDRH/data_test/blob/main/source/tei/wfc.bsn00032.xml

W<hi rend="sup">m</hi>

This appears in the index as "W m".
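A small reproduction of that case, as a sketch rather than the actual datura code path: it uses REXML instead of Nokogiri so it runs on a stock Ruby install, `text_pieces` is a hypothetical helper name, and the wrapping `<p>` element is assumed for the sake of a well-formed fragment.

```ruby
require "rexml/document"

# Hypothetical helper: collect the normalized content of every text node
# under `node`, in document order, skipping whitespace-only nodes.
def text_pieces(node, pieces = [])
  node.each do |child|
    case child
    when REXML::Text
      piece = child.value.gsub(/\s+/, " ").strip
      pieces << piece unless piece.empty?
    when REXML::Element
      text_pieces(child, pieces)
    end
  end
  pieces
end

# The <hi> element splits one word into two text nodes: "W" and "m"
fragment = '<p>W<hi rend="sup">m</hi></p>'
root = REXML::Document.new(fragment).root

puts text_pieces(root).join(" ") # W m
```

Because every text node is joined with a space, the two halves of the word end up separated.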

I think this is rare enough that if we have to choose one solution or another the way Will's PR fixes it is preferable.

[Image: Screen Shot 2021-12-10 at 12 56 48 PM]

techgique commented 2 years ago

I think the rare cases where a word is split by encoding will negatively affect fewer words in search than the current behavior does, so I'd vote for going with Will's fix and seeking remedies for the outliers with mid-word encoding in a separate issue later.
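If that follow-up issue happens, one possible direction (purely a sketch, not something decided in this thread or implemented in datura) would be to treat a short list of inline elements as non-breaking and only add spaces around everything else. In the sketch below, `INLINE_ELEMENTS`, `spaced_text`, and `index_text` are all hypothetical names, the list containing only `hi` is an assumption, and REXML stands in for Nokogiri so the example runs on a stock Ruby install.

```ruby
require "rexml/document"
require "set"

# Assumption: elements in this set may sit in the middle of a word.
INLINE_ELEMENTS = Set.new(%w[hi]).freeze

# Hypothetical helper: append text in document order, padding every
# element with spaces unless it is listed as inline.
def spaced_text(node, out = +"")
  node.each do |child|
    case child
    when REXML::Text
      out << child.value
    when REXML::Element
      inline = INLINE_ELEMENTS.include?(child.name)
      out << " " unless inline
      spaced_text(child, out)
      out << " " unless inline
    end
  end
  out
end

# Collapse the padding so only single spaces remain.
def index_text(root)
  spaced_text(root).gsub(/\s+/, " ").strip
end

xml = '<doc><title>this is the title</title><body>W<hi rend="sup">m</hi> wrote this</body></doc>'
puts index_text(REXML::Document.new(xml).root) # this is the title Wm wrote this
```

The trade-off is that someone has to maintain the inline list per project, which is in line with the earlier point that some projects may need to override the method.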

wkdewey commented 2 years ago

Fixed by my PR