RoamJS / workbench

https://roamjs.com/extensions/workbench
The Unlicense

Accounting for new lines in OCR feature #472

Open mattakamatsu opened 5 months ago

mattakamatsu commented 5 months ago

The OCR feature is terrific, with one exception: whenever the recognized text wraps onto a new line, the output drops the space between the last word of one line and the first word of the next. For example:

tilted at +10-20 degrees.Based on the degree of invagination, CCSs were classified into threecategories.

Can we insert a space between words that are separated by a line break? I asked GPT-4 how to do this, and here's what it suggested:

// Inside the tesseractImage.onload = async () => { ... }

const {
  data: { text },
} = await worker.recognize(canvas);
await worker.terminate();

const textBullets = text.split("\n");
const bullets = [];
let currentText = "";
for (let b = 0; b < textBullets.length; b++) {
  const s = textBullets[b].trim(); // Remove leading and trailing whitespace
  if (s) {
    if (currentText && !currentText.endsWith("-")) {
      // Add a space when joining wrapped lines, including after sentence punctuation
      // (otherwise "degrees." + "Based" still becomes "degrees.Based").
      // A trailing hyphen is left alone because it likely marks a word split across lines.
      currentText += " ";
    }
    currentText += s;
  } else if (currentText) {
    // A blank line marks a paragraph break: push the accumulated text as one bullet and reset
    bullets.push(
      currentText.startsWith("* ") ||
      currentText.startsWith("- ") ||
      currentText.startsWith("— ")
        ? currentText.substring(2)
        : currentText
    );
    currentText = "";
  }
}
if (currentText) {
  // Ensure any remaining text is also pushed into bullets
  bullets.push(
    currentText.startsWith("* ") ||
    currentText.startsWith("- ") ||
    currentText.startsWith("— ")
      ? currentText.substring(2)
      : currentText
  );
}

// The rest of your logic to create blocks from bullets remains unchanged.
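
The same idea can be tried outside the extension as a small standalone function. This is only a sketch under a few assumptions: `joinOcrLines` and `stripBulletMarker` are hypothetical names that do not come from the workbench source, blank lines are treated as paragraph breaks, and a line ending in a hyphen is assumed to be a word split across lines.

```javascript
// Hypothetical helpers, not part of the workbench source.

// Remove a leading "* ", "- " or "— " bullet marker, if present.
const stripBulletMarker = (s) => s.replace(/^[*\-—] /, "");

// Join wrapped OCR lines into one string per paragraph.
// Blank lines are treated as paragraph breaks (an assumption).
function joinOcrLines(text) {
  const bullets = [];
  let current = "";
  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (line) {
      if (current) {
        // Always separate wrapped lines with a space, even after "." or ",".
        // The only exception is a trailing hyphen, assumed to be a split word.
        current += current.endsWith("-") ? "" : " ";
      }
      current += line;
    } else if (current) {
      bullets.push(stripBulletMarker(current));
      current = "";
    }
  }
  if (current) bullets.push(stripBulletMarker(current));
  return bullets;
}

const sample =
  "tilted at +10-20 degrees.\n" +
  "Based on the degree of invagination, CCSs were classified into three\n" +
  "categories.\n" +
  "\n" +
  "- second paragraph";
console.log(joinOcrLines(sample));
```

With the sample above, the first bullet keeps the space after "degrees." and between "three" and "categories.", and the second bullet loses its leading "- " marker.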