gamemaker1 / office-text-extractor

Yet another library to extract text from MS Office and PDF files
https://npm.im/office-text-extractor
ISC License

Cannot read .doc file #10

Open abedshaaban opened 9 months ago

abedshaaban commented 9 months ago

Description

An error occurred when reading a .doc file.

Error: text-extractor: could not find a method to handle application/x-cfb

I looked into the code and the type declaration application/x-cfb is not included in the MimeType or doc in FileExtension.

Library version

3.0.2

Node version

20.9.0

Typescript version (if you are using it)

No response

gamemaker1 commented 9 months ago

Hi,

office-text-extractor uses mammoth under the hood to parse ms word files.

mammoth does not support extracting text from .doc files, only .docx files.

I tried to write an extractor for it myself, but I was not able to extract the XML contents from the .doc file. Here is the code, if you want to play with it:

// source/parsers/docx.ts
// The text extractor for DOCX/DOC files.

import { type Buffer } from 'node:buffer'
import { extractRawText as parseWordFile } from 'mammoth'
import { unzip } from 'fflate'
import { parseStringPromise as xmlToJson } from 'xml2js'
import encoding from 'text-encoding'

import type { TextExtractionMethod } from '../lib.js'

export class DocExtractor implements TextExtractionMethod {
    /**
     * The type(s) of input acceptable to this method.
     */
    mimes = [
        'application/x-cfb',
        'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
    ]

    /**
     * Extract text from a DOCX/DOC file if possible.
     *
     * @param input The buffer containing the file.
     * @returns The text extracted from the input.
     */
    apply = async (input: Buffer): Promise<string> => {
        try {
            // Convert the DOCX to text and return the text.
            const parsedDocx = await parseWordFile({ buffer: input })
            return parsedDocx.value
        } catch (caughtError: unknown) {
            // If the file is a DOC file, then JSZIP will fail to unzip it.
            const error = caughtError as Error
            if (error.message?.includes('Corrupted zip or bug')) {
                const contents = await unzipBuffer(input)
                const json = await xmlToJson(contents)
                const lines = await parseDocSection(json)

                // Fall back to an empty string if no text was collected.
                const formattedText = lines?.join('\n') ?? ''
                return formattedText
            } else {
                // If it is not a DOC file, let the error propagate.
                throw caughtError
            }
        }
    }
}

/**
 * Unzip a DOC file, and return the XML in it.
 *
 * @param input The buffer containing the file.
 *
 * @returns The XML, decoded as a UTF-8 string.
 */
const unzipBuffer = async (input: Buffer): Promise<string> => {
    // View exactly this buffer's bytes. `input.buffer` alone can be a larger,
    // pooled ArrayBuffer that also holds unrelated data.
    const zipBuffer = new Uint8Array(input.buffer, input.byteOffset, input.byteLength)
    const doc = await new Promise<Record<string, Uint8Array>>((resolve, reject) => {
        unzip(zipBuffer, (error, result) => {
            if (error) reject(error)
            else resolve(result)
        })
    })

    const file = doc['word/document.xml']
    if (!file) throw new Error('Invalid .doc file, could not find document.xml.')

    // fflate returns raw bytes, so decode them before handing them to xml2js.
    return new TextDecoder().decode(file)
}

/**
 * Extracts text from a section of the document, recursively.
 *
 * @param docSection The section of the doc, converted to JSON from XML.
 * @param collectedText The lines of text parsed from the document so far.
 *
 * @returns The lines of text in the document.
 */
const parseDocSection = async (
    docSection: any,
    collectedText?: string[],
): Promise<string[] | undefined> => {
    // Keep track of the text being collected.
    const beingCollectedText = collectedText ?? []

    // Parse the section according to what type it is.
    if (Array.isArray(docSection)) {
        // If it is, loop through the elements of the array.
        for (const element of docSection) {
            // Collect all the pieces of text from the array.
            if (typeof element === 'string' && element !== '') {
                beingCollectedText.push(element)
            } else {
                // However, if it is an object or another array, call this function
                // again to parse that.
                await parseDocSection(element, beingCollectedText)
            }
        }

        // Finally, return the collected text.
        return beingCollectedText
    }

    // If the section is an object, loop through its properties.
    if (typeof docSection === 'object' && docSection !== null) {
        for (const property of Object.keys(docSection)) {
            // Get the value of the property.
            const value = docSection[property]

            // The `docx` format stores the actual text inside the `w:t` or `_`
            // properties, so extract text from those properties.

            // Check if it is a string or array that contains a string. If it is
            // either, then collect the text content.
            if (typeof value === 'string') {
                if ((property === 'w:t' || property === '_') && value !== '') {
                    beingCollectedText.push(value)
                }
            } else if (typeof value?.[0] === 'string') {
                if ((property === 'w:t' || property === '_') && value[0] !== '') {
                    beingCollectedText.push(value[0])
                }
            } else {
                // However, if it is an object or another array, call this function
                // again to parse that.
                await parseDocSection(value, beingCollectedText)
            }
        }

        // Finally, return the collected text.
        return beingCollectedText
    }
}

The unzip library, fflate, throws the following error:

Error {
  code: 14,
  message: 'unknown compression type 2346',
}

If you can fix it or work around it in any way, please do let me know!
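
A side note on converting a Node Buffer to a Uint8Array, as unzipBuffer does: `input.buffer` is the whole backing ArrayBuffer, which for small or sliced buffers can be larger than the Buffer itself. This is probably not the cause of the fflate error, but it is worth ruling out. A minimal sketch, where toExactView is a hypothetical helper name and not part of the library:

```typescript
import { Buffer } from 'node:buffer'

// Hypothetical helper: view exactly the bytes of `input`, without copying.
const toExactView = (input: Buffer): Uint8Array =>
    new Uint8Array(input.buffer, input.byteOffset, input.byteLength)

// Small buffers are usually carved out of a shared pool in Node.
const small = Buffer.from('PK')

// This views the entire backing ArrayBuffer, not just the two bytes above.
const whole = new Uint8Array(small.buffer)

// This views only the bytes that belong to `small`.
const exact = toExactView(small)

console.log(whole.byteLength >= exact.byteLength) // true
console.log(exact.byteLength) // 2
```

The view shares memory with the original Buffer, so nothing is copied.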

Siddharth-Latthe-07 commented 1 month ago

@abedshaaban Possible solutions for the error:

  1. Upgrade the library (recommended): Check whether a newer version of the library you're using supports CFB-formatted .doc files, since a later release may include the necessary functionality. Refer to the library's documentation for compatibility information.

  2. Use a different library: If upgrading isn't feasible, consider switching to a library designed to handle CFB-formatted .doc files. Options include docx (a pure JavaScript library for reading and manipulating Microsoft Word documents) and js-ole (parses various OLE2 formats, including CFB).

  3. Convert the .doc file: As a workaround, convert the .doc file to a more widely supported format such as .docx before processing it, using an online conversion tool or a command-line tool such as LibreOffice in headless mode with its --convert-to option.
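
The conversion route in option 3 could be scripted roughly as follows. This is only a sketch: it assumes LibreOffice is installed and its `soffice` binary is on the PATH, and buildConvertArgs, docxPathFor, and convertDocToDocx are hypothetical helper names, not part of any library mentioned here.

```typescript
import { execFile } from 'node:child_process';
import { basename, extname, join } from 'node:path';
import { promisify } from 'node:util';

const run = promisify(execFile);

// Build the arguments for a headless LibreOffice conversion to DOCX.
const buildConvertArgs = (inputPath: string, outDir: string): string[] => [
  '--headless',
  '--convert-to',
  'docx',
  '--outdir',
  outDir,
  inputPath,
];

// LibreOffice keeps the input's base name and swaps the extension.
const docxPathFor = (inputPath: string, outDir: string): string =>
  join(outDir, `${basename(inputPath, extname(inputPath))}.docx`);

// Convert a .doc file and return the path of the resulting .docx file.
// Assumption: `soffice` is installed and resolvable from the PATH.
const convertDocToDocx = async (inputPath: string, outDir: string): Promise<string> => {
  await run('soffice', buildConvertArgs(inputPath, outDir));
  return docxPathFor(inputPath, outDir);
};
```

The resulting .docx file can then be read into a buffer and passed to the existing DOCX extractor.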

Hope this helps. Thanks!

Siddharth-Latthe-07 commented 1 month ago

@gamemaker1 The error you're encountering with the unzip function is likely because DOC files are not zip archives like DOCX files. DOC files use the Compound File Binary Format (CFBF), also known as OLE2, which requires a different approach to extract its contents.
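
The two containers can be told apart from the first few bytes of the file, so the format can be checked before choosing a parser. A minimal sketch, assuming the standard signatures (D0 CF 11 E0 A1 B1 1A E1 for CFB, "PK\x03\x04" for a ZIP local file header). The names sniffContainer and startsWith are hypothetical and not part of office-text-extractor or mammoth:

```typescript
// Leading bytes of the two container formats discussed in this thread.
const CFB_SIGNATURE = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];
const ZIP_SIGNATURE = [0x50, 0x4b, 0x03, 0x04]; // "PK\x03\x04"

type ContainerKind = 'cfb' | 'zip' | 'unknown';

// Check whether `bytes` begins with the given signature.
const startsWith = (bytes: Uint8Array, signature: number[]): boolean =>
  bytes.length >= signature.length &&
  signature.every((value, index) => bytes[index] === value);

// Hypothetical helper: classify a file by its leading bytes.
const sniffContainer = (bytes: Uint8Array): ContainerKind => {
  if (startsWith(bytes, CFB_SIGNATURE)) return 'cfb';
  if (startsWith(bytes, ZIP_SIGNATURE)) return 'zip';
  return 'unknown';
};
```

Checking the signature up front avoids relying on the wording of an unzip error message to detect a .doc file. A Node Buffer is a Uint8Array, so it can be passed in directly.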

Here is a sample snippet intended to handle both DOCX and DOC files. It uses the cfb library to open DOC files and continues to use mammoth for DOCX files.

import { type Buffer } from 'node:buffer';
import { extractRawText as parseWordFile } from 'mammoth';
import * as cfb from 'cfb';
import { parseStringPromise as xmlToJson } from 'xml2js';

import type { TextExtractionMethod } from '../lib.js';

export class DocExtractor implements TextExtractionMethod {
  /**
   * The type(s) of input acceptable to this method.
   */
  mimes = [
    'application/x-cfb',
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document',
  ];

  /**
   * Extract text from a DOCX/DOC file if possible.
   *
   * @param input The buffer containing the file.
   * @returns The text extracted from the input.
   */
  apply = async (input: Buffer): Promise<string> => {
    try {
      // Try to parse as DOCX
      const parsedDocx = await parseWordFile({ buffer: input });
      return parsedDocx.value;
    } catch (caughtError: unknown) {
      // If it fails, try to parse as DOC
      const error = caughtError as Error;
      if (error.message?.includes('Corrupted zip or bug')) {
        const contents = await extractDocText(input);
        return contents;
      } else {
        // If it is not a DOC file, let the error propagate.
        throw caughtError;
      }
    }
  };
}

/**
 * Extract text from a DOC file.
 *
 * @param input The buffer containing the file.
 *
 * @returns The extracted text.
 */
const extractDocText = async (input: Buffer): Promise<string> => {
  // Open the compound file and look up the main document stream.
  const cfbFile = cfb.parse(input);
  const wordDocument = cfb.find(cfbFile, 'WordDocument');
  if (!wordDocument) throw new Error('Invalid .doc file, could not find WordDocument.');

  // Decode the stream as UTF-16LE. Note that the WordDocument stream is a
  // binary structure rather than XML, so the XML parsing below may fail on
  // real files.
  const documentBuffer = wordDocument.content;
  const documentXml = documentBuffer.toString('utf16le');
  const json = await xmlToJson(documentXml);
  const lines = await parseDocSection(json);

  // Fall back to an empty string if no text was collected.
  return lines?.join('\n') ?? '';
};

/**
 * Extracts text from a section of the document, recursively.
 *
 * @param docSection The section of the doc, converted to JSON from XML.
 * @param collectedText The lines of text parsed from the document so far.
 *
 * @returns The lines of text in the document.
 */
const parseDocSection = async (
  docSection: any,
  collectedText?: string[],
): Promise<string[] | undefined> => {
  // Keep track of the text being collected.
  const beingCollectedText = collectedText ?? [];

  // Parse the section according to what type it is.
  if (Array.isArray(docSection)) {
    // If it is, loop through the elements of the array.
    for (const element of docSection) {
      // Collect all the pieces of text from the array.
      if (typeof element === 'string' && element !== '') {
        beingCollectedText.push(element);
      } else {
        // However, if it is an object or another array, call this function
        // again to parse that.
        await parseDocSection(element, beingCollectedText);
      }
    }

    // Finally, return the collected text.
    return beingCollectedText;
  }

  // If the section is an object, loop through its properties.
  if (typeof docSection === 'object' && docSection !== null) {
    for (const property of Object.keys(docSection)) {
      // Get the value of the property.
      const value = docSection[property];

      // The `docx` format stores the actual text inside the `w:t` or `_`
      // properties, so extract text from those properties.

      // Check if it is a string or array that contains a string. If it is
      // either, then collect the text content.
      if (typeof value === 'string') {
        if ((property === 'w:t' || property === '_') && value !== '') {
          beingCollectedText.push(value);
        }
      } else if (typeof value?.[0] === 'string') {
        if ((property === 'w:t' || property === '_') && value[0] !== '') {
          beingCollectedText.push(value[0]);
        }
      } else {
        // However, if it is an object or another array, call this function
        // again to parse that.
        await parseDocSection(value, beingCollectedText);
      }
    }

    // Finally, return the collected text.
    return beingCollectedText;
  }
};

Hope this helps. Thanks!