Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

LangChain + Unstructured: Failed to load file ${filePath} using unstructured loader. #3158

Closed ajaykrupalk closed 3 months ago

ajaykrupalk commented 3 months ago

I am using LangChain's Azure Blob Storage Container Loader to load some JSON files, but loading fails. The same files saved as .txt load fine.

I get the following error for the JSON files: Failed to load file ${filePath} using unstructured loader. Error: Failed to load file C:\Users\DELL\AppData\Local\Temp\azureblobfileloader-ePGCee\log2.json using unstructured loader.

My files are primarily JSON files containing API logs. Also, is there a way to get just a particular key from the JSON rather than the whole file? The code I used is below:

import { AzureBlobStorageContainerLoader } from "@langchain/community/document_loaders/web/azure_blob_storage_container";
import { UnstructuredLoader } from "@langchain/community/document_loaders/fs/unstructured";

const loader = new AzureBlobStorageContainerLoader({
    azureConfig: {
        connectionString: "<connection_string>",
        container: "<container>",
    },
    unstructuredConfig: {
        apiUrl: "http://localhost:8000/general/v0/general",
        apiKey: "<api_key>", // this will soon be required
        logging: true
    },
});

tbs17 commented 3 months ago

Hi @ajaykrupalk, thanks for bringing up the issue. Would you mind also pasting the code that shows how you load your ${filePath} using unstructured, along with the successful example of loading a .txt file?

ajaykrupalk commented 3 months ago

Hi @tbs17, I am not sure whether this comes from LangChain or Unstructured. Since AzureBlobStorageContainerLoader requires Unstructured, I thought it might originate there. The only code beyond what I posted above is const docs = await loader.load(); I would love to know whether this is more of a LangChain or an Unstructured error.

import { AzureBlobStorageContainerLoader } from "@langchain/community/document_loaders/web/azure_blob_storage_container";
import { UnstructuredLoader } from "@langchain/community/document_loaders/fs/unstructured";

const loader = new AzureBlobStorageContainerLoader({
    azureConfig: {
        connectionString: "<connection_string>",
        container: "<container>",
    },
    unstructuredConfig: {
        apiUrl: "http://localhost:8000/general/v0/general",
        apiKey: "<api_key>", // this will soon be required
        logging: true
    },
});

const docs = await loader.load();

For .txt files there is no change in the code; the only difference is that .txt files are uploaded to Azure Blob Storage.

ajaykrupalk commented 3 months ago

Closing this. I worked around it with a custom function and replaced the unstructured module with LangChain's native JSON loader.
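
For anyone with the same question about pulling a single key out of a JSON log, here is a minimal sketch of what such a custom function could look like. It is not the code actually used in this thread, and it does not depend on LangChain or Unstructured: the names collectKey and extractKey, and the sample log fields (requests, path, status, message), are hypothetical and only for illustration. It assumes the blob content has already been downloaded as a string.

```typescript
// Hypothetical helper: not part of LangChain or Unstructured.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Recursively collect every value stored under `key`, at any depth.
function collectKey(node: Json, key: string, found: Json[] = []): Json[] {
  if (Array.isArray(node)) {
    for (const item of node) collectKey(item, key, found);
  } else if (node !== null && typeof node === "object") {
    for (const [k, v] of Object.entries(node)) {
      if (k === key) found.push(v);
      // Keep descending so nested occurrences of the key are also found.
      collectKey(v, key, found);
    }
  }
  return found;
}

// Parse raw JSON text (for example, a downloaded blob) and return the
// matching values as strings; non-string values are re-serialized as JSON.
function extractKey(jsonText: string, key: string): string[] {
  const parsed = JSON.parse(jsonText) as Json;
  return collectKey(parsed, key).map((v) =>
    typeof v === "string" ? v : JSON.stringify(v)
  );
}

// Example with a made-up API log; the field names are illustrative only.
const sampleLog = JSON.stringify({
  requests: [
    { path: "/users", status: 200, message: "ok" },
    { path: "/orders", status: 500, message: "timeout" },
  ],
});
const messages = extractKey(sampleLog, "message");
console.log(messages); // prints [ 'ok', 'timeout' ]
```

Each returned string could then be wrapped in a document object with the blob name as metadata. Since JSON.parse throws on malformed input, a caller would probably want a try/catch around extractKey when the logs are not guaranteed to be valid JSON.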