dotnet / Open-XML-SDK

Open XML SDK by Microsoft
https://www.nuget.org/packages/DocumentFormat.OpenXml/
MIT License

How to parse embedded files (OLE objects) in pptx/docx #644

Closed hong1997 closed 4 years ago

hong1997 commented 4 years ago


How can I parse embedded files (OLE objects) in pptx/docx? Most of them are OLE objects, such as object1.bin. Is there a good way to parse them? After unzipping the OLE objects, I see several kinds of format: [four images attached]

I haven't found a good general way to do this. I checked the source code of the Tika parser; it extracts them with a rule-based method...
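One simple rule-based first step is to look at the leading bytes of each embedded part before choosing a parser. The sketch below is only an illustration under assumptions, not something the Open XML SDK provides: `EmbeddedPayloadSniffer` and `PayloadKind` are hypothetical names, and it distinguishes just two cases, an OLE compound file (signature D0 CF 11 E0 A1 B1 1A E1) and a ZIP-based package (signature 50 4B 03 04). Anything else is reported as unknown.

```csharp
using System.IO;

// Hypothetical names for illustration; not part of the Open XML SDK.
public enum PayloadKind { OleCompoundFile, ZipPackage, Unknown }

public static class EmbeddedPayloadSniffer
{
    // Signature of an OLE compound file (structured storage).
    private static readonly byte[] CompoundFileMagic =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Local file header signature of a ZIP archive ("PK\x03\x04").
    private static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };

    public static PayloadKind Sniff(Stream stream)
    {
        var header = new byte[8];
        int read = 0;

        // Stream.Read may return fewer bytes than requested, so loop.
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break; // end of stream
            read += n;
        }

        if (StartsWith(header, read, CompoundFileMagic)) return PayloadKind.OleCompoundFile;
        if (StartsWith(header, read, ZipMagic)) return PayloadKind.ZipPackage;
        return PayloadKind.Unknown;
    }

    private static bool StartsWith(byte[] buffer, int length, byte[] magic)
    {
        if (length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
        {
            if (buffer[i] != magic[i]) return false;
        }
        return true;
    }
}
```

Calling `EmbeddedPayloadSniffer.Sniff(File.OpenRead("object1.bin"))` would tell you whether the part needs a structured-storage reader at all. What sits inside the compound file still has to be handled case by case.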

ashahabov commented 4 years ago

Use the following code example to get the OLE objects from the first slide of a presentation:

using System.Collections.Generic;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using P = DocumentFormat.OpenXml.Presentation;

public static IEnumerable<P.GraphicFrame> GetOleObjects(string pptxFilePath)
{
    using (var doc = PresentationDocument.Open(pptxFilePath, false))
    {
        // Get the first slide part (part order may differ from the slide order shown in PowerPoint).
        var sld = doc.PresentationPart.SlideParts.First().Slide;
        // OLE objects are stored inside graphic frame elements.
        var oleFrames = new List<P.GraphicFrame>();
        foreach (var frame in sld.CommonSlideData.ShapeTree.OfType<P.GraphicFrame>())
        {
            if (frame.Descendants<P.OleObject>().Any())
            {
                oleFrames.Add(frame);
            }
        }

        // The list is fully built before the document is disposed.
        return oleFrames;
    }
}

hong1997 commented 4 years ago

Hi adamshakhabov, thanks for your reply! As far as I know, the OLE object is stored in the embedded object parts (X.MainDocumentPart.EmbeddedObjectParts), and I am asking for a way to parse the OLE object rather than just retrieve it.
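For reference, getting at the raw bytes of those parts might look like the sketch below. It assumes a .docx and only copies each part in `MainDocumentPart.EmbeddedObjectParts` to disk; `EmbeddedObjectDumper` and `DumpEmbeddedObjects` are hypothetical names, and parsing the resulting .bin files is a separate step that this sketch does not attempt.

```csharp
using System.Collections.Generic;
using System.IO;
using DocumentFormat.OpenXml.Packaging;

// Hypothetical helper for illustration; not part of the Open XML SDK.
public static class EmbeddedObjectDumper
{
    // Copies every embedded object part of a .docx into outputDir and
    // returns the paths of the files that were written.
    public static List<string> DumpEmbeddedObjects(string docxPath, string outputDir)
    {
        var written = new List<string>();
        Directory.CreateDirectory(outputDir);

        using (var doc = WordprocessingDocument.Open(docxPath, false))
        {
            var mainPart = doc.MainDocumentPart;
            if (mainPart == null) return written;

            foreach (EmbeddedObjectPart part in mainPart.EmbeddedObjectParts)
            {
                // part.Uri is package-relative, e.g. /word/embeddings/oleObject1.bin
                string fileName = Path.GetFileName(part.Uri.OriginalString);
                string target = Path.Combine(outputDir, fileName);

                using (Stream source = part.GetStream(FileMode.Open, FileAccess.Read))
                using (FileStream destination = File.Create(target))
                {
                    source.CopyTo(destination);
                }

                written.Add(target);
            }
        }

        return written;
    }
}
```

Note that embedded Office documents (for example an embedded .xlsx) are exposed through `EmbeddedPackageParts` rather than `EmbeddedObjectParts`, so a complete extractor would probably need to walk both collections.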

ashahabov commented 4 years ago

Hi @hong1997!

I don't think the Open XML SDK has a dedicated method for reading an OLEObject element (parsing its properties). Can you say more precisely which feature of the OLEObject you are trying to parse?

Also, it would help if you attached a pptx file that contains this OLEObject case.

ThomasBarnekow commented 4 years ago

@hong1997 and @adamshakhabov, GitHub issues are not the place to ask and discuss questions regarding Open XML SDK library usage. You should ask usage-related questions on stackoverflow.com, where you will already find a large number of questions and answers tagged with openxml or openxml-sdk.

In this specific case, another user already asked about how he could extract OLE-embedded files from Word documents, and I provided an accepted answer.

hong1997 commented 4 years ago

@ThomasBarnekow, thanks for the info, I will close the issue. However, the answer you provided only handles one kind of OLE structure. You can see from my description that only the last kind of OLE object can be handled by the class you provided.

lindexi commented 4 years ago

Some OLE objects can be displayed as a WMF image because they contain a fallback element. Here is my code that saves the fallback element to a file: https://github.com/lindexi/lindexi_gd/tree/d182ca9f0cece56d32a801923a1fdffa64f95dfd/NallwerewawchailawileeForeehakel
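A rough sketch of that idea for a .pptx is below. It is not the linked code, and it rests on assumptions: that the preview picture is referenced by an `a:blip` nested somewhere inside the graphic frame that holds the OLE object, and that the blip uses an embedded (not linked) image. `OlePreviewExporter` and `SavePreviewImages` are hypothetical names.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using A = DocumentFormat.OpenXml.Drawing;
using P = DocumentFormat.OpenXml.Presentation;

// Hypothetical helper for illustration; not part of the Open XML SDK.
public static class OlePreviewExporter
{
    // Saves the preview image of each OLE graphic frame and returns the written paths.
    public static List<string> SavePreviewImages(string pptxPath, string outputDir)
    {
        var written = new List<string>();
        Directory.CreateDirectory(outputDir);

        using (var doc = PresentationDocument.Open(pptxPath, false))
        {
            var presentationPart = doc.PresentationPart;
            if (presentationPart == null) return written;

            foreach (SlidePart slidePart in presentationPart.SlideParts)
            {
                var oleFrames = slidePart.Slide
                    .Descendants<P.GraphicFrame>()
                    .Where(f => f.Descendants<P.OleObject>().Any());

                foreach (var frame in oleFrames)
                {
                    // Assumption: the preview picture is an a:blip nested in the frame.
                    foreach (var blip in frame.Descendants<A.Blip>())
                    {
                        string relId = blip.Embed?.Value;
                        if (string.IsNullOrEmpty(relId)) continue; // linked or missing image

                        OpenXmlPart imagePart = slidePart.GetPartById(relId);
                        // Prefix with a counter so previews shared between slides do not collide.
                        string target = Path.Combine(outputDir,
                            written.Count + "_" + Path.GetFileName(imagePart.Uri.OriginalString));

                        using (Stream source = imagePart.GetStream(FileMode.Open, FileAccess.Read))
                        using (FileStream destination = File.Create(target))
                        {
                            source.CopyTo(destination);
                        }

                        written.Add(target);
                    }
                }
            }
        }

        return written;
    }
}
```

This only recovers the picture PowerPoint shows in place of the object. The embedded data itself is untouched, so it helps when a rendering is enough but not when the original file is needed.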

Some OLE objects can be converted using WinForms. See DotNet Heaven: Read OLE Object type image field in C#.net

twsouthwick commented 4 years ago

Thanks everyone for an interesting discussion. This looks to have been resolved so I'll close the issue.