Query PPTX - GitHub issues

kvistgaard commented 7 months ago

I always wanted to query my local slide decks (currently over 300) and, if possible, link them with data from other files.

Although PPTX is sort of a kind of XML, SPARQL Anything does not recognise it as such. I guess it needs dedicated support for fx:media-type "application/vnd.openxmlformats-officedocument.presentationml.presentation".
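To make the "kind of XML" point concrete: a .pptx file is a ZIP package whose parts are mostly XML files, which is why a plain XML reader does not recognise it. The sketch below uses only the Python standard library. The `ppt/slides/slideN.xml` naming is the layout PowerPoint conventionally writes and is an assumption here; `list_slide_parts` and `make_fake_pptx` are hypothetical helper names, and the fake archive only mimics the layout (it is not a valid deck).

```python
import io
import re
import zipfile

# Conventional slide part name in a PowerPoint-produced package
# (assumption: other producers may name parts differently).
SLIDE_PART = re.compile(r"^ppt/slides/slide(\d+)\.xml$")


def list_slide_parts(pptx):
    """Return slide part names from a .pptx path or file object,
    ordered by the number in the part name."""
    with zipfile.ZipFile(pptx) as archive:
        names = archive.namelist()
    numbered = []
    for name in names:
        match = SLIDE_PART.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


def make_fake_pptx():
    """Build a tiny in-memory ZIP that only mimics the package layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("ppt/presentation.xml", "<presentation/>")
        archive.writestr("ppt/slides/slide10.xml", "<sld/>")
        archive.writestr("ppt/slides/slide2.xml", "<sld/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(list_slide_parts(make_fake_pptx()))
```

Note that the number in the part name is not guaranteed to match the order in which slides are shown; that order is recorded separately in the presentation part.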

I believe this will be a highly used feature that's worth implementing.

justin2004 commented 7 months ago

in the short term you could do something like i did here.

sparql anything can't talk to a SQL DB but you can put a little translation in like this:

PREFIX  xyz:  <http://sparql.xyz/facade-x/data/>
PREFIX  fx:   <http://sparql.xyz/facade-x/ns/>
SELECT  *
WHERE
  { SERVICE <x-sparql-anything:>
      { fx:properties
                  fx:command      "echo \\\\d | PGPASSWORD=mysecretpassword psql -h 172.17.0.1 -p 5432 -U postgres --csv -f /dev/stdin" ;
                  fx:media-type   "text/csv" ;
                  fx:csv.headers  "true" .
        []      xyz:Name        ?table_name
      }
  }

it looks like this tool can convert pptx to json (which sparql anything supports)

it would look something like this:

PREFIX  xyz:  <http://sparql.xyz/facade-x/data/>
PREFIX  fx:   <http://sparql.xyz/facade-x/ns/>
SELECT  *
WHERE
  { SERVICE <x-sparql-anything:>
      { fx:properties
                  fx:command      "jodconverter-cli some.pptx /dev/stdout" ;
                  fx:media-type   "application/json" .
        ?s ?p ?o
      }
  }

if you would actually use it i can test and make sure it actually works.

kvistgaard commented 7 months ago

Thanks @justin2004

I don't quite get the option with the SQL DB, since PPTX, unlike PPT, is sort of a compressed XML file.

Regarding the second option, I might use it depending on the performance of the conversion. I have some big decks for training courses that are between 30MB and 100MB.

luigi-asprino commented 7 months ago

Hi,

nice use case :-) Technically, I think we can use the Apache POI library (the same one used for .docx files) to extract information from slide decks. Regarding the representation, it is rather straightforward to see the presentation as a container whose numbered slots are the slides. However, each slide can itself be structured as a document, namely having slots filled with typed containers.


_:presentation a xyz:Presentation ;
   rdf:_1 [
       # slide 1
       rdf:_1 [
           a xyz:Title ; rdf:_1 "Title of the slide1"
       ] ;
       rdf:_2 [
           a xyz:Paragraph ;
           rdf:_1 "First paragraph of the slide 1"
       ]
   ] ;
   rdf:_2 [
       # slide 2
       rdf:_1 [
           a xyz:Title ; rdf:_1 "Title of the slide2"
       ] ;
       rdf:_2 [
           a xyz:Paragraph ;
           rdf:_1 "First paragraph of the slide 2"
       ]
   ] .
...
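The container-of-containers idea above can be sketched in a few lines of Python: the presentation gets one numbered slot (rdf:_1, rdf:_2, ...) per slide, and each slide gets one numbered slot per typed block. This only illustrates the shape proposed in this comment, not the output of the actual triplifier; `facade_x_triples` and the blank node labels are hypothetical.

```python
def facade_x_triples(slides):
    """Build (subject, predicate, object) tuples for a deck.

    `slides` is a list of slides; each slide is a list of
    (type_name, text) pairs such as ("Title", "Intro").
    Prefixed names are kept as plain strings for readability.
    """
    root = "_:presentation"
    triples = [(root, "a", "xyz:Presentation")]
    for slide_no, blocks in enumerate(slides, start=1):
        slide = f"_:slide{slide_no}"
        # The presentation's numbered slot points at the slide container.
        triples.append((root, f"rdf:_{slide_no}", slide))
        for block_no, (type_name, text) in enumerate(blocks, start=1):
            block = f"{slide}_block{block_no}"
            # The slide's numbered slot points at a typed container...
            triples.append((slide, f"rdf:_{block_no}", block))
            triples.append((block, "a", f"xyz:{type_name}"))
            # ...whose first slot holds the text itself.
            triples.append((block, "rdf:_1", f'"{text}"'))
    return triples


if __name__ == "__main__":
    deck = [
        [("Title", "Title of the slide1"),
         ("Paragraph", "First paragraph of the slide 1")],
        [("Title", "Title of the slide2")],
    ]
    for triple in facade_x_triples(deck):
        print(" ".join(triple), ".")
```

With this shape, the slide number is always recoverable from the slot predicate that links the presentation to the slide.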

kvistgaard commented 7 months ago

That would be great. For the last three years, I've been making presentations mostly from my personal knowledge graph, and it is indeed very useful to be able to query them all (in my case with Datalog, but that's very similar to SPARQL, especially SPARQL algebra) down to the level of the smallest element (in my case blocks have URIs, and a block can be a slide or a nested content element, such as a text paragraph, image, video or iframe). Being able to do something similar with PPTX would be great: all files as named graphs, representing datasets with slots of slides and paragraphs.

Sadly, dereferenceable links to slides won't be possible, but that's due to a bad design decision by Microsoft.

luigi-asprino commented 7 months ago

Hi,

a first implementation of the pptx triplifier is ready: https://github.com/SPARQL-Anything/sparql.anything/blob/v0.9-DEV/formats/Slides.md It will be included in the upcoming new release (see #419).

kvistgaard commented 7 months ago

@luigi-asprino Wow, that was fast 👏. Well done!

Some feedback:

Why is the current URI naming convention kept for the Slide and Presentation types, while the rest are all caps? (Was it to distinguish container types from content types?)
In the future, it would be nice to capture and type more elements. I would imagine something like this:

Presentation (Section [Slide (Title) (Paragraph) (Bullet) ] ), where ( ) denotes optional.

Of course, Paragraph and Bullet are the most important, but in big slide decks sections are used often. Here are, for example, the sections from the slide deck of my SPARQL course:

Is there a way to retrieve hyperlinks? (if not those on images, at least those on text)
There is something strange which I'm not able to diagnose. When I run this query

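PREFIX  xyz:  <http://sparql.xyz/facade-x/data/>
PREFIX  fx:   <http://sparql.xyz/facade-x/ns/>
PREFIX  rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX  xsd:  <http://www.w3.org/2001/XMLSchema#>
SELECT ?slideNumber ?type ?text 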
WHERE {
  SERVICE <x-sparql-anything:>
  {
   fx:properties fx:media-type  "application/vnd.openxmlformats-officedocument.presentationml.presentation" ;
   fx:location "https://sparql-anything.cc/examples/Presentation1.pptx"
         .
    ?Presentation a xyz:Presentation ;
                  ?hasSlide [?hasContent [ a ?type ;
                                rdf:_1 ?text ;]
                ]
  }
  BIND (xsd:int(STRAFTER(STR(?hasSlide), "_")) AS ?slideNumber)
}

ORDER BY ?slideNumber ?hasContent

it gives the expected result on the example presentation. But when I tried on a few of mine, it starts from 2. I'll make a few more tests to try to identify the pattern, but I thought that you might have a clue.

luigi-asprino commented 7 months ago

Why is the current URI naming convention kept for the Slide and Presentation types, while the rest are all caps? (Was it to distinguish container types from content types?)

It is due to the naming convention of the underlying library for reading pptx (Apache POI).

In the future, it would be nice to capture and type more elements. I would imagine something like this:

Presentation (Section [Slide (Title) (Paragraph) (Bullet) ] ), where ( ) denotes optional.

Of course, Paragraph and Bullet are the most important, but in big slide decks sections are used often. Here are, for example, the sections from the slide deck of my SPARQL course:

Let's see if sections can be extracted, I opened a dedicated issue #435.

Is there a way to retrieve hyperlinks? (if not those on images, at least those on text)

See #436

There is something strange which I'm not able to diagnose. When I run this query

SELECT ?slideNumber ?type ?text 
WHERE {
  SERVICE <x-sparql-anything:>
  {
   fx:properties fx:media-type  "application/vnd.openxmlformats-officedocument.presentationml.presentation" ;
   fx:location "https://sparql-anything.cc/examples/Presentation1.pptx"
         .
    ?Presentation a xyz:Presentation ;
                  ?hasSlide [?hasContent [ a ?type ;
                                rdf:_1 ?text ;]
                ]
  }
  BIND (xsd:int(STRAFTER(STR(?hasSlide), "rdf:_")) AS ?slideNumber)
}

ORDER BY ?slideNumber ?hasContent

it gives the expected result on the example presentation. But when I tried on a few of mine, it starts from 2. I'll make a few more tests to try to identify the pattern, but I thought that you might have a clue.

Could you please share an example?

Thanks!

kvistgaard commented 7 months ago

Could you please share an example?

Sure. Here it is: https://github.com/kvistgaard/sparql/raw/main/slides/SPARQLcourse_First2Slides.pptx

And I corrected the query. So, the corrected query gets all the slides from your example, and with the one I shared, it skips the first.

(I deleted all but the first two slides from the original deck but kept the sections, so it can be used for tests for #435 if needed.)
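Comparing the two versions of the query in this thread, the correction appears to be the STRAFTER argument: "rdf:_" in the quoted version versus "_" in the edited one. In SPARQL, STR() of an IRI returns the full IRI string, so the prefixed form "rdf:_" never occurs in it, and STRAFTER returns an empty string that cannot be cast to xsd:int. The Python sketch below emulates that behaviour to show the difference; `strafter` and `slide_number` are hypothetical helpers, not part of SPARQL Anything, and the sketch does not explain why the first slide is skipped in the shared deck.

```python
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def strafter(text, needle):
    """Emulate SPARQL STRAFTER: the part of `text` after the first
    occurrence of `needle`, or "" when `needle` does not occur."""
    if needle == "":
        return text  # SPARQL returns the whole string for an empty needle
    _, separator, tail = text.partition(needle)
    return tail if separator else ""


def slide_number(predicate_iri, needle="_"):
    """Extract N from an rdf:_N membership property IRI, else None.
    None stands in for the unbound variable left by a failed cast."""
    suffix = strafter(predicate_iri, needle)
    return int(suffix) if suffix.isdigit() else None


if __name__ == "__main__":
    slot = RDF_NS + "_2"
    print(repr(strafter(slot, "rdf:_")))  # prefixed form: no match
    print(repr(strafter(slot, "_")))      # full IRI contains "_2"
    print(slide_number(slot))
    print(slide_number(RDF_NS + "type"))
```

With "_" as the argument, every rdf:_N predicate yields its number, while rdf:type yields no number and the corresponding row simply has no ?slideNumber.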

SPARQL-Anything / sparql.anything

Query PPTX #429