INL / corpus-frontend

BlackLab Frontend, a feature-rich corpus search interface for BlackLab.

Linking to audio files in an external CDN #444

Closed stcoats closed 1 year ago

stcoats commented 1 year ago

After a few initial hiccups, I've been having good success setting up a corpus using the instructions. Thank you for the good documentation!

I want to link text sentences to an audio file, as has been done in the OpenSonar corpus. Say I have an audio file called my_wav.wav, stored at https://mycdn.com/my_wav.wav. My XML has this structure (simplified):

<?xml version="1.0" ?>
<root>
<document>
<metadata id="id1">
</metadata>
<text>
<p><s n="196" file="my_wav.txt">
<w xml:id="w.4287" pos="PRP" lemma="I">i</w>
<w xml:id="w.4288" pos="VBD" lemma="feel">felt</w>
<w xml:id="w.4289" pos="JJ" lemma="bad">bad</w>
<w xml:id="w.4290" pos="IN" lemma="for">for</w>
<w xml:id="w.4291" pos="VBG" lemma="call">calling</w>
<w xml:id="w.4292" pos="PRP" lemma="you">you</w>
</s></p>
</text>
<externalMetadata id="my_wav.wav"/>
</document>
</root>

My .yaml file contains this:

metadata:

    # Should we store the linked document in our index?
    # (in this case, a field metadataCid will be created that contains a content
    #  store id, allowing you to fetch the original content of the document later)
    store: true

    # Values we need for locating the linked document
    # (matching values will be substituted for $1-$9 below)
    linkValues:

      # The value we need to determine the URL to our metadata
      # (relative to documentPath)
    - valuePath: externalMetadata/@id

    # How to fetch the linked input file containing the linked document.
    # File or http(s) reference. May contain $x (x = 1-9), which will be replaced 
    # with linkValue
    inputFile: https://mycdn.com/$1

    # (Optional)
    # If the linked input file is an archive (zip is recommended because it allows 
    # random access), this is the path inside the archive where the file can be found. 
    # May contain $x (x = 1-9), which will be replaced with (processed) linkValue
    #pathInsideArchive: some/dir/$1

    # Format identifier for indexing the linked file
    inputFormat: my-metadata-format

This does not seem to be working, probably because I haven't quite understood what is going on. How can I achieve this?

jan-niestadt commented 1 year ago

You're trying to use the linked documents feature to link audio files, but this won't work. This feature is useful when you have some external XML or CSV file with metadata like title, author, etc. that you want to apply to a document while indexing.

If you just want a play button in the full document view (generated from XML using XSLT) that plays an external audio file, you should be able to do that just by crafting the XSLT to determine the correct URL for the audio file for a sentence, paragraph, etc.

If you want to show a play button when you click on a hit in the results view, you will need to index any required information (audio file name/number, start/end timecodes) as annotations for each word, so that your custom.js can determine how to play the correct audio for any hit. (I take it you've already seen https://github.com/INL/corpus-frontend#custom-js ?)

Here's the relevant part of our OpenSonar format config file (.blf.yaml):

annotatedFields:

  contents:
    wordPath: .//folia:w

    # If specified, a mapping from this id to token position will be saved, so we
    # can refer back to it for standoff annotations later.
    tokenPositionIdPath: "@xml:id"

    annotations:
    - name: word
      valuePath: folia:t

    # Store part of the xml:id attribute so we can find the corresponding audio file
    # (xml id contains the document and sentence ids, which identifies the audio file)
    - name: _xmlid
      valuePath: "@xml:id"           # NOTE: xml:id of w tag
      isInternal: true
      process:
      - action: replace
        find:    "^[^\\.]*\\.(.*)$"  # find first .
        replace: "$1"                # keep everything after that

    # Separate standoff annotations give the begin and end time for each word.
    # We refer back to the tokenPositionIdPath captured above so they are indexed at the correct position.
    standoffAnnotations:
    - path: //timesegment                   # Element containing the values to index
      refTokenPositionIdPath: wref/@xml:id  # What token position(s) to index these values at
      annotations:                          # Annotation(s) to index there
      - name: begintime
        valuePath: ../@begintime
        isInternal: true
      - name: endtime
        valuePath: ../@endtime
        isInternal: true
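
The `process` replace action in the config above can be sketched in plain JavaScript. This is my own illustration of what that regex does to an xml:id like the ones in Jan's FoLiA excerpt (the function name is mine, not part of BlackLab):

```javascript
// Hypothetical JS equivalent of the `process` replace action above:
// strip everything up to and including the first '.' of an xml:id,
// so "fn007233.1.1" becomes "1.1".
function stripDocumentId(xmlId) {
  return xmlId.replace(/^[^.]*\.(.*)$/, '$1');
}

console.log(stripDocumentId('fn007233.1.1')); // → "1.1"
```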
jan-niestadt commented 1 year ago

And here's a small excerpt from the XML data:

<s speaker="BACKGROUND" xml:id="fn007233.1">
  <w xml:id="fn007233.1.1">
    <t>achtergrondmuziek.</t>
      <pos class="SPEC(comment)" head="SPEC">
      <feat class="comment" subset="spectype"/>
      </pos>
      <lemma class="_"/>
    </w>
  <timing>
  <timesegment begintime="00:00:00.000" endtime="00:05:29.048">
    <wref id="fn007233.1.1" t="achtergrondmuziek."/>
  </timesegment>
</timing>
</s>
stcoats commented 1 year ago

Thank you, once again, for such a speedy response! The https://github.com/INL/corpus-frontend#custom-js page seems to be exactly what I need. It will take some time for me to figure out how this works, so I will follow up once I have something working.

KCMertens commented 1 year ago

Big reply, but you've picked probably the most complex thing you can do in the frontend, strap in! (I'm more than aware this isn't ideal)

The reason it's a little involved is twofold:


First, index the relevant info about the audio file in BlackLab. It seems your audio is one file per document, so we'll store the audio file name in the document metadata. The example from our OpenSonar config stores it per word (as annotations begintime and endtime), as we have more precise info.


metadata: 
- containerPath: document
  fields: 
  - name: audiofile
    isInternal: true # this doesn't do anything in BlackLab, but prevents the corpus-frontend from showing a filter for this field
    valuePath: ./externalMetadata/@id 

Now to configure the frontend, we'll need to add a custom snippet of javascript.


In search.xml, add the following line, importing your newly created custom.js file on the search page: <CustomJs page="search">${request:corpusPath}/static/custom.js</CustomJs>


Now edit your custom.js and configure the plugin that renders the audio button. There's a little more documentation in the Readme; if you need it, search for "audio player". But for your example this should work.

/*  The context object contains the following information:
  {
    corpus: string, 
    docId: string, // the document id
    snippet: BLTypes.BLHitSnippet, // the raw hit info as returned by blacklab
    document: BLTypes.BLDocInfo, // the document metadata (just a key-value map of all metadata, values contained in arrays!)
    documentUrl: string, // url to view the document in the corpus-frontend
    wordAnnotationId: string, // configured annotation to display for words (aka vuexModules.ui.results.hits.wordAnnotationId)
    dir: 'ltr'|'rtl',
    citation: {
      left: string;
      hit: string;
      right: string;
    }
  }

  The returned object should have the following shape:
  {
    name: string; // unique name for the widget you're rendering, can be anything
    component?: string; // (optional) name of the Vue component to render, component MUST be globally installed using vue.component(...)
    element?: string; // when not using a vue component, the name of the html element to render, defaults to 'div'
    props?: any; // attributes on the html element (such as 'class', 'tabindex', 'style' etc.), or props on the vue component
    content?: string // html content of the element, or content of the default slot when using a vue component
    listeners?: any; // event listeners, passed to v-on, so 'click', 'hover', etc. 
  }
*/
vuexModules.ui.getState().results.hits.addons.push(function(context) {
  return {
      component: 'AudioPlayer', // don't change this!
      name: 'audio-player', // this may be whatever
      props: {
          docId: context.docId, // for caching
          startTime: 0,
          endTime: Number.MAX_SAFE_INTEGER, // since we don't have a defined endtime, just set a high number
          url: `${your_cdn}/${context.document.audiofile[0]}`
      },
  }
})

Now edit your newly created article.xsl. Explaining XSLT is a little out of scope for this, but luckily BlackLab can generate a basic setup. Make sure the .blf.yaml file you used to index your corpus is loaded in blacklab-server (you might have to edit BlackLab's config file to do that), or the next step will 404. Then go to http://localhost:8080/blacklab-server/input-formats/${my_format_name}/xslt and save the result as article.xsl. You can then add the snippet that renders the play button. We use the following setup for OpenSonar (see the snippet Jan posted above for the corresponding XML); you'll have to edit it to match your document structure.

<xsl:variable name="audiofile" select=".//externalMetadata/@id"/>
<xsl:variable name="begintime" select="'0'"/>
<xsl:variable name="endtime" select="'999999'"/>
<button type="button" class="btn btn-sm btn-default audio-button">
    <xsl:attribute name="data-audio-start"><xsl:value-of select="$begintime"/></xsl:attribute>
    <xsl:attribute name="data-audio-end"><xsl:value-of select="$endtime"/></xsl:attribute>
    <xsl:attribute name="data-audio-file"><xsl:value-of select="$audiofile"/></xsl:attribute>
    <span class="fa fa-play"></span>
</button>

Then we have a corresponding javascript file that brings it to life; I've attached it: article_enable_audio.js.txt

You'll have to edit that: we share one audio file across many play buttons, so there's some caching involved in the script, and we have timing information. But it should be enough to get you started. Anyway, stick that js file in the static dir and import it on the article view page in search.xml: <CustomJs page="article">${request:corpusPath}/static/article_enable_audio.js</CustomJs>
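
Since the attached article_enable_audio.js.txt isn't reproduced in this thread, here is a minimal sketch of what such a script might look like. This is my own illustration, not the attached file; the CDN base URL is an assumption, and it omits the caching and timing logic mentioned above:

```javascript
// Hypothetical sketch: wire up the .audio-button elements rendered by article.xsl.
const CDN_BASE = 'https://mycdn.com'; // assumption: your CDN base URL

// Pure helper: turn a button's data-* attributes into playback settings.
function audioSettings(dataset) {
  return {
    url: CDN_BASE + '/' + dataset.audioFile,        // from data-audio-file
    start: parseFloat(dataset.audioStart) || 0,     // from data-audio-start
    end: parseFloat(dataset.audioEnd) || Infinity,  // from data-audio-end
  };
}

// Wiring (browser only): one shared audio element, seek and play on click.
if (typeof document !== 'undefined') {
  const player = new Audio();
  document.querySelectorAll('.audio-button').forEach(btn => {
    btn.addEventListener('click', () => {
      const s = audioSettings(btn.dataset);
      if (player.src !== s.url) player.src = s.url;
      player.currentTime = s.start;
      // pause once we pass the configured end time
      player.ontimeupdate = () => { if (player.currentTime >= s.end) player.pause(); };
      player.play();
    });
  });
}
```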

stcoats commented 1 year ago

Wow, thank you for this detailed reply! It will take me some time to go through this and try things out.

A question: I already have javascript code that can fetch something from a CDN, render an audio player, and play the file on an html page. Is there a way to put this into the html-rendered search.xml page, so as not to have to deal with XSLT scripts?

Or even better, render the search interface directly in HTML, use scripts to show the hits and context, etc., and forgo XML and XSLT entirely?

KCMertens commented 1 year ago

I'm not sure I completely understand what you want. But sure, you can add whatever javascript you want on any page by adding the <CustomJs> tag in search.xml. No further config needed.

But how will you know what audio file to play for which hit? Also, the search page is very dynamic: the HTML changes constantly when you perform a new search or load another page of results. So if you insert an audio player somewhere in the table, it will be thrown away when new results are shown (in the best case). You will have to take care of disposing of old audio players, creating new ones, etc.

It'll probably work when viewing a single document, though. That page is a lot less complex: you just get some HTML, and it won't change.

stcoats commented 1 year ago

My structure is basically

<document>
<metadata>
...
</metadata>
<s link = "link1">
<w xml:id="w.1" pos="PRP" lemma="I">i</w>
<w xml:id="w.2" pos="VBD" lemma="feel">felt</w>
<w xml:id="w.3" pos="JJ" lemma="good">good</w>
</s>
<s link = "link2">
<w xml:id="w.4" pos="PRP" lemma="you">you</w>
<w xml:id="w.5" pos="VBD" lemma="feel">felt</w>
<w xml:id="w.6" pos="JJ" lemma="bad">good</w>
</s>
</document>

I am hoping that for each hit on a word/lemma/pos, I can insert an audio player below the hit, using the link in the <s> element for the sentence that word is in. I will experiment with doing it on the hits page and the document page.

Thank you once again for your quick responses! You and Jan are super helpful. 😊

stcoats commented 1 year ago

I can't get a custom search.xml page working for a corpus I created called test. I copied the search.xml from opt/tomcat/apache-tomcat-9.0.78/webapps/corpus-frontend-3.1.0/WEB-INF/classes/interface-default, then made a few changes to it, and put it in

/corporaInterfaceDataDir
  /test
    search.xml
    /static

I then changed the corpus-frontend-3.1.0.properties file by commenting out corporaInterfaceDataDir=/etc/blacklab/projectconfigs/ and adding corporaInterfaceDataDir=/etc/blacklab/corporaInterfaceDataDir/test/.

When I restart the frontend, it still uses the default search.xml. What am I doing wrong?

KCMertens commented 1 year ago

adding corporaInterfaceDataDir=/etc/blacklab/corporaInterfaceDataDir/test/.

Remove the trailing test/ and it should work for you :)
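
In other words, the property should point at the parent directory; the frontend then looks for a subdirectory named after each corpus (here, test) inside it:

```properties
# corpus-frontend-3.1.0.properties
corporaInterfaceDataDir=/etc/blacklab/corporaInterfaceDataDir
```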

stcoats commented 1 year ago

Thanks, that works! I'm using customJS to make some minor changes as per the instructions at https://github.com/INL/corpus-frontend#custom-js. Most javascript seems to work, including example vue.js functions provided on the page. I can't get "Customize the display of document titles in the results table" to work.

Here's my metadata structure:

<?xml version="1.0" ?>
<root>
<document>
<metadata id="id1">
<meta name="video_title">test_video1</meta>
(other metadata fields)
</metadata>
...

I tried putting the default

vuexModules.ui.getState().results.shared.getDocumentSummary = function(metadata, specialFields) {
  return 'The document is: ' + metadata[specialFields.titleField][0];
}

and

vuexModules.ui.getState().results.shared.getDocumentSummary = function(metadata, specialFields) {
  return 'The document is: ' + metadata[specialFields.video_title][0];
}

in the function, but on the page it displays

The document is: /path/to/indexed/xml/files/test_xml_1.xml above each snippet.

KCMertens commented 1 year ago

Getting it directly from the metadata object should work: 'The document is: ' + metadata.video_title[0]. The second snippet probably crashes, and that's why you don't see anything happen; check the javascript console and it'll probably show an exception.

specialFields only contains the names of fields in metadata, not the actual metadata itself.
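
To make the distinction concrete, here's a sketch of the corrected callback, extracted as a plain function for illustration (assuming your metadata field is named video_title, as in the example above):

```javascript
// metadata maps field names to arrays of values; specialFields holds only
// the *names* of the special fields (e.g. titleField), not their values.
function getDocumentSummary(metadata, specialFields) {
  // read the custom field directly from the metadata map
  return 'The document is: ' + metadata.video_title[0];
}

// In custom.js you would register it like:
// vuexModules.ui.getState().results.shared.getDocumentSummary = getDocumentSummary;
```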

stcoats commented 1 year ago

Sorry to keep asking questions! I can't get these examples at https://github.com/INL/corpus-frontend#custom-js to work: "A table with whatever data you wish to show", "A pie chart displaying the frequency of an annotation's values", and "A graph showing growth of annotations in the document".

I copy-pasted the three code blocks into custom.js, but nothing happens. The console says

custom.js?_1378044326:39 Uncaught TypeError: vuexModules.root.actions.distributionAnnotation is not a function
    at custom.js?_1378044326:39:26

Just including one of the code blocks also fails, with a similar error message in the console.

I am probably overlooking something obvious!

KCMertens commented 1 year ago

You're probably running the same script on both the search and docs pages, right? The pages don't have the same customization options, so that function will be undefined on one of the two. You'll have to add checks, or split your custom.js into two scripts and only include each on its specific page. Here's how to do that:
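
For example, based on the CustomJs lines shown earlier in this thread (the file names here are placeholders), search.xml could include one script per page:

```xml
<!-- only loaded on the search (results) page -->
<CustomJs page="search">${request:corpusPath}/static/custom.search.js</CustomJs>
<!-- only loaded on the article (document view) page -->
<CustomJs page="article">${request:corpusPath}/static/custom.article.js</CustomJs>
```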

If it still crashes, let me know, in that case I'll need to investigate a little deeper.

stcoats commented 1 year ago

Thanks, that worked.

Regarding the attempt to play an audio file in the results table for each hit: based on Jan's response, I discarded the XML structure I first proposed:

you will need to index any required information (audio file name/number, start/end timecodes) as annotations for each word

The longer reply you wrote then explains how one might set things up for a link in the externalMetadata field to one audio file per document. However, I have many audio files per document, so I re-wrote my file converter to restructure the XML files. Now, in the structure below, each <s> tag corresponds to one short audio file stored at a CDN, containing all of the words in that sentence. So, for example, id1 would be the identifier for a .wav of the speaker saying "I felt good", id2 for a clip of the speaker saying "You felt good", and so on.

<document>
<metadata>
...
</metadata>
<text>
<s id = "id1">
<w xml:id="w.1" pos="PRP" lemma="I">i</w>
<w xml:id="w.2" pos="VBD" lemma="feel">felt</w>
<w xml:id="w.3" pos="JJ" lemma="good">good</w>
</s>
<s id = "id2">
<w xml:id="w.4" pos="PRP" lemma="you">you</w>
<w xml:id="w.5" pos="VBD" lemma="feel">felt</w>
<w xml:id="w.6" pos="JJ" lemma="bad">good</w>
</s>
</text>
</document>

Is it possible to adapt the strategy you suggested for a structure like this? I don't need start and end times for the audio clips, at least for now, because I would like to play the audio for the entire sentence for a hit on any word in that sentence. I would like https://mycdn.com/path/id1.wav to be retrieved for the words with xml:id="w.1" or "w.2" or "w.3", and https://mycdn.com/path/id2.wav to be retrieved for the ids "w.4", "w.5", "w.6", etc.

I know almost nothing about XSLT, but could the line in your example <xsl:variable name="audiofile" select=".//externalMetadata/@id"/> be changed to <xsl:variable name="audiofile" select="../s/@id"/> or something along those lines, i.e. grab the parent element of the word hit and get the id attribute from that?

As far as I understood, Jan's suggestion was to re-do the xml files to have the annotation for the audio file included for each word. For my data, that would look something like this:

<s id = "id1">
<w xml:id="id1.w.1" pos="PRP" lemma="I">i</w>
<w xml:id="id1.w.2" pos="VBD" lemma="feel">felt</w>
<w xml:id="id1.w.3" pos="JJ" lemma="good">good</w>
</s>
<s id = "id2">
<w xml:id="id2.w.4" pos="PRP" lemma="you">you</w>
<w xml:id="id2.w.5" pos="VBD" lemma="feel">felt</w>
<w xml:id="id2.w.6" pos="JJ" lemma="bad">good</w>
</s>

A possible problem with this is that the identifying annotation codes for the audio files are quite long (e.g. f15-GX8-qszPE_0003301000290812_127), and each sentence can contain many words. Would this not make the size of the index (and entire installation) significantly greater? If not, because Lucene can handle that easily, then perhaps that is the best way to go?

Thanks once again for your willingness to help a neophyte with no development experience!

jan-niestadt commented 1 year ago

Putting the sentence id in every word id would work, but isn't necessary. Your suggestion to "grab" the sentence id from the <s/> tag while indexing is the right approach, I think, and at first glance, the XPath expression ../s/@id seems like it should work. Good luck!

stcoats commented 1 year ago

Hello again.

I can't get an audio file to play if I use this javascript in custom.search.js:

vuexModules.ui.getState().results.hits.addons.push(function(context) {
  return {
      component: 'AudioPlayer', // don't change this!
      name: 'audio-player', // this may be whatever
      props: {
          docId: context.docId, // for caching
          startTime: 0,
          endTime: Number.MAX_SAFE_INTEGER, // since we don't have a defined endtime, just set a high number
          url: `https://mycdn.com/${context.document.audiofile[0]}`
      },
  }
})

If I change url: `https://mycdn.com/${context.document.audiofile[0]}` to a fixed url (like url: `https://mycdn.com/file1.mp3`), it plays. I've tried ${context.document.audiofile}, ${context.text.audiofile[0]}, ${context.text.audiofile}, ${context.audiofile}, ${context.document.text.audiofile}, etc.

What should the variable be?

In my blf.yaml I have

annotatedFields:

  contents:
    containerPath: text
    wordPath: .//w

    annotations:
    - name: word
      valuePath: .
      sensitivity: sensitive_insensitive

    - name: lemma
      valuePath: "@lemma"
      sensitivity: sensitive_insensitive

    - name: pos
      valuePath: "@pos"

    - name: audiofile
      valuePath: "../@id"

    inlineTags:
      - path: .//s

The code seems to be fetching the correct variable into the DOM. But I don't know what the correct variable is to get it out of there. (screenshot of console log)

KCMertens commented 1 year ago

Instead of ${context.document.audiofile[0]}, try ${context.snippet.match.audiofile[0]}.

Looking at your blf.yaml and your last screenshot, the audio file name is actually stored per-word, not in the document metadata. In that console log, anything in docInfos is document metadata, anything in hits is per-word information. Note the context object in javascript looks slightly different:

type context = {
    corpus: string, 
    docId: string, // the document id
    snippet: BLTypes.BLHitSnippet, // the raw hit info as returned by blacklab
    document: BLTypes.BLDocInfo, // the document metadata (just a key-value map of all metadata, values contained in arrays!)
    documentUrl: string, // url to view the document in the corpus-frontend
    wordAnnotationId: string, // configured annotation to display for words (aka vuexModules.ui.results.hits.wordAnnotationId)
    dir: 'ltr'|'rtl',
    citation: {
      left: string;
      hit: string;
      right: string;
    }
  }
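
As an illustration of the difference (the exact values here are invented), snippet.match holds parallel arrays of per-word annotation values, so the audiofile annotation indexed in the blf.yaml above shows up there rather than in the document metadata:

```javascript
// Hypothetical shape of context.snippet.match for a one-word hit,
// given the annotations indexed in the blf.yaml above:
const snippet = {
  match: {
    word: ['felt'],
    lemma: ['feel'],
    pos: ['VBD'],
    audiofile: ['id1'], // the sentence's @id, indexed on each word
  },
};

// Build the playback URL the same way as in the addon:
const url = `https://mycdn.com/${snippet.match.audiofile[0]}`;
```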

Thanks for the detailed questions by the way, very helpful!

stcoats commented 1 year ago

Awesome, thank you, and also for the structure of the context object, I am learning a lot! With the change in the javascript function you suggested, the button now plays the desired clip in the results table.

However, if I play a hit, then move down in the table and click the play button for a different hit, it plays the same clip. If I refresh the page and go directly to the second hit, it plays the correct clip.

Is there a way to automatically refresh this, so that when staying on the same results page, one can play different clips?

(screenshot)

KCMertens commented 1 year ago

Uh, whoops, that's a bug! It seems we're caching audio players by their docId, which makes no sense; it should just cache based on the url. I'll get that fixed, but for now you can work around it by passing the url in props.docId.

stcoats commented 1 year ago

Is this what you mean?

vuexModules.ui.getState().results.hits.addons.push(function(context) {
  return {
      component: 'AudioPlayer', // don't change this!
      name: 'audio-player', // this may be whatever
      props: {
          docId: `https://mycdn.com/${context.snippet.match.audiofile[0]}`,
          startTime: 0,
          endTime: 999999999 
      },
  }
})

If I do this, clicking on the play button doesn't play the audio.

KCMertens commented 1 year ago

vuexModules.ui.getState().results.hits.addons.push(function(context) {
  return {
      component: 'AudioPlayer', // don't change this!
      name: 'audio-player', // this may be whatever
      props: {
          docId: `https://mycdn.com/${context.snippet.match.audiofile[0]}`,
          url: `https://mycdn.com/${context.snippet.match.audiofile[0]}`,
          startTime: 0,
          endTime: 999999999 
      },
  }
})
stcoats commented 1 year ago

Perfect, thank you very much! I am still in the process of testing things out, but with the help of you and Jan I have established the basic functionality I need. I'll be back with more questions later. 😎