Indexing file content in Solr

Islandora / documentation

Indexing file content in Solr #1043

Open Natkeeran opened 5 years ago

Natkeeran commented 5 years ago

One of the powerful features of Islandora 7.x is its ability to index content in the datastreams. In Islandora 8, we can index field information in content types. However, there is no prescribed way to index file content (e.g., XML or JSON files). What approach will be taken to support this feature in Islandora 8?

seth-shaw-unlv commented 5 years ago

Two possible strategies:

1) Have a file->text extractor service that can update a "transcript" field on the metadata node. (This is what we were planning on.)
2) Index the files/media as independent entities that are returned in search results with links to the related item. It appears search_api_attachments (D8 version still in beta) does this, but it presumes a Node -> Media relationship rather than the Media -> Node relationship we use.
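
The first strategy can be sketched roughly as follows. This is plain Python standing in for whatever would actually be a Drupal action or an external service; the `Node` and `Media` classes, the `field_transcript` field name, and the list of MIME types are all hypothetical and only illustrate the Media -> Node direction of the relationship.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Stand-in for a metadata node (hypothetical, not a Drupal class)."""
    nid: int
    title: str
    fields: dict = field(default_factory=dict)


@dataclass
class Media:
    """Stand-in for a media entity that points at its parent node."""
    mid: int
    mime_type: str
    content: bytes
    parent: Node  # Media -> Node reference


# Assumption: only these types are treated as directly extractable text.
TEXT_MIME_TYPES = {"text/plain", "text/vtt", "text/html"}


def on_media_update(media: Media) -> bool:
    """Copy the media's text into a non-displayed field on the parent node.

    Returns True if the transcript field was updated, False if the media
    type was skipped.
    """
    if media.mime_type not in TEXT_MIME_TYPES:
        return False
    text = media.content.decode("utf-8", errors="replace")
    media.parent.fields["field_transcript"] = text
    return True
```

Once the text lives in a field on the node, whatever already indexes node fields picks it up without knowing anything about the file.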

dannylamb commented 5 years ago

FWIW I was considering @seth-shaw-unlv's first strategy for stuff like (H)OCR and transcripts. Having that as a field you don't display but index is hands down the simplest way to go. Make an action to run in response to updates of a media and have it dump its contents into a field.

For something that would require a transform I'm less certain as to how it would play out. You could transform in Drupal with Twig templates and JSON or make a microservice so you're no longer constrained by PHP's limited XML handling. It all depends on the use case, I guess.
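
As a rough illustration of the microservice option, the sketch below flattens an XML file into a JSON-friendly dictionary using only Python's standard library. The function names and the underscore-joined key scheme are assumptions made for this example, not anything Islandora defines.

```python
import json
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def flatten_xml(xml_text: str) -> dict:
    """Map each element path to the list of text values found at that path."""
    root = ET.fromstring(xml_text)
    out: dict = {}

    def walk(elem, path):
        name = local_name(elem.tag)
        key = f"{path}_{name}" if path else name
        text = (elem.text or "").strip()
        if text:
            out.setdefault(key, []).append(text)
        for child in elem:
            walk(child, key)

    walk(root, "")
    return out


if __name__ == "__main__":
    sample = (
        '<record xmlns="http://example.org/ns">'
        "<title>Hello</title><subject>a</subject><subject>b</subject>"
        "</record>"
    )
    print(json.dumps(flatten_xml(sample), indent=2))
```

Repeated elements become multi-valued lists, which maps naturally onto multi-valued index fields. Attributes and mixed content are ignored here; a real transform would need to decide how to handle them.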

Natkeeran commented 5 years ago

@dannylamb

For full text, the first approach can work.

But there are many modules/use cases out there that require a transform (TEI, oral history, etc.), so having a way to support that would be helpful.

whikloj commented 5 years ago

@Natkeeran could you flesh out the requirement of transform a little bit? I am unclear on how you would use TEI in an Islandora 8 context.

Natkeeran commented 5 years ago

@whikloj In 7.x, you can use custom XSLTs to index TEI elements into Solr. Those Solr fields can then be searched and faceted in Drupal via Islandora Solr Search. We are using this feature in several places:

oral histories index cues in Solr, which are then queried for display
we use Solr to bring back audit, tech/FITS and FOXML info for reporting
TEI is indexed as noted above
VTT and annotations are indexed into Solr in a similar way as well

Though we may not need all the above use cases in 8.x, the question remains whether we need a generic way to index media/datastreams in Solr and then make them available for search, faceting, etc. in Drupal.
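
To make the cue-indexing use case concrete, here is a minimal sketch that turns a WebVTT file into one index document per cue. It is not how Islandora 7.x does it (that goes through XSLTs); the field names, the `_s`/`_t` suffixes, and the `parent_id` scheme are assumptions that would have to match the actual Solr schema.

```python
import re


def parse_vtt_cues(vtt_text: str, parent_id: str) -> list:
    """Return one dict per WebVTT cue, shaped like a Solr document."""
    docs = []
    normalized = vtt_text.replace("\r\n", "\n").strip()
    # Cues are separated by blank lines; the WEBVTT header block has no
    # timing line and is skipped automatically.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            if "-->" not in line:
                continue
            start, _, rest = line.partition("-->")
            parts = rest.split()
            if not parts:
                break  # malformed timing line, skip this block
            docs.append({
                "id": f"{parent_id}-cue-{len(docs)}",
                "parent_id_s": parent_id,
                "cue_start_s": start.strip(),
                "cue_end_s": parts[0],  # drops cue settings such as align:start
                "cue_text_t": " ".join(l.strip() for l in lines[i + 1:]),
            })
            break
    return docs
```

Each cue becomes its own document that links back to the parent item, so a search hit can point at a specific timestamp rather than at the whole recording.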

whikloj commented 5 years ago

My concern is thinking in 7.x terms for 8.

For instance (IMHO) media !== datastream. It's more like media & file == datastream, but even that seems a little wrong, as a datastream in Fcrepo 3 only has one parent. In Drupal 8 we could have multiple content nodes pointing to the same file with separate media entities.

Maybe we need some sort of special entity to store file information. These entities would reference a file and could contain the FITS type data. If more than one node references the file, this data is still only stored once and perhaps not as XML.

Could we convert it to some usable JSON that would be easier to work with? This data is meant to be machine readable.
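
One way to picture such an entity is sketched below: technical metadata is stored once per file, any number of nodes can reference it, and it serializes to JSON. The class names, the attributes, and the choice of the file URI as the key are all hypothetical; they are not taken from FITS or from any Islandora module.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class FileInfo:
    """Technical metadata about one file (illustrative attributes only)."""
    file_uri: str
    mime_type: str
    size_bytes: int
    checksum_sha1: str


class FileInfoStore:
    """Keeps one FileInfo per file, however many nodes reference it."""

    def __init__(self):
        self._by_uri = {}
        self._referrers = {}

    def attach(self, node_id: int, info: FileInfo) -> FileInfo:
        """Record that node_id uses the file; store its info only once."""
        stored = self._by_uri.setdefault(info.file_uri, info)
        self._referrers.setdefault(info.file_uri, set()).add(node_id)
        return stored

    def count(self) -> int:
        """Number of distinct files with stored info."""
        return len(self._by_uri)

    def to_json(self, file_uri: str) -> str:
        """Serialize one file's info plus the nodes that reference it."""
        return json.dumps(
            {
                "file": asdict(self._by_uri[file_uri]),
                "referenced_by": sorted(self._referrers[file_uri]),
            },
            sort_keys=True,
        )
```

Keying on the file rather than on the media or node is the design choice being illustrated: a second node pointing at the same file adds a reference, not a second copy of the metadata.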

I guess what I'm saying is that most people in the Islandora 7.x world have trouble with and then learn to hate the XSLTs. So I think it might be nice to dump them.

But I'm good with XSLTs, so I can go either way.