apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

Lucene indexing for free form text #6189

Closed RestfulBlue closed 4 years ago

RestfulBlue commented 6 years ago

Currently Druid uses classic inverted indexes to index string columns, but these are not really useful for free-form text. It is currently possible to disable indexing to avoid the overhead on such columns, but it would be very useful to be able to enable full-text search instead. For example, with a configuration like this:

    {
      "type": "string",
      "name": "additional_info",
      "indexType": "unindexed" // without bitmap
    },
    {
      "type": "string",
      "name": "hostname",
      "indexType": "default" // current inverted index
    },
    {
      "type": "string",
      "name": "log_record",
      "indexType": "lucene" // lucene indexing
    }

With this capability Druid could be used to store almost everything related to monitoring and log data, making it possible to get fast results for a query like this:

    select
       time_floor(__time, 'PT1H'), count(*)
    from
       system_logs
    where
       log_record satisfy '*something*'
       and hostname = 'node1'
    group by
       time_floor(__time, 'PT1H')
    order by
       time_floor(__time, 'PT1H')

where `satisfy` applies the Lucene filter `log_record:something`.

Adding full-text search would make Druid a universal instrument for monitoring and logging different systems. (Currently, filtering by free-form text requires an almost full scan, which does not work well, so it is necessary to store such data in Solr or Elasticsearch.)

gianm commented 6 years ago

Hi @RestfulBlue, it sounds like an interesting idea. Are you imagining adding a Lucene index as a companion to a Druid segment (i.e. adding one as a new column, maybe) or as an alternate format (a new type of StorageAdapter)?

One other thing you could look at is using Druid's multivalue dimensions. The idea would be to support text search by tokenizing input fields into arrays and storing them as multivalue dimensions. Then, you can do a search by tokenizing the search string the same way and retrieving the relevant terms from the inverted index of the multivalue dimension.
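The tokenize-and-match idea above can be sketched in plain Java (a hypothetical illustration, not Druid's actual ingestion or index code; the split-on-non-alphanumerics rule is an assumption about how one might tokenize at ingestion time):

```java
import java.util.*;
import java.util.stream.*;

public class MultiValueTokenSearch {
    // Tokenize the way a hypothetical ingestion transform might:
    // lowercase and split on any run of non-alphanumeric characters.
    static Set<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase().split("[^a-z0-9]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toSet());
    }

    // A search matches when every token of the query appears in the
    // stored value set -- analogous to the per-value bitmap lookup a
    // multivalue dimension's inverted index provides.
    static boolean matches(Set<String> stored, String query) {
        return stored.containsAll(tokenize(query));
    }

    public static void main(String[] args) {
        Set<String> logRecord = tokenize("ERROR: connection refused on node1:8080");
        System.out.println(matches(logRecord, "connection refused")); // true
        System.out.println(matches(logRecord, "connection timeout")); // false
    }
}
```

Note that this only supports whole-token matching; partial-word queries like `*something*` are exactly what it cannot answer without further preprocessing.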

What do you think?

RestfulBlue commented 6 years ago

Hi, multivalue dimensions will only work in some simple generic cases, for example where logs have a simple form with space-separated words. But even with that form of data, it needs external preprocessing that will grow over time. For example, at first it just splits on spaces; then we realize we also want to split on all special characters; then we realize we also want to search by parts of words, so we move to n-grams, etc. With that, the external preprocessing slowly turns into the things Lucene already does. Also, with that approach we can't simply get the source text back, for example for `select * from table limit 100`, because the data in the multivalue column is split up and optimized for search. So this requires denormalizing the data and costs additional space.
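The n-gram point can be made concrete: supporting `*something*`-style partial matches from a plain inverted index means pre-generating every character n-gram of every token at ingestion time, which multiplies the stored values (a hypothetical sketch, not Druid or Lucene code):

```java
import java.util.*;

public class NGramBlowup {
    // Generate all character n-grams of length n from a token -- the
    // preprocessing a plain inverted index would need in order to
    // answer partial-word queries.
    static List<String> ngrams(String token, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= token.length(); i++) {
            grams.add(token.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // One 9-character token already expands into 7 trigrams,
        // and real log lines contain dozens of tokens.
        System.out.println(ngrams("something", 3));
        // [som, ome, met, eth, thi, hin, ing]
    }
}
```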

A simple Lucene indexing example looks like this:

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    Analyzer analyzer = new StandardAnalyzer();

    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead:
    // Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
    iwriter.addDocument(doc);
    iwriter.close();

    // Now search the index:
    DirectoryReader ireader = DirectoryReader.open(directory);
    IndexSearcher isearcher = new IndexSearcher(ireader);
    // Parse a simple query that searches for "text":
    QueryParser parser = new QueryParser("fieldname", analyzer);
    Query query = parser.parse("text");
    ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs;
    assertEquals(1, hits.length);
    // Iterate through the results:
    for (int i = 0; i < hits.length; i++) {
      Document hitDoc = isearcher.doc(hits[i].doc);
      assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
    }
    ireader.close();
    directory.close();

I think adding it as a new column would be great. The main reason is that Lucene indexing is heavier than simple token indexing. Mixing disabled indexing, token indexing, and Lucene indexing in one table can greatly reduce the total disk space required compared to full Lucene indexing.

navis commented 6 years ago

I've tried text indexing using Lucene (https://www.slideshare.net/navis94/druid-meetup-5th/1). Hope it's helpful to you. There is also supposed to be a presentation at ApacheCon NA about geospatial processing on Druid with Lucene (https://apachecon.dukecon.org/acna/2018/#/scheduledEvent/9d31a2c8e70fc2435).

RestfulBlue commented 6 years ago

@navis sadly, SlideShare is banned in Russia :C

stale[bot] commented 5 years ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

RobotCharlie commented 1 year ago

@RestfulBlue, I don't know your exact use case, but I was able to work around this by using the CONTAINS_STRING scalar function.

See the example below:

WHERE CONTAINS_STRING(log_record, 'something')

You might be able to work around your other use cases by using some of the other scalar functions: https://druid.apache.org/docs/latest/querying/sql-scalar.html
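For reference, CONTAINS_STRING does a plain case-sensitive substring match over each row (so it still scans rather than using an index); the check is equivalent to Java's `String.contains` (a minimal sketch, not Druid's implementation):

```java
public class ContainsStringCheck {
    // Equivalent of Druid's CONTAINS_STRING(expr, str): a case-sensitive
    // substring test, treating a null expression as no match.
    static boolean containsString(String expr, String str) {
        return expr != null && expr.contains(str);
    }

    public static void main(String[] args) {
        System.out.println(containsString("failed: something went wrong", "something")); // true
        System.out.println(containsString("all good", "something")); // false
    }
}
```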