QSFT / Doradus

Doradus is a REST service that extends a Cassandra NoSQL database with a graph-based data model, advanced indexing and search features, and a REST API. See also doradus-openshift-quickstart, which allows Doradus to run as an OpenShift cartridge!
Apache License 2.0

Loaded data from csv file with doradus-client is not available for query #10

Closed dzlab closed 9 years ago

dzlab commented 9 years ago

I'm trying to use the Doradus server. I've set it up correctly (it's up and running), but I have a problem when I try to use the doradus-client to push a TSV file.

When I launch the client I don't see any problem in the output, but no data is actually pushed:

$ java -cp target/doradus-client-2.3.1.jar:target/dependency/* com.dell.doradus.client.utils.CSVLoader -schema BIM.json -delimiter \t
17:32:47,337 INFO CSVLoader: Deleting existing application: BIM
17:32:48,333 INFO CSVLoader: Creating application 'BIM' with schema: /Users/dzlab/Workspace/DB/Doradus/doradus-client/BIM.json
17:32:48,896 INFO CSVLoader: Starting 3 workers
17:32:48,922 INFO CSVLoader: Scanning for files in folder: /Users/dzlab/Workspace/DB/Doradus/.
17:32:48,922 INFO CSVLoader: Loading CSV file: /Users/dzlab/Workspace/DB/Doradus/./sample_withheader.csv
17:32:54,896 INFO CSVLoader: ...loaded 10000 records.
17:32:58,824 INFO CSVLoader: ...loaded 20000 records.
17:33:01,635 INFO CSVLoader: ...loaded 30000 records.
17:33:04,507 INFO CSVLoader: ...loaded 40000 records.
17:33:07,140 INFO CSVLoader: ...loaded 50000 records.
17:33:09,513 INFO CSVLoader: ...loaded 60000 records.
17:33:11,690 INFO CSVLoader: ...loaded 70000 records.
17:33:14,581 INFO CSVLoader: ...loaded 80000 records.
17:33:16,807 INFO CSVLoader: ...loaded 90000 records.
17:33:19,053 INFO CSVLoader: ...loaded 100000 records.
17:33:21,747 INFO CSVLoader: ...loaded 110000 records.
17:33:24,079 INFO CSVLoader: ...loaded 120000 records.
17:33:26,720 INFO CSVLoader: ...loaded 130000 records.
17:33:29,138 INFO CSVLoader: ...loaded 140000 records.
17:33:31,547 INFO CSVLoader: ...loaded 150000 records.
17:33:34,020 INFO CSVLoader: ...loaded 160000 records.
17:33:36,602 INFO CSVLoader: ...loaded 170000 records.
17:33:38,899 INFO CSVLoader: ...loaded 180000 records.
17:33:41,408 INFO CSVLoader: ...loaded 190000 records.
17:33:43,791 INFO CSVLoader: ...loaded 200000 records.
17:33:46,367 INFO CSVLoader: ...loaded 210000 records.
17:33:49,911 INFO CSVLoader: ...loaded 220000 records.
17:33:52,121 INFO CSVLoader: ...loaded 230000 records.
17:33:54,300 INFO CSVLoader: ...loaded 240000 records.
17:33:56,982 INFO CSVLoader: ...loaded 250000 records.
17:33:59,444 INFO CSVLoader: ...loaded 260000 records.
17:34:02,955 INFO CSVLoader: ...loaded 270000 records.
17:34:08,699 INFO CSVLoader: ...loaded 280000 records.
17:34:11,146 INFO CSVLoader: ...loaded 290000 records.
17:34:14,034 INFO CSVLoader: ...loaded 300000 records.
17:34:17,071 INFO CSVLoader: ...loaded 310000 records.
17:34:17,794 INFO CSVLoader: File '/Users/dzlab/Workspace/DB/Doradus/./sample_withheader.csv': time=88867 millis; lines=313000
17:34:18,418 INFO CSVLoader: Total files loaded:    1
17:34:18,418 INFO CSVLoader: Total lines scanned:   313000
17:34:18,418 INFO CSVLoader: Total bytes read:      158616510
17:34:18,418 INFO CSVLoader: Total time for load:   1 minute, 29 seconds
17:34:18,418 INFO CSVLoader: Average lines/sec:     3497
17:34:18,418 INFO CSVLoader: Average bytes/sec:     1772330

Process finished with exit code 0

But when I check the logs from the doradus-server, I only see the creation of schema:

...
17:32:48,342 INFO SchemaService: Defining application: BIM
17:32:48,344 INFO CassandraSchemaMgr: Creating ColumnFamily: Doradus:BIM_sample
17:32:48,454 INFO CassandraSchemaMgr: Creating ColumnFamily: Doradus:BIM_sample_Terms

When I check the urls, there is nothing:

GET http://localhost:1123/_tasks
<tasks/>
GET http://localhost:1123/_olapp
Applications    Shards  Tables

The application is actually created:

GET http://localhost:1123/_applications
<applications>
  <application name="BIM">
  <key>BIM_key</key>
  <options>
    <option name="StorageService">SpiderService</option>
    <option name="AutoTables">true</option>
  </options>
  <tables>
    <table name="sample"/>
  </tables>
</application>

But no docs:

GET http://localhost:1123/BIM/sample/_query?q=*
<results>
  <docs/>
</results>

In debug mode, when I check the BatchResult I see 1000 ObjectResults with the following error: Invalid field name: .... This field is a Double and it is correctly set in the JSON schema as well as in the CSV file header. Are there any naming constraints on scalar fields?

RandyGuck commented 9 years ago

Sorry for the late reply--just got back from vacation. You're right that the CSVLoader only handles comma separators, and it requires a first row with column names. But you can set the application name via the "-app" parameter.

Since you're not explicitly defining any fields, every field will be loaded as text, so every value should be accepted. However, field (column) names must follow identifier rules: first character must be a letter; all other characters must be letters or digits or underscores; names are case-sensitive.
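Those identifier rules (first character a letter; remaining characters letters, digits, or underscores) can be checked up front before loading. A minimal sketch, assuming ASCII letters; the class and method names here are illustrative, not part of the Doradus client:

```java
import java.util.regex.Pattern;

public class FieldNameCheck {
    // Identifier rule as described above: first char a letter, the rest
    // letters, digits, or underscores. Names are case-sensitive, which a
    // regex match preserves naturally.
    private static final Pattern IDENT = Pattern.compile("[A-Za-z][A-Za-z0-9_]*");

    public static boolean isValidFieldName(String name) {
        return name != null && IDENT.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidFieldName("Price2_usd")); // true
        System.out.println(isValidFieldName("2price"));     // false: starts with a digit
        System.out.println(isValidFieldName("unit price")); // false: contains a space
    }
}
```

Running a check like this over the CSV header row before invoking CSVLoader would surface an "Invalid field name" problem immediately instead of after a full load.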

If that doesn't explain what's happening, post a few lines of a typical input file and I'll see if I can debug it.

dzlab commented 9 years ago

Thanks for the reply, but what about the misleading log CSVLoader: ...loaded 10000 records. when nothing was actually uploaded because of the Invalid field name: ... error (which should be logged)?

Also, I'm seeing really bad performance (several minutes, I think around 30 min) when ingesting a dataset of 0.3M docs, each with 93 columns (no links, only scalar fields). I've split the data into many batches of 10k docs and use finagle-http (I have a Scala client) to send the requests (JSON). Queries are also slow: a DISTINCT aggregate on one attribute takes around 40 (both Cassandra and the Doradus server run on a MacBook).
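For reference, the batch submission described above can be sketched with the JDK's own HTTP client instead of finagle-http. The {"batch":{"docs":[{"doc":{...}}]}} body shape is an assumption based on the Doradus REST conventions, not something confirmed in this thread, and the BIM/sample endpoint just reuses the names from the logs above:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BatchPost {
    // Assemble a batch body of the assumed shape from per-doc JSON snippets.
    static String batchBody(String... docJson) {
        StringBuilder sb = new StringBuilder("{\"batch\":{\"docs\":[");
        for (int i = 0; i < docJson.length; i++) {
            if (i > 0) sb.append(',');
            sb.append("{\"doc\":").append(docJson[i]).append('}');
        }
        return sb.append("]}}").toString();
    }

    public static void main(String[] args) throws Exception {
        String body = batchBody("{\"field_name\":\"42.0\"}");
        // POST to the application/table endpoint (host/port taken from this thread).
        HttpRequest req = HttpRequest.newBuilder(URI.create("http://localhost:1123/BIM/sample"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        // Uncomment against a running server:
        // HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(body);
    }
}
```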

P.S. Hope you had a great time.

RandyGuck commented 9 years ago

I just fixed one issue with the CSVLoader: when one object in the batch fails, the overall batch status is set to "warning", and warning-only batches weren't getting reported. If there's an invalid field name or value, you should see CSVLoader report these now.

The misleading progress reporting (...loaded xxx records) is also fixable, with more work. The problem is that this log message is generated by the main thread as it parses and queues records for the worker threads; at the time of reporting, it doesn't know how many records actually succeeded. But this could be changed to something like "...queued xxx records, yyy loaded successfully." I'll take a look at that next week.

dzlab commented 9 years ago

Will these changes be available in the next release? And what about the second part of my previous comment, about how Doradus is performing?

RandyGuck commented 9 years ago

These changes are in the master branch, so you can download and build it if you like. Otherwise, they will be in the next release; however, we just created the v4.2 release and probably won't create another one for a while.

As for the performance issues: Spider databases are OK for moderate data volumes (millions of objects, but not billions) and moderate query requirements. Spider uses fully inverted indexing, so update performance is proportional to the number of fields indexed. Text fields generate the most mutations since they are parsed into terms. What kind of load rate (objs/sec) are you seeing? The queries Spider is best at are "needle in the haystack" queries, such as finding objects where a field contains some term or falls in some range. Aggregate queries (e.g., COUNT(*)) will be the slowest. If you send me a sample query, I can take a look.

When high performance loading and fast aggregate queries are important, OLAP is much better, sometimes several orders of magnitude. It uses no indexes and columnar compression, so updates generate far fewer mutations, hence load rates are much higher. In queries, OLAP can scan millions of objects per second. OLAP of course requires that data can be organized into shards.
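The "columnar compression" point can be illustrated with a toy example. This is not Doradus's actual codec, just a sketch of run-length encoding, one classic columnar technique: when a column's values repeat (as dimension columns often do), runs collapse to (value, count) pairs, which is part of why a columnar layout generates far fewer, larger mutations than per-field inverted-index updates:

```java
import java.util.ArrayList;
import java.util.List;

public class RunLength {
    // Toy run-length encoding over one column: consecutive repeated values
    // collapse into {value, count} pairs.
    static List<int[]> encode(int[] column) {
        List<int[]> runs = new ArrayList<>();
        for (int v : column) {
            if (!runs.isEmpty() && runs.get(runs.size() - 1)[0] == v) {
                runs.get(runs.size() - 1)[1]++;   // extend the current run
            } else {
                runs.add(new int[] {v, 1});       // start a new run
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        // 8 stored values compress down to 3 runs.
        System.out.println(encode(new int[] {1, 1, 1, 2, 2, 3, 3, 3}).size());
    }
}
```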

If your data is time-stamped, immutable, and doesn't require links, the new Logging service is even faster. It doesn't require shards, and it loads and queries data even faster than OLAP. I can point out some links for more information if you like.

dzlab commented 9 years ago

My data is timestamped and immutable: I have a set of dimensions and metrics on which I want to do analytics (an OLAP workload). I have a dataset of around 0.3M rows with 93 columns each. I submit batches of 10K JSON documents; each request takes around 33.5s (mean). A DISTINCT query GET http://localhost:1123/app_name/table_name/_aggregate?m=DISTINCT(field_name) takes around.

I explicitly set the storage-service option to OLAPService in the app schema that I submit to Doradus. But when I check my app schema on Doradus (i.e. GET http://localhost:1123/_applications), I see the storage service set to SpiderService! I don't know why (maybe the server was not started with the OLAPService enabled; what's the default behaviour?), but this is definitely why the inserts/queries are so slow.
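For comparison, a sketch of the relevant portion of a JSON schema with the option set. The exact shape should be checked against the Doradus docs; the names below simply mirror the XML _applications output shown earlier in this thread:

```json
{"BIM": {
  "key": "BIM_key",
  "options": {
    "StorageService": "OLAPService",
    "AutoTables": "true"
  },
  "tables": {
    "sample": {}
  }
}}
```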

What about the Logging service? It's not mentioned in the wiki; how do I use it?

RandyGuck commented 9 years ago

It sounds like the OLAP or Logging service would work much better for you. The Logging service is brand new and I'm still working on wiki pages/tutorials, but there is a PDF document for it located here: https://github.com/dell-oss/Doradus/blob/master/docs/Doradus%20Logging%20Database.pdf

dzlab commented 9 years ago

I'm trying to understand how Doradus stores its data in Cassandra. It looks like it creates a single SSTable with only a few row keys (36); I expected around 0.3M, as this is the size of my dataset. Also, it's not using any memtable, nor does it use bloom filters! I wonder how queries/aggregations can be fast then.

$ /bin/nodetool --host localhost cfstats
Keyspace: Doradus
    Read Count: 0
    Read Latency: NaN ms.
    Write Count: 0
    Write Latency: NaN ms.
    Pending Flushes: 0
        Table: Applications
        SSTable count: 1
        Space used (live): 8164
        Space used (total): 8164
        Space used by snapshots (total): 0
        Off heap memory used (total): 23
        SSTable Compression Ratio: 0.0
        Number of keys (estimate): 1
        Memtable cell count: 0
        Memtable data size: 0
        Memtable off heap memory used: 0
        Memtable switch count: 0
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 0
        Local write latency: NaN ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 16
        Bloom filter off heap memory used: 8
        Index summary off heap memory used: 15
        Compression metadata off heap memory used: 0
        Compacted partition minimum bytes: 3312
        Compacted partition maximum bytes: 3973
        Compacted partition mean bytes: 3973
        Average live cells per slice (last five minutes): 0.0
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0

        Table: OLAP
        SSTable count: 1
        Space used (live): 19850466
        Space used (total): 19850466
        Space used by snapshots (total): 0
        Off heap memory used (total): 153
        SSTable Compression Ratio: 0.0
        Number of keys (estimate): 36
        Memtable cell count: 0
        Memtable data size: 0
        Memtable off heap memory used: 0
        Memtable switch count: 0
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 0
        Local write latency: NaN ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 56
        Bloom filter off heap memory used: 48
        Index summary off heap memory used: 105
        Compression metadata off heap memory used: 0
        Compacted partition minimum bytes: 61
        Compacted partition maximum bytes: 654949
        Compacted partition mean bytes: 571379
        Average live cells per slice (last five minutes): 0.0
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0

        Table: Tasks
        SSTable count: 1
        Space used (live): 5128
        Space used (total): 5128
        Space used by snapshots (total): 0
        Off heap memory used (total): 36
        SSTable Compression Ratio: 0.0
        Number of keys (estimate): 2
        Memtable cell count: 0
        Memtable data size: 0
        Memtable off heap memory used: 0
        Memtable switch count: 0
        Local read count: 0
        Local read latency: NaN ms
        Local write count: 0
        Local write latency: NaN ms
        Pending flushes: 0
        Bloom filter false positives: 0
        Bloom filter false ratio: 0.00000
        Bloom filter space used: 16
        Bloom filter off heap memory used: 8
        Index summary off heap memory used: 28
        Compression metadata off heap memory used: 0
        Compacted partition minimum bytes: 36
        Compacted partition maximum bytes: 258
        Compacted partition mean bytes: 150
        Average live cells per slice (last five minutes): 0.0
        Maximum live cells per slice (last five minutes): 0
        Average tombstones per slice (last five minutes): 0.0
        Maximum tombstones per slice (last five minutes): 0

RandyGuck commented 9 years ago

OLAP uses columnar storage and various compression techniques to store data very compactly, so it doesn't use much disk. Here are some links to presentations that provide a little more insight on how OLAP works:

If you download the slides, the notes on each slide provide extra info. Hope this helps!