dzlab commented 9 years ago

I'm trying to use the Doradus server. I've set up the server correctly (it's up and running), but I have a problem when I try to use the doradus-client to push a TSV file:

When I launch the client I don't see a problem in the output, but no data is actually pushed:

Process finished with exit code 0

But when I check the logs from the doradus-server, I only see the creation of schema:

17:32:48,342 INFO SchemaService: Defining application: BIM
17:32:48,344 INFO CassandraSchemaMgr: Creating ColumnFamily: Doradus:BIM_sample
17:32:48,454 INFO CassandraSchemaMgr: Creating ColumnFamily: Doradus:BIM_sample_Terms

When I check the urls, there is nothing:

GET http://localhost:1123/_tasks
GET http://localhost:1123/_olapp
Applications    Shards  Tables

The application is actually created:

GET http://localhost:1123/_applications
  <application name="BIM">
    <option name="StorageService">SpiderService</option>
    <option name="AutoTables">true</option>
    <table name="sample"/>

But no docs:

GET http://localhost:1123/BIM/sample/_query?q=*

In debug mode, when I check the BatchResult, I see 1000 ObjectResult entries that all have the following error: Invalid field name: .... This field is a Double, and it is correctly set in the JSON schema as well as in the CSV file header. Are there any naming constraints on scalar fields?

RandyGuck commented 9 years ago

Sorry for the late reply--just got back from vacation. You're right that the CSVLoader only handles comma separators, and it requires a first row with column names. But you can set the application name via the "-app " parameter.

Since you're not explicitly defining any fields, every field will be loaded as text, so every value should be accepted. However, field (column) names must follow identifier rules: first character must be a letter; all other characters must be letters or digits or underscores; names are case-sensitive.
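
The naming rule above can be checked against a file's header row before loading. The following is a minimal sketch, not part of Doradus or CSVLoader: the function name `invalid_field_names` and the sample header are hypothetical, and restricting "letter" to ASCII letters is an assumption.

```python
import csv
import io
import re

# Rule as described in the comment above: the first character must be a
# letter, and every other character a letter, digit, or underscore.
# Assumption: "letter" means an ASCII letter.
FIELD_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def invalid_field_names(header_line, delimiter=","):
    """Return the column names in a header row that break the naming rule."""
    names = next(csv.reader(io.StringIO(header_line), delimiter=delimiter))
    return [name for name in names if not FIELD_NAME.fullmatch(name)]


# Hypothetical header: three of the five names violate the rule.
print(invalid_field_names("id,price_usd,2nd_price,unit price,_hidden"))
# prints ['2nd_price', 'unit price', '_hidden']
```

Running a check like this on the first line of the input file would show which column names are likely to trigger the Invalid field name error.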

If that doesn't explain what's happening, post a few lines of a typical input file and I'll see if I can debug it.

dzlab commented 9 years ago

Thanks for the reply, but what about the misleading log message CSVLoader: ...loaded 10000 records. when nothing was actually uploaded because of the Invalid field name: ... error (which should be logged)?

Also, I'm seeing really bad performance (several minutes, I think around 30 min) when ingesting a dataset of 0.3M docs, each with 93 columns (no links, only scalar fields). I've split the data into batches of 10k docs and use finagle-http (I have a Scala client) to send the requests (JSON). Queries are also slow: a distinct aggregate on one attribute takes around 40 (both Cassandra and the Doradus server run on a MacBook).

P.S. Hope you had a great time.

RandyGuck commented 9 years ago

I just fixed one issue with the CSVLoader: when one object in the batch fails, the overall batch status is set to "warning", and warning-only batches weren't getting reported. If there's an invalid field name or value, you should see CSVLoader report these now.

The misleading progress reporting (...loaded xxx records) is also fixable with more work. The problem is that this log message is generated by the main thread as it parses and queues reports for worker threads. At the time of reporting, it doesn't know how many records actually succeeded. But this could be changed to something like "...queued xxx records, yyy loaded successfully." I'll take a look at that next week.
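
To illustrate the idea of reporting queued and loaded counts separately, here is a rough sketch in Python. It is not the CSVLoader code: `LoadProgress`, `send_batch`, `worker`, and `load` are hypothetical names, and `send_batch` is a stand-in that only simulates a server response by reading an `ok` flag on each record.

```python
import queue
import threading


class LoadProgress:
    """Thread-safe counters for records queued vs. records actually loaded."""

    def __init__(self):
        self._lock = threading.Lock()
        self.queued = 0
        self.loaded = 0
        self.failed = 0

    def add_queued(self, n):
        with self._lock:
            self.queued += n

    def add_result(self, ok, bad):
        with self._lock:
            self.loaded += ok
            self.failed += bad

    def message(self):
        with self._lock:
            return (f"...queued {self.queued} records, "
                    f"{self.loaded} loaded successfully, {self.failed} failed")


def send_batch(batch):
    # Hypothetical stand-in for the real upload: a record counts as loaded
    # when its "ok" flag is true. A real client would inspect the response.
    ok = sum(1 for record in batch if record["ok"])
    return ok, len(batch) - ok


def worker(q, progress):
    # Workers report per-batch results, so the totals reflect what succeeded.
    while True:
        batch = q.get()
        if batch is None:  # sentinel: no more batches
            break
        progress.add_result(*send_batch(batch))


def load(batches, n_workers=2):
    q = queue.Queue()
    progress = LoadProgress()
    threads = [threading.Thread(target=worker, args=(q, progress))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for batch in batches:  # the main thread only knows what it has queued
        q.put(batch)
        progress.add_queued(len(batch))
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()
    return progress


result = load([[{"ok": True}] * 3, [{"ok": False}, {"ok": True}]])
print(result.message())
# prints ...queued 5 records, 4 loaded successfully, 1 failed
```

The design point is that the main thread increments only the queued counter, while the success and failure counters are updated by the workers once each batch result is known.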

dzlab commented 9 years ago

Will these changes be available in the next release? And what about the second part of my previous comment, on how Doradus is performing?

RandyGuck commented 9 years ago

These changes are in the master branch, so you can download and build it if you like. Otherwise, they will be in the next release; however, we just created the v4.2 release and probably won't create another one for a while.

As for the performance issues: Spider databases are OK for moderate data volumes (millions of objects, but not billions) and moderate query requirements. Spider uses fully inverted indexing, so the cost of each update grows in proportion to the number of fields indexed. Text fields generate the most mutations since they are parsed into terms. What kind of load rate (objs/sec) are you seeing? The queries that Spider is best at are "needle in the haystack" queries, such as finding objects where a field contains some term or falls in some range. Aggregate queries (COUNT(*)) will be the slowest queries. If you send me a sample query, I can take a look.

When high-performance loading and fast aggregate queries are important, OLAP is much better, sometimes by several orders of magnitude. It uses columnar compression and no indexes, so updates generate far fewer mutations, hence load rates are much higher. In queries, OLAP can scan millions of objects per second. OLAP of course requires that data can be organized into shards.

If your data is time-stamped, immutable, and doesn't require links, the new Logging service is even faster. It doesn't require shards, and it loads and queries data even faster than OLAP. I can point out some links for more information if you like.

dzlab commented 9 years ago

My data is timestamped and immutable. I have a set of dimensions and metrics on which I want to do analytics (an OLAP workload). I have a dataset of around 0.3M rows with 93 columns each. I submit batches of 10K JSON documents, and each request takes around 33.5s (mean). A distinct query GET http://localhost:1123/app_name/table_name/_aggregate?m=DISTINCT(field_name) takes around.
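
For reference, the load rate implied by these figures can be worked out as below. This is back-of-the-envelope arithmetic only: it assumes the batches are sent one after another and that every batch takes the mean time, and the function names are hypothetical.

```python
import math


def load_rate(objects_per_batch, seconds_per_batch):
    """Objects loaded per second for one batch."""
    return objects_per_batch / seconds_per_batch


def total_load_seconds(total_objects, objects_per_batch, seconds_per_batch):
    """Total time if batches are sent sequentially, each at the mean time."""
    batches = math.ceil(total_objects / objects_per_batch)
    return batches * seconds_per_batch


rate = load_rate(10_000, 33.5)
total = total_load_seconds(300_000, 10_000, 33.5)

print(round(rate, 1))       # prints 298.5 (objs/sec)
print(round(rate * 93))     # prints 27761 (field values/sec at 93 columns)
print(total, total / 60)    # prints 1005.0 16.75 (seconds, minutes)
```

So the numbers above correspond to roughly 300 objs/sec, or about 28K field values per second.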

I've explicitly set the storage service option to OLAPService in the app schema that I submit to Doradus. But when I check my app schema on Doradus (i.e. GET http://localhost:1123/_applications), I see the storage service set to SpiderService! I don't know why (maybe the server was not started with the OLAPService enabled; what's the default behaviour?), but this is definitely why the inserts and queries are so slow.

What about the Logging service? It's not mentioned in the wiki. How do I use it?
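
A check like this can be scripted against the body returned by GET /_applications. The sketch below parses a sample modelled on the XML shown earlier in this thread; the enclosing `<applications>` element and the closing tags are assumptions, since the output above was truncated, and `storage_services` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def storage_services(schema_xml):
    """Map each application name to the value of its StorageService option."""
    root = ET.fromstring(schema_xml)
    result = {}
    # iter() also matches the root, so a bare <application> document works too.
    for app in root.iter("application"):
        for option in app.findall("option"):
            if option.get("name") == "StorageService":
                result[app.get("name")] = (option.text or "").strip()
    return result


# Sample modelled on the response shown earlier; the wrapper element and
# closing tags are assumed.
SAMPLE = """<applications>
  <application name="BIM">
    <option name="StorageService">SpiderService</option>
    <option name="AutoTables">true</option>
    <table name="sample"/>
  </application>
</applications>"""

print(storage_services(SAMPLE))
# prints {'BIM': 'SpiderService'}
```

Comparing this value with the one in the submitted schema makes the mismatch visible before any data is loaded.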

RandyGuck commented 9 years ago

It sounds like the OLAP or Logging service would work much better for you. The Logging service is brand new and I'm still working on wiki pages/tutorials, but there is a PDF document for it located here: https://github.com/dell-oss/Doradus/blob/master/docs/Doradus%20Logging%20Database.pdf

dzlab commented 9 years ago

I'm trying to understand how Doradus stores its data in Cassandra. It looks like it creates a single SSTable with only a few row IDs (36), whereas I expected around 0.3M, since that is the size of my dataset. Also, it uses neither a memtable nor bloom filters! I wonder how queries/aggregations can be fast then.

RandyGuck commented 9 years ago

OLAP uses columnar storage and various compression techniques to store data very compactly, so it doesn't use much disk. Here are some links to presentations that provide a little more insight on how OLAP works:

If you download the slides, the notes on each slide provide extra info. Hope this helps!