br1ghtyang / asterixdb

Automatically exported from code.google.com/p/asterixdb

Load hangs on cluster with large data size #534

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
Asterix revision used: Beta release

Setting: 7 nodes, 4 IO devices each (28-way partitioned, on the Asterix cluster)
Goal: Loading the facebookUsers dataset, which is several hundred GBs in size

Observation: The system starts the loading process normally: external sort runs 
are generated in the temp space. After a while, the system becomes idle. Using 
the top command, no outstanding activity is running on the NCs, while the NC/CC 
logs are all clean, in the sense that there are no exceptions there.

How to reproduce: The same setting (7 NCs, 4 IO devices each) was repeated on a 
single machine, with an attempt to load ~4GB of data split into 28 chunks. The 
same problem is seen. (Trying fewer NCs and a smaller data size, both in total 
size and in number of chunks, led to a successful load, hence this problem 
*could* be related to the number of NCs and/or the data size.)

Below you can see the DDL and the load statement. The cluster configuration 
file is attached (make sure to change the paths in case you want to use them):

create dataverse SocialNetworkData;
use dataverse SocialNetworkData;

create type EmploymentType as open {

};

create type FacebookUserType as open {
id: int32,
id-copy: int32,
user-since-copy: datetime
};

create dataset FacebookUsers(FacebookUserType)
primary key id
hints(cardinality=12000000);

load dataset FacebookUsers using
"edu.uci.ics.asterix.external.dataset.adapter.NCFileSystemAdapter"
(("path"="localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu1.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu2.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu3.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu4.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu5.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu6.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu7.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu8.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu9.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu10.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu11.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu12.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu13.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu14.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu15.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu16.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu17.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu18.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu19.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu20.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu21.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu22.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu23.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu24.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu25.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu26.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu27.adm,
localhost:///home/getafix/pouria/Asterix/load-issue/raw-data/fbu28.adm"),
("format"="adm"));

Original issue reported on code.google.com by pouria.p...@gmail.com on 19 Jun 2013 at 12:13


GoogleCodeExporter commented 8 years ago
Pouria, 

It seems the issue is in one of the file names that you are using. 
Specifically, you have a typo in the name of file number 27, "fbu27..adm", 
which exists on the rainbow cluster.
As you can see, you have one additional dot. If you remove it, the load succeeds.

Now, why the system was in a hung state and didn't report an error is a 
different issue. It seems the exception is swallowed somewhere.

Re-assigning this to Raman since this is now related to the external data scan 
operator.

Original comment by salsuba...@gmail.com on 19 Jun 2013 at 6:34

GoogleCodeExporter commented 8 years ago
Sattam,

That is just a typo that happened when I copied the DDL into this issue. If you 
fix that *typo*, it still does not work. For that typo, the system reports the 
error properly as "file not found" and there is no hang.

Original comment by pouria.p...@gmail.com on 19 Jun 2013 at 7:00

GoogleCodeExporter commented 8 years ago
Copying this from the email thread for documentation purposes:
Then I'm not able to reproduce it.
When I fix the typo, the load succeeds. I'm using the same settings that you 
indicated in the issue description (7 NCs, 4 IO devices, and 28 file splits). 
Actually, I'm using the same XML configuration file except for changing the paths.

Original comment by salsuba...@gmail.com on 19 Jun 2013 at 7:13

GoogleCodeExporter commented 8 years ago
BTW, the typo is in the actual file name (the one that exists on the rainbow 
cluster at this location: /data/pouria/load-issue-data/fbu/fbu27..adm) and not 
in the DDL statement. The DDL statement has no typos.

Are we on the same page?

Original comment by salsuba...@gmail.com on 19 Jun 2013 at 7:16

GoogleCodeExporter commented 8 years ago
Based on comments #1 and #2 and the last observation from Jarod:
> This time all machines have enough space for the data. Each machine has
> 2.7TB of available space. The experiment started at 9:48am, and as of now
> the loading is still on-going, but all node controllers have no CPU
> activity (so I assume that the load has already failed). I tried to get
> the number of records in the dataset, and it returns 0.

it doesn't seem to be an issue with the path specified in the AQL. As reported:
(i) correcting the path does not get rid of the issue;
(ii) an incorrect path leads to an exception stating "invalid path";
(iii) having sufficient space on disk still leaves the issue in place;
(iv) the issue is consistently reproducible on a single node and on the Asterix cluster.

@Sattam, I believe the assertion in #1 is invalid. I would recommend debugging 
it on either the cluster or Pouria's machine, where the issue is consistently 
reproducible.

I am sensing another issue here: if the path is invalid, the load statement 
does not give an exception but runs into a hung state. I am investigating the 
load statement with regard to incorrect paths.

Original comment by RamanGro...@gmail.com on 19 Jun 2013 at 7:23

GoogleCodeExporter commented 8 years ago
OK, here is what is happening:

Pouria is using an accurate estimate of the dataset cardinality, which is 337M 
records. This hint is used by the indexes to create the bloom filters. Each 
record requires approximately 10 bits, so the total size of the bloom filters 
will be around 400MB. Note that all the pages of the bloom filters will be 
pinned in the buffer cache while the index is being loaded or open. The current 
default size of the buffer cache in the XML configuration file is 32MB :-)

Now the issue is that we overestimate the size of the dataset in each partition 
by using the total dataset cardinality, which is overkill. Thus if we have 7 
NCs where each NC has 4 IO devices, each bloom filter will be 400MB in size 
where it should be 14MB.

Obviously 400MB >> 32MB

This makes the buffer cache spin looking for a victim page to flush to disk, 
but it cannot find any; therefore, the bulk load cannot progress.

To summarize:
1) We should use a better estimate for the cardinality of the dataset per 
partition, based on the number of NCs and the number of IO devices.
2) We should change the default size of the buffer cache from 32MB to a higher 
value.
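
For illustration, here is the back-of-the-envelope arithmetic behind those 
figures as a small Java sketch. All constants are taken from this comment 
(337M records, ~10 bits per record, 7 NCs x 4 IO devices, a 32MB buffer 
cache); it is not AsterixDB's actual sizing code.

// All figures come from the comment above; nothing here is AsterixDB code.
public class BloomFilterSizing {
    public static void main(String[] args) {
        long cardinality = 337_000_000L; // dataset-wide cardinality hint
        int bitsPerRecord = 10;          // approximate bloom filter cost per record
        int partitions = 7 * 4;          // 7 NCs x 4 IO devices = 28 partitions
        double mib = 1024.0 * 1024.0;

        // Current behavior: every partition sizes its filter with the TOTAL
        // cardinality, so each of the 28 filters pins ~400MB of pages.
        double perFilterToday = cardinality * bitsPerRecord / 8 / mib;

        // Proposed fix (point 1 above): divide the cardinality across
        // partitions before sizing, giving ~14MB per filter.
        double perFilterFixed = cardinality / (double) partitions * bitsPerRecord / 8 / mib;

        System.out.printf("filter per partition today: ~%.0f MiB (cache is 32 MiB)%n", perFilterToday);
        System.out.printf("filter per partition fixed: ~%.1f MiB%n", perFilterFixed);
    }
}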

Original comment by salsuba...@gmail.com on 20 Jun 2013 at 11:19

GoogleCodeExporter commented 8 years ago

Original comment by salsuba...@gmail.com on 21 Jun 2013 at 4:58

GoogleCodeExporter commented 8 years ago
I think we have a space management "design flaw" here.  IMO we should not put 
these bits in the buffer pool - the buffer pool is intended to serve as a cache 
for pages from disk that can come in and go out more fluidly.  These bits are 
analogous (again IMO) to the in-memory components, and should come out of that 
storage pool - part of the memory "price" of an LSM index is its in-memory 
component, which is pinned for the duration while it's open, and this seems 
like another memory price component.

Original comment by dtab...@gmail.com on 22 Jun 2013 at 8:23

GoogleCodeExporter commented 8 years ago
Well, it depends on how you look at it.
The purpose of the bloom filter is to avoid going to disk when answering a
query. The bloom filter pages will be kicked out of the buffer cache when the
index is closed. As long as the index is open, the assumption is that the
index is going to be searched, thus we keep the filter's pages pinned in the
cache.
So I don't really see how this breaks the purpose of the buffer pool.

Besides that, if you think about it, the in-memory component is not really a
cache, in the sense that we maintain inserts in memory for the purpose of
having high write throughput and not for the purpose of serving search queries
faster. The bloom filter's purpose is totally different: it is used to make
search queries run faster.
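
To make the contrast concrete, here is a schematic of the read path being
described, with hypothetical types (this is not AsterixDB's actual LSM search
code):

// Hypothetical types for illustration only; the point is that the pinned
// filter pages pay for themselves by letting a search skip disk probes.
interface BloomFilter { boolean mightContain(byte[] key); } // never a false negative
interface BTree { byte[] lookup(byte[] key); }              // the actual disk I/O

class DiskComponent {
    BloomFilter filter;
    BTree btree;

    byte[] search(byte[] key) {
        if (!filter.mightContain(key)) {
            return null; // key is definitely absent: no disk I/O at all
        }
        return btree.lookup(key); // possible false positive: must probe disk
    }
}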

That being said, as Vinayak pointed out yesterday, we should have the
ability to do something when we hit this scenario, by extending the size of
the buffer cache, or maybe by throwing an exception if we keep spinning
looking for a victim page.
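
For instance, a bounded victim search might look something like the sketch
below (hypothetical code, not the real buffer cache; the retry limit is an
assumption):

// Hypothetical sketch: bound the victim search so that an impossible
// eviction fails fast instead of hanging the bulk load forever.
class VictimSearch {
    private static final int MAX_ROUNDS = 1_000; // assumed cutoff, tunable

    interface Page { boolean isPinned(); }

    static Page findVictim(java.util.List<Page> pages) {
        for (int round = 0; round < MAX_ROUNDS; round++) {
            for (Page p : pages) {
                if (!p.isPinned()) {
                    return p; // evictable page found
                }
            }
            Thread.yield(); // give other threads a chance to unpin pages
        }
        // Every page stayed pinned across all scans, e.g. because the pinned
        // bloom filter pages alone exceed the cache size.
        throw new IllegalStateException("No evictable page after " + MAX_ROUNDS
                + " scans; the pinned working set likely exceeds the buffer cache.");
    }
}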

Does that make sense to you?

Sattam

Original comment by salsuba...@gmail.com on 22 Jun 2013 at 9:09

GoogleCodeExporter commented 8 years ago

Original comment by salsuba...@gmail.com on 26 Jun 2013 at 5:40