cloudius-systems / osv

OSv, a new operating system for the cloud.
osv.io

Loss of network on GCE #379

Closed 3 years ago

asias commented 10 years ago

Gleb reported:

> 1) How to reproduce the issue?
Run Cassandra on an OSv VM with 4 CPUs and 3.8 GB of memory.
Run ycsb from a Linux VM (https://github.com/cloudius-systems/osv/wiki/Benchmarking-Cassandra-and-other-NoSQL-databases-with-YCSB).
I usually ran workloadf with -threads 100 -p operationcount=1000000 -p recordcount=800000, so my load command line looks like this:
./bin/ycsb load cassandra-10 -threads 100 -p operationcount=1000000 -p recordcount=800000 -p hosts=osv-vm-ip -P workloads/workloadf -s
When you see ops per second drop to 0, check connectivity to the OSv VM with ping.

> 2) What have you tried so far?

I added debug output (patch attached). The debug output shows that when networking is lost, ping packets are still received by the affected VM.
FWIW it looks like with the patch the problem reproduces more often.
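
For reference, a minimal sketch of this reproduction loop, assuming the client is the Linux VM set up per the wiki page above; the guest address and the ycsb directory are placeholders, not values from the report:

```bash
#!/bin/bash
# Sketch of the reproduction flow described above: load data with YCSB against
# the OSv guest while pinging it from the same Linux client, so a drop to
# 0 ops/sec can be correlated with lost connectivity.
OSV_VM_IP=10.0.0.216   # placeholder -- replace with the OSv guest address

# Background connectivity probe; -D prints a timestamp before each reply line.
ping -D "$OSV_VM_IP" > ping.log 2>&1 &
PING_PID=$!

cd ~/ycsb-0.1.4
./bin/ycsb load cassandra-10 -threads 100 \
    -p operationcount=1000000 -p recordcount=800000 \
    -p hosts="$OSV_VM_IP" -P workloads/workloadf -s | tee ycsb.log

kill "$PING_PID"
```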
syuu1228 commented 10 years ago

I had a similar experience when I tried to connect httpserver (REST) to Ruby on Rails.

Ping works without issue, but TCP communication is unstable. I was still able to establish connections, though.

After I did git reset --hard c1c6054f2415f58f87b2f98db8aa45fb1fa7f3fa, it started working without problems.

Maybe this is related to "[osv] TCP retransmission with 'net: use rcu_hashtable in net channel classifier'" (#378)?
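
Purely as a sketch (nothing anyone in this thread actually ran): since reverting to that commit helps and the net-channel classifier change from #378 is the suspect, one way to pin down the offending commit would be to bisect between the known-good revision above and the master revision the issue was reported against (see the later comment):

```bash
# Sketch only: bisect between the known-good commit reverted to above and the
# master commit the issue refers to (985d4759..., per a later comment). The
# rebuild/test step is a placeholder for rebuilding the OSv image and
# re-running the YCSB load against the guest.
git bisect start
git bisect bad  985d4759ba4061404063d6f39dcc6cf145f87021   # TCP stalls reproduce here
git bisect good c1c6054f2415f58f87b2f98db8aa45fb1fa7f3fa   # reported as working
# For each revision git checks out: rebuild the image, rerun the YCSB load,
# then record the result with one of:
#   git bisect good    # guest stayed reachable over TCP
#   git bisect bad     # ops/sec dropped to 0 while ping still answered
git bisect reset
```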


vladzcloudius commented 10 years ago

The issue refers to master commit 985d4759ba4061404063d6f39dcc6cf145f87021.

vladzcloudius commented 10 years ago

I couldn't reproduce the network loss when following the instructions above. The client side was receiving a broken-pipe signal and the test was failing, but pings went through and the network was alive and kicking.

dorlaor commented 10 years ago

On Mon, Aug 4, 2014 at 2:15 PM, vladzcloudius notifications@github.com wrote:

It's been noticed that this test fills up the guest storage quite quickly (in about 1100-1800 seconds of the test) and eventually causes the following Cassandra assert:

You can limit the storage Cassandra consumes by using a smaller number of records but keep running lots of operations.

For example, this keeps running indefinitely: /bin/ycsb run cassandra-10 -threads 100 -p operationcount=100000000 -p recordcount=800000 -p hosts=10.0.0.216 -P workloads/workloadf -s

ERROR 12:06:11,457 Exception in thread Thread[CompactionExecutor:18,1,main]
FSWriteError in /var/lib/cassandra/data/usertable/data/usertable-data-tmp-jb-218-Filter.db
    at org.apache.cassandra.io.sstable.SSTableWriter$IndexWriter.close(SSTableWriter.java:478)
    at org.apache.cassandra.io.util.FileUtils.closeQuietly(FileUtils.java:212)
    at org.apache.cassandra.io.sstable.SSTableWriter.abort(SSTableWriter.java:304)
    at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:209)
    at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
    at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
    at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
    at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:198)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.FileNotFoundException: /var/lib/cassandra/data/usertable/data/usertable-data-tmp-jb-218-Filter.db (No space left on device)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:110)
    at org.apache.cassandra.io.sstable.SSTableWriter$IndexWriter.close(SSTableWriter.java:469)
    ... 13 more
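
For context, a rough estimate of why the disk fills up, assuming YCSB's default record layout of 10 fields of 100 bytes each (an assumption; the actual workloadf settings are not quoted in this thread):

```bash
# Back-of-the-envelope estimate of the raw data a load of 800000 records
# produces, assuming fieldcount=10 and fieldlength=100 bytes (YCSB defaults;
# the workloadf file used here is not shown, so treat this as an assumption).
RECORDS=800000
FIELDS=10
FIELD_BYTES=100
echo "$(( RECORDS * FIELDS * FIELD_BYTES / 1024 / 1024 )) MiB of raw user data"
# -> ~762 MiB before sstable, index and compaction overhead; compaction also
#    needs temporary space for the sstables being rewritten, which is why a
#    small guest disk can fill up within the 1100-1800 s window mentioned above.
```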


vladzcloudius commented 10 years ago


Somehow this didn't run for an "infinite" amount of time but only for the same ~200 seconds:

[vladz@instance-1 ycsb-0.1.4]$ ./bin/ycsb load cassandra-10 -threads 500 -p hosts=10.240.212.125 -P workloads/workloadf -p operationcount=100000000 -p recordcount=800000 -s
java -cp /home/vladz/ycsb-0.1.4/hbase-binding/conf:/home/vladz/ycsb-0.1.4/cassandra-binding/lib/cassandra-binding-0.1.4.jar:/home/vladz/ycsb-0.1.4/jdbc-binding/conf:/home/vladz/ycsb-0.1.4/nosqldb-binding/conf:/home/vladz/ycsb-0.1.4/core/lib/core-0.1.4.jar:/home/vladz/ycsb-0.1.4/infinispan-binding/conf:/home/vladz/ycsb-0.1.4/voldemort-binding/conf:/home/vladz/ycsb-0.1.4/gemfire-binding/conf com.yahoo.ycsb.Client -db com.yahoo.ycsb.db.CassandraClient10 -threads 500 -p hosts=10.240.212.125 -P workloads/workloadf -p operationcount=100000000 -p recordcount=800000 -s -load
YCSB Client 0.1
Command line: -db com.yahoo.ycsb.db.CassandraClient10 -threads 500 -p hosts=10.240.212.125 -P workloads/workloadf -p operationcount=100000000 -p recordcount=800000 -s -load
Loading workload...
Starting test.
0 sec: 0 operations;
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
10 sec: 65290 operations; 6512.07 current ops/sec; [INSERT AverageLatency(us)=74272.51]
20 sec: 110486 operations; 4518.7 current ops/sec; [INSERT AverageLatency(us)=112042.35]
30 sec: 156595 operations; 4610.44 current ops/sec; [INSERT AverageLatency(us)=107693.98]
40 sec: 200412 operations; 4381.26 current ops/sec; [INSERT AverageLatency(us)=114637.18]
50 sec: 244472 operations; 4405.56 current ops/sec; [INSERT AverageLatency(us)=113530.38]
60 sec: 277914 operations; 3343.87 current ops/sec; [INSERT AverageLatency(us)=149380.72]
70 sec: 334896 operations; 5698.2 current ops/sec; [INSERT AverageLatency(us)=87720.48]
80 sec: 376963 operations; 4206.28 current ops/sec; [INSERT AverageLatency(us)=108319.02]
90 sec: 409562 operations; 3259.57 current ops/sec; [INSERT AverageLatency(us)=167112.51]
100 sec: 452261 operations; 4269.9 current ops/sec; [INSERT AverageLatency(us)=116234.15]
110 sec: 498764 operations; 4649.84 current ops/sec; [INSERT AverageLatency(us)=108424.99]
120 sec: 534884 operations; 3611.64 current ops/sec; [INSERT AverageLatency(us)=136935.42]
130 sec: 577462 operations; 4257.8 current ops/sec; [INSERT AverageLatency(us)=118485.47]
140 sec: 615917 operations; 3845.12 current ops/sec; [INSERT AverageLatency(us)=129717.4]
150 sec: 641200 operations; 2528.05 current ops/sec; [INSERT AverageLatency(us)=198541.11]
160 sec: 676141 operations; 3494.1 current ops/sec; [INSERT AverageLatency(us)=143056.78]
170 sec: 705582 operations; 2943.81 current ops/sec; [INSERT AverageLatency(us)=156306.7]
180 sec: 742158 operations; 3657.23 current ops/sec; [INSERT AverageLatency(us)=147808.59]
190 sec: 787505 operations; 4534.25 current ops/sec; [INSERT AverageLatency(us)=108489.19]
192 sec: 800000 operations; 4718.66 current ops/sec; [INSERT AverageLatency(us)=93357.86]
[OVERALL], RunTime(ms), 192690.0

I tried putting the -p parameters both before and after -P workloads/workloadf; the result was the same.

However, this doesn't change the fact that the issue is a false alarm.

thanks, vlad
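
A possible explanation for the ~200 seconds, assuming standard YCSB behavior (not something stated in the thread): ycsb load inserts exactly recordcount records and ignores operationcount, which only bounds the run phase that the long-running example above used. A minimal sketch of the two invocations, reusing the host IP and directory from the log above:

```bash
# Load phase: bounded by recordcount (800000 inserts, then exit), so finishing
# at 800000 operations after ~192 s is the expected behavior.
cd ~/ycsb-0.1.4
./bin/ycsb load cassandra-10 -threads 100 \
    -p recordcount=800000 -p hosts=10.240.212.125 -P workloads/workloadf -s

# Run phase: bounded by operationcount; this is the invocation that can keep
# going for a very long time.
./bin/ycsb run cassandra-10 -threads 100 \
    -p operationcount=100000000 -p recordcount=800000 \
    -p hosts=10.240.212.125 -P workloads/workloadf -s
```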


wkozaczuk commented 3 years ago

I don't have the time or resources to try to reproduce it on GCE. But the last two comments by @vladzcloudius seem to indicate that the original issue was a "false" alarm. On top of that, Cassandra is no longer a key focus. So I am closing it.