Open romulogoncalves opened 6 years ago
Hm, if you'll have a smaller file to reproduce this issue with - it would be really helpful.
The issue is that it only happens with big files. I will try to debug it a bit more.
Could be some memory allocation issue then.
@pomadchin
When we run on a small file we do not get a crash. However, if we check the number of points returned in the metadata, it does not reflect the crop output:
{
  "metadata":
  {
    "filters.crop":
    {
    },
    "readers.las":
    {
      "comp_spatialreference": "",
      "compressed": true,
      "count": 24487575,

It should be 523787.
To obtain this result we use the RDD defined above and we do:
rdd_laz.map { case (h, i) => h.metadata }.collect().foreach(println)
In case we write the result of the pipeline into a local file its metadata has the correct number of points (that's how we know the crop worked out).
We think the metadata in the RDD should also be updated when we do a crop. Do you agree?
@pomadchin I have isolated the code and now it should be easy to debug the issue. I am using an Ubuntu virtual machine with 4 cores, 32GB of main memory, and 48GB of swap.
The data can be obtained from here (it has around 500M points):
wget https://geodata.nationaalgeoregister.nl/ahn3/extract/ahn3_laz/C_25GN2.LAZ
Then the following code was executed in a Jupyter notebook running a Scala kernel. The code is a light version of your code from PointCloudInputFormat.scala:
import io.pdal.pipeline.{CropFilter, LasRead, LasWrite}
import io.pdal.{Pipeline, PointView}
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConversions._
import java.util

val compression_type: Option[String] = Option("laszip")
val polygon_str: Option[String] = Option("POLYGON ((128000.0 479000.0, 128000.0 485000.0, 136000.0 485000.0, 136000.0 479000.0, 128000.0 479000.0))")

val pipelineExpr = LasRead("/data/local/home/ecolidar/C_25GN2.laz", compression = Option("laszip"), spatialreference = Option("EPSG:4326")) ~ CropFilter(polygon = polygon_str)
val pipeline = pipelineExpr.toPipeline
pipeline.execute()

// If a filter extent is set, don't actually load points.
val (pointViewIterator: util.Iterator[PointView], disposeIterator): (util.Iterator[PointView], () => Unit) = {
  val pvi = pipeline.getPointViews()
  (pvi.asInstanceOf[util.Iterator[PointView]], pvi.dispose _)
}

// conversion to list to load everything into JVM memory
val pointClouds = pointViewIterator.asInstanceOf[util.Iterator[PointView]].toList.map { pointView =>
  val pointCloud = pointView.getPointCloud()
  pointView.dispose()
  pointCloud
}.iterator
The pipeline execution consumed all memory plus 1.5GB of swap space. The issue is with the conversion to a list to load everything into JVM memory: while doing it the JVM crashes. It happens in the toList.map, as reported in the log file attached to the first message of this issue. We see the machine swapping out up to 6GB and then the JVM crashes. Before the conversion to a list the memory had not been freed, which is strange because we did a filters.crop, and 32GB looks like too much for 15M points.
The answer in your case is that there is nothing smart to do about it; the obvious workarounds in this case would be:
It doesn't look like too much: on the toList conversion step we unpack compressed points stored in C memory and copy everything into JVM memory (also it won't free the C memory until the copy into JVM memory finishes). This operation uses lots of memory: https://github.com/PDAL/java/blob/master/core/src/main/scala/io/pdal/PointCloud.scala#L45
Also a possible workaround is to think about better structures to store points, but the real solution would be number 2.
P.S. In your code you don't need to call .asInstanceOf[util.Iterator[PointView]] every time, as it is already an instance of util.Iterator: https://github.com/PDAL/java/blob/master/core/src/main/scala/io/pdal/PointViewIterator.scala#L38
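The iterator-based alternative to toList mentioned above can be illustrated without PDAL: mapping over an Iterator keeps only one element's buffer reachable at a time, while .toList forces all buffers to coexist in JVM memory. A minimal sketch, where the hypothetical packBuffer stands in for the expensive PointView-to-Array[Byte] copy:

```scala
// Sketch: lazy vs eager consumption of per-view byte buffers.
// `packBuffer` is a stand-in for PointView.getPointCloud().
object LazyVsEager {
  def packBuffer(sizeBytes: Int): Array[Byte] = Array.fill[Byte](sizeBytes)(0)

  // Eager: .toList materializes every buffer before mapping, so all of
  // them are alive in JVM memory at the same time.
  def totalBytesEager(views: Iterator[Int]): Long =
    views.toList.map(packBuffer).map(_.length.toLong).sum

  // Lazy: the Iterator yields one buffer at a time; previous buffers
  // become unreachable (and collectable) as soon as we move on.
  def totalBytesLazy(views: Iterator[Int]): Long =
    views.map(packBuffer).map(_.length.toLong).sum

  def main(args: Array[String]): Unit = {
    println(totalBytesEager(Iterator(10, 20, 30))) // 60
    println(totalBytesLazy(Iterator(10, 20, 30)))  // 60, but with lower peak memory
  }
}
```

Both compute the same result; only the peak memory differs, which is exactly the difference between the crashing toList code and a streaming consumption of the point views.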
@pomadchin I decided to dive a bit into this issue because I think lack of memory can't be the only reason. Take the following file:
wget https://geodata.nationaalgeoregister.nl/ahn3/extract/ahn3_laz/C_25FZ2.LAZ
It is 440MB in LAZ format and stores 91M points.
Executing exactly the same code as above, even without filters.crop, we get a JVM crash.
import io.pdal.pipeline.{CropFilter, LasRead, LasWrite}
import io.pdal.{Pipeline, PointView}
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConversions._
import java.util

val compression_type: Option[String] = Option("laszip")

val pipelineExpr = LasRead("/data/C_25FZ2.laz", compression = Option("laszip"), spatialreference = Option("EPSG:4326"))
val pipeline = pipelineExpr.toPipeline
pipeline.execute()

// If a filter extent is set, don't actually load points.
val (pointViewIterator: util.Iterator[PointView], disposeIterator): (util.Iterator[PointView], () => Unit) = {
  val pvi = pipeline.getPointViews()
  (pvi.asInstanceOf[util.Iterator[PointView]], pvi.dispose _)
}

// conversion to list to load everything into JVM memory
val pointClouds = pointViewIterator.asInstanceOf[util.Iterator[PointView]].toList.map { pointView =>
  val pointCloud = pointView.getPointCloud()
  pointView.dispose()
  pointCloud
}.iterator
The machine has 32GB of main memory, so we think there should be enough memory to process 91M points. The pipeline execution consumes 16% of the memory; the crash happens when 30% of memory is in use (as reported by top), during the copy to the JVM. The reported error is:
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f82a6eaaca2, pid=21764, tid=0x00007f82a83e6700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_171-b11) (build 1.8.0_171-b11)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.171-b11 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x6d5ca2] jni_SetByteArrayRegion+0xc2
#
# Core dump written. Default location: /home/ubuntu/core or core.21764
#
# An error report file with more information is saved as:
# /home/ubuntu/hs_err_pid21764.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
We think the issue is something other than lack of memory.
What's inside of /home/ubuntu/hs_err_pid21764.log?
You said even without a crop; without a crop there would be more points. Try to crop by a small area; I wonder what the result would be.
hs_err_pid21764.log is in the attachments.
For a large file, 500M points, we did a crop down to 15M and even after that it crashed. We assumed there was some intermediate being created, and that this was the reason the JVM crashed. Hence, we decided to test without cropping, just a clean read, and because of that we wrote "even without cropping". However, if we crop at most 70% of the points it does not crash. Sorry for the incomplete info.
How many dimensions does this LAZ file have? Have you tried to select only the interesting dimensions into JVM memory?
Have you tried to calculate how much memory such a 15 million element array should occupy?
It fills an Array[Byte], and the length of such an array can be calculated as: number of points * (sum of all dimension sizes).
It has 13 dimensions. Assuming 6 of them are doubles and the other 7 are integers (we are overestimating), a record set would be: (6*8 + 7*4) * 91M = 6.6GB, which is near 20% of our memory (it reported 16% after pipeline.execute because we are overestimating).
Maybe this info helps: these are the dimension ranges (min, max) reported by lasinfo:
X 136524103 139999999
Y 487500000 490667607
Z -5799 90873
intensity 0 65527
return_number 1 5
number_of_returns 1 5
edge_of_flight_line 0 1
scan_direction_flag 0 0
classification 1 26
scan_angle_rank -33 32
user_data 2 2
point_source_ID 2477 4902
gps_time 47755.344545 218015.275790
Can you try to select only X, Y, Z and check the behaviour? I am just suspicious of the array allocation method.
So it stores all points in a single byte array; the issue happens at this point: https://github.com/PDAL/java/blob/caa95c9c9f73f341fd08db99c86e3173cb223868/native/src/io_pdal_PointView.cpp#L169
You can try to insert some debugging print there, to print the buffer length it would try to allocate: bufSize.
Taking into account your input: (6 * 8 + 7 * 4) * 15M = 1.14B length. Probably it can't allocate an array of such a size.
Remark: the max array length is Int.MaxValue == 2147483647.
And 500M points can't fit this requirement at all: (6 * 8 + 7 * 4) * 500M = 38B.
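The arithmetic above can be checked with a small sketch in plain Scala, using the 76-byte overestimate (6 doubles + 7 four-byte ints) from this thread; fitsInJvmArray is a hypothetical helper, not part of PDAL-java:

```scala
// Packed points go into a single Array[Byte], whose length is an Int,
// so the total byte count must stay at or below Int.MaxValue.
object PackedBufferCheck {
  def bufSize(numPoints: Long, pointSize: Long): Long = numPoints * pointSize

  def fitsInJvmArray(numPoints: Long, pointSize: Long): Boolean =
    bufSize(numPoints, pointSize) <= Int.MaxValue.toLong

  def main(args: Array[String]): Unit = {
    val pointSize = 6L * 8 + 7 * 4                  // 76-byte overestimate per point
    println(bufSize(15000000L, pointSize))          // 1140000000: fits in one array
    println(fitsInJvmArray(15000000L, pointSize))   // true
    println(bufSize(500000000L, pointSize))         // 38000000000: cannot fit
    println(fitsInJvmArray(500000000L, pointSize))  // false
  }
}
```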
@pomadchin
A silly question: I have modified the sources you mentioned, then I did:
./sbt native/nativeCompile
And then I thought sbt assembly would be enough to get the new jars. However, I get a complaint:
java.lang.NoClassDefFoundError: io/circe/generic/extras/semiauto$
How should I compile to get the native code compiled?
You don't have to build assembly, just publish everything locally using this script: https://github.com/PDAL/java/blob/master/scripts/publish-local.sh
@pomadchin thanks, that did the job. However, I do not see anything being printed. The following line was added:
std::cerr << "Point size is: " << pointSize << std::endl;
With strace I see that the new libpdaljni.1.4.so is being used. I am also using the new jars; I started the scala console with the following classpath:
scala -J-Xmx12g -classpath "core/target/scala-2.11/pdal_2.11-1.7.0-SNAPSHOT.jar:core-scala/target/scala-2.11/pdal-scala-assembly-1.7.0-SNAPSHOT.jar:native/target/pdal-native-x86_64-linux-1.7.0-SNAPSHOT.jar"
It is strange that I do not see the print in the strace log file. Do we need to redirect the stdout from C++ to the Scala stdout? If yes, how do we do that?
Can you show me the code? Why haven't you tried to write logs into stdout?
With either cout or cerr we still do not get it printed.
The Scala code we use is the following:
import io.pdal.pipeline.{CropFilter, LasRead, LasWrite}
import io.pdal.{Pipeline, PointView}
import scala.collection.JavaConversions._
import java.util

val compression_type: Option[String] = Option("laszip")
val bounds_str: Option[String] = Option("([137000.0, 138000.0], [487500.0, 490667.0])")

val pipelineExpr = LasRead("/data/local/home/ecolidar/C_25FZ2.laz", compression = Option("laszip"), spatialreference = Option("EPSG:4326")) ~ CropFilter(bounds = bounds_str)
val pipeline = pipelineExpr.toPipeline
pipeline.execute()

// If a filter extent is set, don't actually load points.
val (pointViewIterator: util.Iterator[PointView], disposeIterator): (util.Iterator[PointView], () => Unit) = {
  val pvi = pipeline.getPointViews()
  (pvi.asInstanceOf[util.Iterator[PointView]], pvi.dispose _)
}

// conversion to list to load everything into JVM memory
val pointClouds = pointViewIterator.asInstanceOf[util.Iterator[PointView]].toList.map { pointView =>
  val pointCloud = pointView.getPointCloud()
  pointView.dispose()
  pointCloud
}.iterator
I meant the new C function body, sorry for the confusion.
And what's the new error, including the stacktrace, in the log file?
The C function looks like this:

JNIEXPORT jbyteArray JNICALL Java_io_pdal_PointView_getPackedPoint
  (JNIEnv *env, jobject obj, jlong idx, jobjectArray dims)
{
    PointViewRawPtr *pvrp = getHandle<PointViewRawPtr>(env, obj);
    PointViewPtr pv = pvrp->shared_pointer;
    PointLayoutPtr pl = pv->layout();
    // we need to calculate buffer size
    std::size_t pointSize = 0;
    DimTypeList dimTypes;
    // calculate result buffer length (for one point) and get dimTypes
    convertDimTypeJavaArrayToVector(env, dims, &pointSize, &dimTypes);
    // THE NEW LINE ADDED
    std::cout << "Point size is: " << pointSize << std::endl;
    char *buf = new char[pointSize];
    pv->getPackedPoint(dimTypes, idx, buf);
    jbyteArray array = env->NewByteArray(pointSize);
    env->SetByteArrayRegion(array, 0, pointSize, reinterpret_cast<jbyte *>(buf));
    delete[] buf;
    return array;
}
And strace.zip was obtained like this:
strace -o strace.out -ff
The .so file was installed to /usr/lib/libpdaljni.1.4.so, and in the strace output you can see it picked up that one.
Ok, and what's in hs_err_pid.log? You may try to kill the process after printing the length.
@pomadchin I injected a print statement in the pipeline Execute and it was printed out, which means I am running the program correctly.
I added a print statement where we get the PointSize, which is here:
https://github.com/PDAL/java/blob/caa95c9c9f73f341fd08db99c86e3173cb223868/native/src/io_pdal_PointView.cpp#L77
That showed me that the print statement should have been added in GetPackedPoints instead of GetPackedPoint, i.e., we were printing in the wrong function. Now your request to print bufSize makes sense.
Anyway, here is the important output:
We are at convertDimTypeJavaArrayToVector
Array dims is: 13
Getting now point size for: double
We got it: 8
Getting now point size for: double
We got it: 16
Getting now point size for: double
We got it: 24
Getting now point size for: uint16_t
We got it: 26
Getting now point size for: uint8_t
We got it: 27
Getting now point size for: uint8_t
We got it: 28
Getting now point size for: uint8_t
We got it: 29
Getting now point size for: uint8_t
We got it: 30
Getting now point size for: uint8_t
We got it: 31
Getting now point size for: float
We got it: 35
Getting now point size for: uint8_t
We got it: 36
Getting now point size for: uint16_t
We got it: 38
Getting now point size for: double
We got it: 46
Point size is: 46
bufSize is: 4195761936
It does not look like a big allocation.
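As a side note on the number just printed: 4195761936 is larger than Int.MaxValue, so if the native size is ever narrowed to a 32-bit value (JNI's jsize is a 32-bit signed int) it wraps to a negative length, which could explain a hard SIGSEGV rather than a clean OutOfMemoryError. This is a speculation about the crash mechanism, but the wrap-around itself is checkable in plain Scala:

```scala
// 4195761936 bytes exceeds Int.MaxValue; narrowing it to a 32-bit signed
// int (as a jsize is) wraps it to a negative array length.
object OverflowCheck {
  def main(args: Array[String]): Unit = {
    val bufSize: Long = 4195761936L
    println(bufSize > Int.MaxValue.toLong) // true
    println(bufSize.toInt)                 // -99205360
  }
}
```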
Ah yes, sorry for the confusion.
Ok, reread this comment: bufSize should be <= Int.MaxValue, as we convert the C array into a Java Array[Byte], and the max length of such an array is limited by Int.MaxValue.
Int.MaxValue is only 2147483647; that's why 4195761936 looks like a big allocation in terms of number of elements. That's why there are two approaches: to use a different and much slower structure on the JVM side instead of the usual collections, or to implement this: https://github.com/geotrellis/geotrellis-pointcloud/issues/12
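A sketch of the first approach (a different structure on the JVM side): instead of one giant Array[Byte], split the logical buffer into chunks that each respect the array-length limit. ChunkedPointBuffer is hypothetical, not an existing PDAL-java API; the 46-byte point size is the one printed in the debug output above:

```scala
// Split `totalPoints` packed points into chunk sizes (counted in points)
// such that each chunk's byte size fits into a single Array[Byte].
object ChunkedPointBuffer {
  def chunkSizes(totalPoints: Long, pointSize: Long,
                 maxBytes: Long = Int.MaxValue.toLong): Seq[Long] = {
    val pointsPerChunk = maxBytes / pointSize
    Iterator.iterate(totalPoints)(_ - pointsPerChunk)
      .takeWhile(_ > 0)
      .map(remaining => math.min(remaining, pointsPerChunk))
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val chunks = chunkSizes(totalPoints = 500000000L, pointSize = 46L)
    println(chunks.length)                          // 11 chunks for 500M points
    println(chunks.sum)                             // 500000000: no points lost
    println(chunks.forall(_ * 46L <= Int.MaxValue)) // true: every chunk allocatable
  }
}
```

Each chunk could then be materialized as its own Array[Byte], at the cost of slower indexed access across chunk boundaries.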
Indeed, I recall that. I thought the stream approach of https://github.com/geotrellis/geotrellis-pointcloud/pull/11 was enough to solve that. Maybe we misunderstood it.
The best option would be to implement #12 to increase parallelism. But that for sure would take some time.
Thanks for all the feedback and help.
Yeah, #11 solved the problem of copying a big file into the local fs. PDAL can work only with files on a local fs while Hadoop doesn't care, so I decided to use streams here to copy files instead of common byte arrays, though it adds some constraints on this stream's usage: it should be used carefully and we definitely should know why it's necessary.
We will see if we can get something done for #12, but first we need to understand the code. We will get back to this issue in the coming days.
@pomadchin Looking at the code, we will need some time to get #12 done. Not sure if we will have time for that before July or even August.
From what we saw, PDAL-java, after executing the pipeline, would have to hand points over in batches to create the RDD records. Each batch has a different header.
The issue is that Geotrellis-pointcloud can't guess a priori how many points PDAL-java should put in each RDD record, because it does not know what the PDAL pipeline will do. We could pass an upper limit, but that's all. In this case we should get a list of headers and iterators to build the RDD records in Geotrellis-pointcloud.
It is important to notice that the generation of the RDD would be done by a single executor, because the file is downloaded and multiple downloads are not recommended; plus we have the pipeline execution, which should be done over the entire LAS/LAZ file and not portions of it.
Please correct me if I misunderstood something.
It turned out that PDAL supports direct reads from S3, see: https://github.com/geotrellis/geotrellis-pointcloud/pull/16; since PDAL 1.3 (?).
Maybe it makes sense to think about adding HDFS / Hadoop-compatible backend support to PDAL?
If so, then we can read only the headers / stats from files and make some sense of the desired distribution of points across partitions.
https://github.com/PDAL/PDAL/blob/master/vendor/arbiter/arbiter.cpp#L131-L168
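If headers/stats could be read cheaply from such a backend, the point count alone would be enough to plan the partitioning, since it bounds each record's packed buffer. A hypothetical sketch (PartitionPlanner is not part of PDAL-java or Geotrellis-pointcloud), reusing the 76-byte-per-point overestimate from earlier in the thread:

```scala
// Given the total point count from a LAS/LAZ header, estimate how many
// RDD records are needed so that each record's packed buffer fits into
// a single Array[Byte] (length <= Int.MaxValue).
object PartitionPlanner {
  def numPartitions(totalPoints: Long, pointSize: Long,
                    maxBytes: Long = Int.MaxValue.toLong): Int = {
    val pointsPerPartition = maxBytes / pointSize
    math.ceil(totalPoints.toDouble / pointsPerPartition.toDouble).toInt
  }

  def main(args: Array[String]): Unit = {
    // 500M points at 76 bytes per point need at least 18 records.
    println(numPartitions(500000000L, 76L)) // 18
  }
}
```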
The following read:
Leads to the following error:
The content of the log file hs_err_pid25838.log is in the attachments.
Any ideas or suggestions for such a crash? We have a machine with 32GB; the worker uses 30GB and the executor 30GB, that is, one executor per worker.
If we need to provide more information please let us know.