Instead of seeking through the file to get the real file size,

const size_t file_size = io::raw_lseek (fd, 0, SEEK_END);
io::raw_lseek (fd, 0, SEEK_SET);

I recommend the following code instead (it asks the kernel for the inode metadata via stat()):
#include <sys/stat.h>   // for stat()

std::ofstream dest;      // destination stream (not used in this excerpt)

// read body stream
std::ifstream fs;
fs.open(in_file_name.c_str());
if (!fs.is_open() || fs.fail()) {
  PCL_SEGMENT_LOG_INFO << PCL_SEGMENT_FORMAT("Could not open file %s.", in_file_name.c_str());
  return -1;
}

// Ask the kernel for the inode metadata instead of seeking to the end of the stream.
// See the discussion in the mail archive: http://www.cpptips.com/fstream
// and https://stackoverflow.com/questions/11558447/retrieving-file-descriptor-from-a-stdfstream
struct stat attrib;
if (stat(in_file_name.c_str(), &attrib) < 0) {
  PCL_SEGMENT_LOG_INFO << PCL_SEGMENT_FORMAT("Could not find metadata of file %s", in_file_name.c_str()) << std::endl;
  return -1;  // without valid metadata, attrib.st_size is undefined
}
mem_size = (size_t) attrib.st_size;

fs.seekg(data_idx);      // jump to where the body starts
Hey @yiakwy. Please indulge my questions and comments as I try to understand what you're trying to accomplish.
From what you describe, you seem to be interested in splitting PCD files into multiple ones. The idea is then to use a streaming approach in order to continuously load chunks of data, which will then populate a PCL type, either PCLPointCloud2 or the plain old PointCloud. You already noticed that the PCD Reader/Writer makes use of memory mapping to quickly read/write data to the hard drive. This is a problem for you because you want to avoid loading all the data into memory.
At first sight I see an issue here: at some point there will need to exist one PointCloud or PCLPointCloud2 variable that holds all the data you've read in memory, even if you hack the reader to read from multiple files. None of these core types are designed to behave like a stream. Is this going to be a problem for you?
@SergioRAgostinho Hi, thanks for your comment. I have a class PCDSplitter and its implementation. An instance of the class holds a PCLPointCloud2 (I am processing rosbag data in PCD v.7 format) that only stores the header; it lets me dynamically generate a new header for each bucket, then flush the cache and write the buckets to disk sequentially or in a multi-threaded context (a rough sketch of the idea is below).
I am currently trying to hack the LZF algorithm so that I can decompose the compressed data into chunks without loading the whole data into memory.
I understand that the LZF-compressed body is laid out column-wise (field by field), much like the well-known columnar approach published by Google, and this is what really concerns me.
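For reference, here is a minimal sketch of that bucket idea, not the actual PCDSplitter code: a header-only PCLPointCloud2 acts as a template and a fresh header is generated per bucket. The helper name writeBucketHeader is made up, and I am assuming PCDWriter::generateHeaderBinary is publicly callable and that the DATA line is appended by the caller, which is how I remember writeBinary behaving.

#include <pcl/PCLPointCloud2.h>
#include <pcl/io/pcd_io.h>
#include <Eigen/Geometry>
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: emit the header for one bucket of `bucket_points` points,
// reusing the field layout of a header-only template cloud.
void writeBucketHeader (const pcl::PCLPointCloud2 &layout, std::size_t bucket_points,
                        const std::string &out_file)
{
  pcl::PCLPointCloud2 bucket = layout;      // copies fields/point_step; data stays empty
  bucket.width    = static_cast<std::uint32_t> (bucket_points);
  bucket.height   = 1;                      // splitting destroys any organized structure
  bucket.row_step = bucket.point_step * bucket.width;

  pcl::PCDWriter writer;
  std::ofstream os (out_file.c_str (), std::ios::binary);
  os << writer.generateHeaderBinary (bucket, Eigen::Vector4f::Zero (),
                                     Eigen::Quaternionf::Identity ())
     << "DATA binary\n";                    // the raw point bytes get appended afterwards
}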
I see... I'm not familiar at all with the LZF implementation so I can't really point you in any direction. Maybe someone else at @PointCloudLibrary/maintainers can chime in, but my guess is that no one is familiar with that code.
I feel that what you're trying to achieve doesn't really have much use to the library at this point, because we're not at a stage where we can handle a really large number of points. There are a number of methods which currently fail above certain size thresholds. Ultimately, handling extremely large data is something we want to have, but it is extremely low priority at the moment. Batch processing/streaming is not one of our goals right now.
I agree with your judgement regarding the priority of this work, but I do see certain cases where it could be used. I have also observed that Autoware (a Japanese autonomous-driving solution) implemented a grid data structure to split large PCD data into pieces.
I talked with our ADAS team. For good compression, they would like to store data in large files. However, our algorithms (including PCL) are suited to files of around tens of MB; loading a large file into memory can take several minutes. We are building a preprocessing pipeline so that we can asynchronously load data into memory by splitting, merging and downsampling on demand, much like what we already do on the web (think of map tiles; there is no web-tile framework for HD maps at the moment). On the web side we have already implemented an algorithm that splits any PCD file into chunks and streams them to the client.
That's why I brought the question to the community, to seek help from PCL members.
Marking this as stale due to 30 days of inactivity. It will be closed in 7 days if no further activity occurs.
We might revive this once we finish the migration to index_t AND the community shows interest again.
Closing in the meantime.
Background
I am implementing a PCD file splitter to split large PCD files (more than 1 GB) using PCL 1.8. I have several strategies to implement: streaming, multi-threading, async I/O, and mixtures of them.
I am having trouble implementing the splitter for binary PCD files with a streaming strategy. For comparison, when reading an ASCII PCD file I have an std::ostringstream read data from the file; once the buffer is full, or the amount of data read reaches the bucket size, I write it out to disk (see the sketch below).
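As a concrete illustration of that ASCII path, here is a minimal sketch; data_idx, the bucket size and the writeBucket helper are placeholders, not code from the actual splitter.

#include <fstream>
#include <sstream>
#include <string>

void splitAsciiBody (const std::string &in_file, std::size_t data_idx,
                     std::size_t bucket_points)
{
  std::ifstream is (in_file.c_str ());
  is.seekg (data_idx);                              // jump past the header to the body

  std::ostringstream buffer;                        // the "cache" for the current bucket
  std::size_t points_in_bucket = 0, bucket_id = 0;

  std::string line;
  while (std::getline (is, line))                   // one ASCII line == one point
  {
    buffer << line << '\n';
    if (++points_in_bucket == bucket_points)
    {
      // writeBucket (bucket_id, points_in_bucket, buffer.str ());  // hypothetical flush
      buffer.str ("");
      points_in_bucket = 0;
      ++bucket_id;
    }
  }
  // final partial bucket, if any:
  // if (points_in_bucket > 0) writeBucket (bucket_id, points_in_bucket, buffer.str ());
}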
Subject
Hack on the pcl::PCDReader / pcl::PCDWriter binary reading and writing methods so that we can split large binary PCD data into pieces without loading the whole file into memory.
Details
One plausible solution for splitting binary PCD files would be to use readBodyBinary/writeBodyBinary, but readBodyBinary consumes memory for the whole cloud: it decodes all of the data with the LZF decompressor (PCL implements its own version of the LZF encoder and decoder) and loads it into the data attribute of a PCLPointCloud2 instance.
I feel stuck here. Compared to the ASCII path, the relevant methods such as readBodyBinary and writeBinary are hard to understand, and I cannot apply a streaming strategy to split the data into pieces without loading all of it into memory.
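To make the memory problem concrete, here is a rough sketch of what decoding a binary_compressed body involves, based on my understanding of the on-disk layout (a 4-byte compressed size, a 4-byte uncompressed size, then one LZF block); this is not a copy of readBodyBinary.

#include <pcl/io/lzf.h>
#include <cstdint>
#include <cstring>
#include <vector>

// `mapped` points at the file contents starting at data_idx.
std::vector<char> decompressBody (const unsigned char *mapped)
{
  std::uint32_t compressed_size = 0, uncompressed_size = 0;
  std::memcpy (&compressed_size,   mapped,     4);  // 4 bytes right after data_idx
  std::memcpy (&uncompressed_size, mapped + 4, 4);  // next 4 bytes

  std::vector<char> out (uncompressed_size);
  // The whole body is one LZF block with no per-chunk framing, so the reader has
  // to decompress everything in a single call.
  pcl::lzfDecompress (mapped + 8, compressed_size, &out[0], uncompressed_size);

  // The decompressed buffer is field-major (all x values, then all y values, ...),
  // so it still has to be re-interleaved per point before it can be split by point.
  return out;
}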
Technical Details
Reader
Just like what I did for ASCII files, I need to figure out how binary PCDs are read by the PCL reader.
Unlike the ASCII path, the reader opens the file with ordinary file I/O and keeps a raw file descriptor.
It then calculates the file size using a disk seek at lines 750 and 751.
Why don't we read file metadata directly?
data_idx points to where the binary data starts after we scan the header. Could anyone give me any hints on how to use compressed_size?
Continuing, from line 811 to 820 we find:
Let us check Wikipedia: it says that mmap is used for memory mapping and shared memory. I suspect that the people who implemented this also wrote several related tools for logging and inspection, but I haven't seen any such tools or documentation.
Now we begin to process the memory mapped from the original PCD file.
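(As an aside on the mmap point: here is a sketch, my own illustration rather than the code in pcd_io.cpp, of mapping only a window of the file with plain POSIX mmap; the kernel faults pages in lazily, so only the touched range is actually read from disk.)

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Map `length` bytes starting at `offset` of an already-open file descriptor.
// mmap offsets must be page-aligned, hence the rounding below.
const char * mapWindow (int fd, std::size_t offset, std::size_t length,
                        void *&map_base, std::size_t &map_len)
{
  const std::size_t page    = static_cast<std::size_t> (sysconf (_SC_PAGESIZE));
  const std::size_t aligned = offset - (offset % page);

  map_len  = length + (offset - aligned);
  map_base = mmap (NULL, map_len, PROT_READ, MAP_SHARED, fd,
                   static_cast<off_t> (aligned));
  if (map_base == MAP_FAILED)
    return NULL;

  // Caller reads `length` bytes from the returned pointer, then munmap (map_base, map_len).
  return static_cast<const char *> (map_base) + (offset - aligned);
}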
Writer
We only analyze writeBinary here, because the other methods are either irrelevant or too trivial for writing a splitter that reads a data stream from a PCD file. As a matter of fact, writeBinary is much easier compared with readBodyBinary. Here is the most important operation:
I am unexpectedly curious about the construction of cloud.data from scratch. There should be no magic here, but I am not able to find the code that populates cloud.data.
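For what it's worth, the way cloud.data normally gets filled (by the readers, or by converting a pcl::PointCloud<T> with pcl::toPCLPointCloud2) boils down to resizing a flat byte buffer of width * height * point_step bytes and memcpy'ing each field to its offset. A minimal hand-rolled sketch, assuming a cloud whose fields already describe three consecutive floats x/y/z:

#include <pcl/PCLPointCloud2.h>
#include <cstdint>
#include <cstring>
#include <vector>

void fillXYZ (pcl::PCLPointCloud2 &cloud, const std::vector<float> &xyz)
{
  // Assumes cloud.fields already describes x, y, z as three consecutive floats.
  cloud.width      = static_cast<std::uint32_t> (xyz.size () / 3);
  cloud.height     = 1;
  cloud.point_step = 3 * sizeof (float);
  cloud.row_step   = cloud.point_step * cloud.width;
  cloud.data.resize (static_cast<std::size_t> (cloud.row_step) * cloud.height);

  for (std::size_t i = 0; i < cloud.width; ++i)
    std::memcpy (&cloud.data[i * cloud.point_step], &xyz[3 * i], cloud.point_step);
}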
@taketwo @UnaNancyOwen Sugiura