PointCloudLibrary / pcl

Point Cloud Library (PCL)
https://pointclouds.org/

[Feature & Hack] Split large PCD files into smaller files #2652

Closed yiakwy closed 4 years ago

yiakwy commented 5 years ago

Background

I am implementing a PCD file splitter to split large PCD files (more than 1 GB) using PCL v1.8. I plan to implement several strategies, such as streaming, multi-threading, async I/O, and mixtures of them.

Here I am having trouble implementing the splitter for binary PCD files with a streaming strategy. For example, when reading an ASCII PCD file, I read data from the file into an ostringstream and, once the buffer is full or the amount of data read reaches the bucket size, write it to disk.
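A minimal sketch of that ASCII bucketing idea (header handling is left out, and the bucket size and output naming are placeholders of mine):

    // Sketch: split the ASCII body of a PCD file into buckets of N points.
    // The (rewritten) header for each bucket is assumed to be handled separately.
    #include <cstddef>
    #include <fstream>
    #include <sstream>
    #include <string>

    void split_ascii_body (std::istream &in, std::size_t bucket_size,
                           const std::string &out_prefix)
    {
      std::ostringstream buffer;          // accumulates one bucket of points
      std::size_t points_in_bucket = 0;
      std::size_t bucket_index = 0;
      std::string line;

      while (std::getline (in, line))     // one point per ASCII line
      {
        buffer << line << '\n';
        if (++points_in_bucket == bucket_size)
        {
          std::ofstream out (out_prefix + std::to_string (bucket_index++) + ".body");
          out << buffer.str ();           // flush the full bucket to disk
          buffer.str ("");
          points_in_bucket = 0;
        }
      }
      if (points_in_bucket > 0)           // flush the last, partial bucket
      {
        std::ofstream out (out_prefix + std::to_string (bucket_index) + ".body");
        out << buffer.str ();
      }
    }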

Subject

Hack on pcd::io::PCDReader/PCDWriter binary reading and writing methods:

  1. readBodyBinary
  2. writeBinary

So that we can split a large binary PCD file into pieces without loading the whole data into memory.

Details

One plausible solution for splitting binary PCD files could be to use readBodyBinary/writeBodyBinary, but that consumes memory for the whole file: it decodes all the data with the LZF decompressor (PCL implements its own version of the LZF encoder and decoder) and loads it into the data attribute of a PointCloud2 instance.

I feel stuck here. Compared to the ASCII path, the relevant methods such as readBodyBinary and writeBinary are hard to understand, and I cannot apply a streaming strategy to split the data into pieces without loading all of it into memory.

Technical Details

Reader

Just as I did for ASCII files, I need to figure out how binary PCD files are read by the PCL reader.

Unlike the ASCII path, the binary path opens the file with ordinary file I/O and maintains a file descriptor:

int fd = io::raw_open (file_name.c_str (), O_RDONLY); // line 742

Then the file size is computed by seeking to the end of the file, at lines 750 and 751:

    const size_t file_size = io::raw_lseek (fd, 0, SEEK_END);
    io::raw_lseek (fd, 0, SEEK_SET);

Why don't we read the file metadata directly?
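For comparison, the size could also be taken from the file's metadata through fstat on the descriptor that is already open; a sketch of what I mean (not what PCL currently does):

    #include <cstddef>      // std::size_t
    #include <sys/stat.h>   // fstat, struct stat

    // Sketch: ask the kernel for the inode metadata instead of seeking to the end.
    // 'fd' is the descriptor returned by io::raw_open above.
    struct stat attrib;
    if (::fstat (fd, &attrib) == 0)
    {
      const std::size_t file_size = static_cast<std::size_t> (attrib.st_size);
      // ... use file_size exactly like the lseek-based value ...
    }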

data_idx points to where binary data starts after we scan the header.

      // Read compressed size to compute how much must be mapped
      unsigned int compressed_size = 0;
      ssize_t num_read = io::raw_read (fd, &compressed_size, 4);

Could anyone give me any hints on how compressed_size is used?
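For what it's worth, here is how I currently read the role of compressed_size; please correct me if the layout below is wrong:

    // Body layout of a compressed binary PCD (data_type == 2), as I read the code:
    //
    //   [0, data_idx)                     ASCII header
    //   [data_idx, data_idx + 4)          uint32  compressed_size
    //   [data_idx + 4, data_idx + 8)      uint32  uncompressed_size (should equal cloud.data.size ())
    //   [data_idx + 8, data_idx + 8 + compressed_size)   LZF-compressed payload
    //
    // so, with offset == 0, the reader only has to map this many bytes:
    const std::size_t mmap_size = data_idx + 8 + compressed_size;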

Continuing, from lines 811 to 820 we find:

// Mapped memory is used here; I am not familiar with this technique.
// Could the author tell me more about it? I haven't seen any multi-threading or multi-process
// code related to the PCD reader that would require shared memory.
unsigned char *map = static_cast<unsigned char*> (::mmap (0, mmap_size, PROT_READ, MAP_SHARED, fd, 0)); // I am using Apple OS X machines. Is it possible to use `ostringstream` here?
...

res = readBodyBinary (map, cloud, pcd_version, data_type == 2, offset + data_idx);

Wikipedia says that mmap is used for memory mapping and shared memory. I suspect that the people who implemented this also wrote some related tools for logging and inspection, but I haven't seen any such tools or documentation.
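As far as I understand, mmap here is simply a fast way to expose the file's bytes in the process's address space; MAP_SHARED does not mean another process is involved, and an ostringstream would only copy the bytes into yet another in-memory buffer, so it would not help. If one wanted to stream, the offset argument of mmap could map one window at a time, as long as the offset is page-aligned. A rough sketch (the window size and variable names are mine; note that for data_type == 2 the LZF payload is one compressed block, so windowing the raw file does not by itself give decompressed chunks):

    #include <sys/mman.h>   // mmap, munmap
    #include <unistd.h>     // sysconf

    // Sketch: map one window of the file at a time instead of the whole body.
    // 'fd' is the descriptor opened with io::raw_open; the window size is arbitrary.
    const long page = ::sysconf (_SC_PAGESIZE);
    const std::size_t desired_offset = data_idx;                         // first byte we care about
    const std::size_t aligned_offset = (desired_offset / page) * page;   // mmap offsets must be page-aligned
    const std::size_t delta = desired_offset - aligned_offset;
    const std::size_t window_size = 64 * 1024 * 1024;                    // e.g. 64 MB per window

    unsigned char *map = static_cast<unsigned char *> (
        ::mmap (nullptr, window_size + delta, PROT_READ, MAP_SHARED, fd, aligned_offset));
    if (map == MAP_FAILED)
      { /* handle the error */ }
    unsigned char *window = map + delta;   // points at 'desired_offset' inside the file
    // ... process this window, then munmap and map the next one ...
    ::munmap (map, window_size + delta);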

Now we begin to process mapped memory from the original pcd file.

// based on the result extracted by pcl::io::PCDReader::readHeaderASCII
if (compressed) {
// data_type == 2
...
/*
because we are processing binary data, we use `memcpy` to copy the data in its in-memory format; the header tells us how many bytes make up each point, so we can count the points and read them.
*/
...

// now decompress our data into buf, whose size is the expected uncompressed size.
 unsigned int data_size = static_cast<unsigned int> (cloud.data.size ());
 std::vector<char> buf (data_size);
// The size of the uncompressed data better be the same as what we stored in the header
unsigned int tmp_size = pcl::lzfDecompress (&map[data_idx + 8], compressed_size, &buf[0], data_size);
...
// I have to say, if you really want a good compression ratio, Parquet (a column-based binary storage format) is really good.
...
//  As you might note, the author uses `std::vector<char> buf (data_size)` as the buffer. Why not `char* buf = new char[data_size]`?
...
// then we compute the field sizes. In my version of pcl::io, I wrapped this fragment of code in a function. L587

    // Unpack the xxyyzz to xyz
    std::vector<char*> pters (fields.size ());
    int toff = 0;
    for (size_t i = 0; i < pters.size (); ++i)
    {
      pters[i] = &buf[toff];
      toff += fields_sizes[i] * cloud.width * cloud.height;
    }
    // Copy it to the cloud
    for (size_t i = 0; i < cloud.width * cloud.height; ++i)
    {
      for (size_t j = 0; j < pters.size (); ++j)
      {
        memcpy (&cloud.data[i * fsize + fields[j].offset], pters[j], fields_sizes[j]);
        // Increment the pointer
        pters[j] += fields_sizes[j];
      }
    }


} else {
// much easier, use `memcpy` to copy data to `cloud.data` directly.
...
}

// convert the data buffer to the PCLPointCloud2 data format.
...
} // the end
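Given that layout, a splitter could, in principle, take the decompressed field-ordered buffer and copy out only a range of points per field. A rough sketch under the assumption that buf, fields, fields_sizes and the point count from above are available (the function name and signature are mine):

    #include <cstddef>
    #include <cstring>
    #include <vector>
    #include <pcl/PCLPointField.h>

    // Sketch: copy points [begin, end) of every field out of the field-ordered
    // (xx..yy..zz..) buffer into a row-ordered chunk of point_step bytes per point.
    std::vector<char> extract_chunk (const std::vector<char> &buf,
                                     const std::vector<pcl::PCLPointField> &fields,
                                     const std::vector<int> &fields_sizes,
                                     std::size_t total_points,
                                     std::size_t begin, std::size_t end,
                                     std::size_t point_step)
    {
      std::vector<char> chunk ((end - begin) * point_step);
      std::size_t field_start = 0;                   // start of this field's column in buf
      for (std::size_t j = 0; j < fields.size (); ++j)
      {
        const char *src = &buf[field_start + begin * fields_sizes[j]];
        for (std::size_t i = begin; i < end; ++i)
        {
          memcpy (&chunk[(i - begin) * point_step + fields[j].offset], src, fields_sizes[j]);
          src += fields_sizes[j];
        }
        field_start += static_cast<std::size_t> (fields_sizes[j]) * total_points;
      }
      return (chunk);
    }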

Writer

We only analyze writeBinary here, because the other methods are either irrelevant or too trivial for writing a splitter that reads the data stream from a PCD file.

As a matter of fact, writeBinary is much easier than readBodyBinary. Here is the most important operation:

msync (map, static_cast<std::size_t> (data_idx + cloud.data.size ()), MS_SYNC);
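For the splitter itself I suspect the mmap/msync details can be left to the library: once a bucket's PCLPointCloud2 has been assembled (header adjusted to the bucket's width/height, data holding only that bucket's points), the stock writer can be reused. A sketch (the bucket naming is mine):

    #include <pcl/io/pcd_io.h>
    #include <string>

    // Sketch: write one bucket with the stock writer. 'bucket' is a pcl::PCLPointCloud2
    // that already holds only this chunk's points and a header with matching width/height.
    pcl::PCDWriter writer;
    const std::string out_name = out_prefix + std::to_string (bucket_index) + ".pcd";
    if (writer.writeBinary (out_name, bucket) != 0)
    {
      // handle the error
    }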

I am particularly curious about how cloud.data is constructed from scratch. There is no magic:

// https://github.com/otherlab/pcl/blob/master/common/include/sensor_msgs/PointCloud2.h, Line 41
...
std::vector<pcl::uint8_t> data;
...

But I was not able to find the code that populates cloud.data.
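As far as I can tell, cloud.data is just a flat byte buffer: the readers memcpy into it, and pcl::toPCLPointCloud2 fills it when converting from a pcl::PointCloud<T>. Populating it by hand for one bucket would look roughly like this (a sketch, not code from the library; header_only, points_in_bucket and chunk_bytes are assumed to come from the reading side):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <pcl/PCLPointCloud2.h>

    // Sketch: build the data buffer of one bucket from a raw, row-ordered chunk.
    // 'header_only' is a PCLPointCloud2 whose fields/point_step were copied from the source header.
    pcl::PCLPointCloud2 bucket = header_only;
    bucket.width    = static_cast<std::uint32_t> (points_in_bucket);
    bucket.height   = 1;                                  // a split bucket is unorganized
    bucket.row_step = bucket.point_step * bucket.width;
    bucket.data.resize (static_cast<std::size_t> (bucket.row_step) * bucket.height);
    memcpy (&bucket.data[0], chunk_bytes, bucket.data.size ());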

@taketwo @UnaNancyOwen Sugiura

yiakwy commented 5 years ago

Instead of seeking through the file to get its real size,

    const size_t file_size = io::raw_lseek (fd, 0, SEEK_END);
    io::raw_lseek (fd, 0, SEEK_SET);

I recommend using the following code instead (it asks the kernel for the inode metadata):

    #include <fstream>      // std::ifstream / std::ofstream
    #include <sys/stat.h>   // stat, struct stat

    std::ofstream dest;

    // read body stream
    std::ifstream fs;
    fs.open(in_file_name.c_str());
    if (!fs.is_open() || fs.fail()) {
        // PCL_SEGMENT_LOG_INFO / PCL_SEGMENT_FORMAT are my own logging macros
        PCL_SEGMENT_LOG_INFO << PCL_SEGMENT_FORMAT("Could not open file %s.", in_file_name.c_str());
        return -1;
    }

    struct stat attrib;
    // see discussion from mail archive: http://www.cpptips.com/fstream
    // also see https://stackoverflow.com/questions/11558447/retrieving-file-descriptor-from-a-stdfstream
    if (stat(in_file_name.c_str(), &attrib) < 0) {
        PCL_SEGMENT_LOG_INFO << PCL_SEGMENT_FORMAT("Could not find metadata of file %s", in_file_name.c_str()) << std::endl;
    }

    mem_size = (size_t) attrib.st_size;

    fs.seekg(data_idx);   // jump straight to where the binary body starts

SergioRAgostinho commented 5 years ago

Hey @yiakwy . Please indulge my questions and comments as I try to understand what you're trying to accomplish.

From what you describe, you seem to be interested in splitting PCD files into multiple ones. The idea is then to use a streaming approach in order to continuously load chunks of data, which will then populate a PCL type, either PCLPointCloud2 or the plain old PointCloud. You already noticed that the PCD Reader/Writer makes use of memory mapping to quickly read/write data to the hard drive. This is a problem for you because you want to avoid loading all data into memory.

At first sight I see an issue here: at some point there will need to exist one PointCloud or PCLPointCloud2 variable that will hold all the data you've read in memory, even if you hack the reader to read from multiple files. None of these core types are designed to behave like a stream. Is this going to be a problem for you?

yiakwy commented 5 years ago

@SergioRAgostinho Hi, thanks for your comment. I have a class PCDSplitter and its implementation. An instance of the class holds a PCLPointCloud2 instance (I am processing rosbag data in the V7 format) which only stores the header; it helps me dynamically generate a new header for each bucket, then flush the cache and write the buckets to disk sequentially or in a multi-threaded context.
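For context, the rough shape of that class (member names here are placeholders, not the actual implementation):

    #include <cstddef>
    #include <string>
    #include <pcl/PCLPointCloud2.h>

    // Rough shape of the splitter described above: only the source header is kept
    // in memory; buckets are generated and flushed one by one.
    class PCDSplitter
    {
      public:
        PCDSplitter (const std::string &in_file, std::size_t bucket_size)
          : in_file_ (in_file), bucket_size_ (bucket_size) {}

        // stream the body, rewrite a header per bucket, flush each bucket to disk
        int split (const std::string &out_prefix);

      private:
        pcl::PCLPointCloud2 header_;   // header fields only; header_.data stays empty
        std::string in_file_;
        std::size_t bucket_size_;      // points per output file
    };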

I am currently trying to hack the LZF algorithm so that I can decompress the compressed data in chunks without loading the whole data into memory.

I understand that the compressed data is laid out column-wise, like the famous columnar formats (e.g. Parquet, which traces back to Google's Dremel paper). This is what really concerns me.

SergioRAgostinho commented 5 years ago

I see... I'm not familiar at all with the LZF implementation so I can't really point you in any direction. Maybe someone else at @PointCloudLibrary/maintainers can chime in, but my guess is that no one is familiar with that code.

I feel that what you're trying to achieve doesn't really have much use to the library at this point, because we're not at a stage where we can handle a really large number of points. There are a number of methods which currently fail above certain size thresholds. Ultimately, handling extremely large data is something we want to have, but it is extremely low priority at the moment. Batch processing/streaming is not one of our goals right now.

yiakwy commented 5 years ago

I agree with your judgement with respect to the priority of the algorithm, but I do see certain cases where it might be used. I also observed that Autoware (a Japanese autonomous-driving solution) implemented a grid data structure to split large PCD data into pieces.

I communicated with our ADAS team. For good compression they would like to store data in large files. However, our algorithms (including PCL) are suited to data files of around tens of MB; loading a larger file into memory can take up to several minutes. We are constructing a preprocessing pipeline so that we can asynchronously load data into memory by "splitting", "merging" and "downsampling" on demand, just like what we have done on the web (think of map tiles; there is no web tile framework for HD maps at the moment). On the web, we have implemented an algorithm to split any PCD file into chunks and feed them in as a stream.

That's why I brought the question to the community, to seek help from the PCL members.

stale[bot] commented 4 years ago

Marking this as stale due to 30 days of inactivity. It will be closed in 7 days if no further activity occurs.

kunaltyagi commented 4 years ago

We might revive this once we finish the migration to index_t AND the community shows interest again.

Closing in the meantime.