asmaloney / libE57Format

Library for reading & writing the E57 file format
Boost Software License 1.0

Writing and Reading Sections of .e57 File #244


JonKirkland commented 1 year ago

Hello! Currently I only know how to read and write e57 files by storing the data for an entire file in a buffer. Since I have some large .e57 files I would like to work with, I was wondering if it is possible to read points 0-5 million, write those 5 million points, then read points 5-10 million, and so on, so that less memory is used at any one time. Looking at the docs I saw CompressedVectorReader.seek(), but I was not able to get it working, and I could not find any example of it in the tests. If anyone could outline a way to do this I would greatly appreciate it.

asmaloney commented 1 year ago

The code for seek() looks like this:

   void CompressedVectorReaderImpl::seek( uint64_t /*recordNumber*/ )
   {
      checkImageFileOpen( __FILE__, __LINE__, static_cast<const char *>( __FUNCTION__ ) );

      // !!! implement
      throw E57_EXCEPTION1( ErrorNotImplemented );
   }

This is related to https://github.com/asmaloney/libE57Format/issues/79 - though you are also asking for a batch/streaming interface.

(I've mentioned in other places that I started a new implementation from scratch a while ago. I'd implemented batched reading the way you describe because I think it makes a lot of sense!)

JonKirkland commented 1 year ago

Sorry to revive this, but is the fact that libe57 uses Xerces preventing file streaming? I'm just curious as I've been using this library a lot and have started to poke around the code to gain a better understanding. Great project btw.

asmaloney commented 1 year ago

is the fact that libe57 uses Xerces preventing file streaming?

Nope - that's a separate issue. The issue with Xerces is that it's like using a sledgehammer to push a tack into cork, and it has been a constant source of problems to include & build. A small, simple library like pugixml would be better. The structure of libE57Format's code, however, makes replacing the XML handling a fair bit of work.
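
For comparison, here's a rough sketch of what parsing with pugixml looks like (illustrative only - nothing in libE57Format uses pugixml today):

   #include "pugixml.hpp"

   // Parse an XML document and walk the children of the e57Root element.
   // Illustrative only - libE57Format does not currently use pugixml.
   void walkE57Root( const char *path )
   {
      pugi::xml_document doc;

      if ( doc.load_file( path ) )
      {
         for ( pugi::xml_node child : doc.child( "e57Root" ).children() )
         {
            // inspect child.name(), child.attribute( ... ), etc.
         }
      }
   }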

For streaming, I think it would be possible to implement CompressedVectorReaderImpl::seek and use it somehow (which I believe was the original intent), but not efficiently because the library doesn't implement certain features from the standard (e.g. indexing).

dancergraham commented 3 weeks ago

For streaming, I think it would be possible to implement CompressedVectorReaderImpl::seek and use it somehow (which I believe was the original intent), but not efficiently because the library doesn't implement certain features from the standard (e.g. indexing).

Any ideas / pointers on how this would be done? Are you referring to the ASCE standard for e57 files?

asmaloney commented 3 weeks ago

Are you referring to the ASCE standard for e57 files?

The ASTM standard specifies a way to set up indices. Up until 3.2, this library (and the "reference" one) didn't include any index packets at all even though at least one is required by the standard.

I think the seek method was supposed to use these indices to quickly jump to a record (hence the recordNumber param). If these indices were implemented properly, then you could jump to a specific range of points efficiently (e.g. "read 100k points starting from record 1,156,278").
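
To illustrate the intent, usage might have looked something like this (hypothetical - seek() is declared in the public API but throws ErrorNotImplemented today, and readBatchStartingAt is a made-up helper name):

   #include <cstdint>
   #include <vector>

   #include "E57Format.h"

   // Hypothetical sketch: jump to a record via index packets, then read a
   // bounded batch from there. The seek() call currently throws.
   void readBatchStartingAt( e57::CompressedVectorNode &points,
                             std::vector<e57::SourceDestBuffer> &buffers,
                             uint64_t startRecord )
   {
      e57::CompressedVectorReader reader = points.reader( buffers );

      reader.seek( startRecord );     // would use index packets to jump to the record
      unsigned count = reader.read(); // then fill the buffers starting from there
      (void)count;

      reader.close();
   }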

In my other E57 implementation, reading & processing is done in batches instead of all points at once, which I think is a better way to handle reading in general. Something like:

   PointData pd = <read from file structure>;
   pd.setBatchsize( 1024 * 100 );

   auto readCallback = []( const PointRecordList &inList ) {
      // process the points - inList is "batch size" in length
      // (or however many are left to be read)
   };

   pd.readByBatch( readCallback );

Something like this could probably be implemented on top of libE57Format with a bit of work.
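
For anyone who wants to try, here's a minimal sketch of what such a layer might look like using the Foundation API. PointBatch and readByBatch are made-up names here, and it assumes CompressedVectorReader::read() can be called repeatedly to refill fixed-size buffers (which is how the Foundation API is designed to be used):

   #include <cstddef>
   #include <functional>
   #include <vector>

   #include "E57Format.h"

   // Hypothetical helper - not part of the library. Reads a CompressedVectorNode
   // in fixed-size batches, invoking a callback for each batch.
   struct PointBatch
   {
      const std::vector<double> &x, &y, &z;
      size_t count; // number of valid points in this batch
   };

   void readByBatch( e57::ImageFile &imf, e57::CompressedVectorNode &points,
                     size_t batchSize,
                     const std::function<void( const PointBatch & )> &callback )
   {
      std::vector<double> x( batchSize ), y( batchSize ), z( batchSize );

      // Each SourceDestBuffer has capacity batchSize; conversion & scaling are
      // enabled so scaled-integer coordinates come back as doubles.
      std::vector<e57::SourceDestBuffer> buffers;
      buffers.emplace_back( imf, "cartesianX", x.data(), batchSize, true, true );
      buffers.emplace_back( imf, "cartesianY", y.data(), batchSize, true, true );
      buffers.emplace_back( imf, "cartesianZ", z.data(), batchSize, true, true );

      e57::CompressedVectorReader reader = points.reader( buffers );

      // Each read() fills the buffers with up to batchSize records and returns
      // how many were actually read; 0 means the end of the vector.
      unsigned count = 0;
      while ( ( count = reader.read() ) > 0 )
      {
         callback( PointBatch{ x, y, z, count } );
      }

      reader.close();
   }

Memory stays roughly constant this way, though it's still strictly sequential - without seek() there's no way to start from an arbitrary record.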

JonKirkland commented 3 weeks ago

Hello, I was looking at this again two months ago and spent a little time trying to implement batching/chunking to reduce process memory. Here's a rundown of how the reading works, along with some possible pointers to help anyone else who wants to have a look.

First, memory is allocated when passing a Data3D header to the Data3DPointsData_t constructor; it uses the header's point count to size the various buffers so there is enough memory to read in all the data at once. This needs changing, possibly by adding another constructor with a batch-size argument (one possible workaround is sketched below).
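
In the meantime, one possible (untested) workaround might be to size the buffers from a copy of the header with pointCount clamped to the batch size, then call read() in a loop. This assumes the 3.x Simple API names (Data3DPointsDouble, SetUpData3DPointsData) and that the returned reader keeps its position between read() calls:

   #include <cstdint>

   #include "E57SimpleReader.h"

   // Untested sketch: allocate batch-sized buffers by clamping pointCount on a
   // copy of the header, then read the first scan in batches.
   void readFirstScanInBatches( const char *path )
   {
      e57::Reader reader( path, {} );

      e57::Data3D header;
      reader.ReadData3D( 0, header );

      const int64_t batchSize = 5'000'000;

      e57::Data3D batchHeader = header;
      if ( batchHeader.pointCount > batchSize )
      {
         batchHeader.pointCount = batchSize;
      }

      // Buffers are now sized to the batch instead of the whole scan.
      e57::Data3DPointsDouble buffers( batchHeader );

      e57::CompressedVectorReader cvReader = reader.SetUpData3DPointsData(
         0, static_cast<size_t>( batchHeader.pointCount ), buffers );

      unsigned count = 0;
      while ( ( count = cvReader.read() ) > 0 )
      {
         // process `count` points from buffers.cartesianX / cartesianY / cartesianZ
      }

      cvReader.close();
   }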

Secondly, the SetUpData3DPointsData() function checks the structure of the data and maps the various fields (like color, cartesianX, and intensity) to the corresponding buffers; the vector of SourceDestBuffers just tells the CompressedVectorReader where to read each field to. So I don't think this needs changing for batching.

Next, the actual reading of the data is done using the CompressedVectorReader returned from the above function, which ends up in CompressedVectorReaderImpl::read().

Within read(), the main logic is in three functions:

BitpackDecoder::inputProcess() - I am not sure what this actually does, so I don't know whether it needs to be changed.

earliestPacketNeededForInput() - This gets the file offset of the next packet of data to read. In order to implement batching without seek() (i.e. reading from 0 to pointCount, but without using all the memory), the offset reached by the last batch needs to be saved, possibly as part of the reader class, so that it can be tracked between batches (see the sketch after this list).

feedPacketToDecoders() - This takes the offset as an argument; possibly nothing needs to change here, but I'm not sure.
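
To sketch the offset-tracking idea (purely illustrative - savedPacketOffset_ is a made-up member, and the real read() logic is more involved than this):

   // Inside a hypothetical batched CompressedVectorReaderImpl::read().
   // savedPacketOffset_ would be a new member; the two function names below
   // are from the existing code.
   uint64_t packetOffset = ( savedPacketOffset_ != 0 )
                              ? savedPacketOffset_
                              : earliestPacketNeededForInput();

   // ... feed packets to the decoders until the destination buffers are full;
   // feedPacketToDecoders( packetOffset ) processes one packet, and the loop
   // advances packetOffset to the next packet ...

   savedPacketOffset_ = packetOffset; // the next batch resumes from here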

I remember I got stuck somewhere, but I also realize I haven't tried everything I wrote above, so I'll probably have another go at it within the next couple weeks if no one else does.