io: reader for fits.fz files containing protobufs compressed with CTA compression

dneise commented 6 years ago

Hello, yesterday in the SST1M analysis meeting it was agreed that SST1M wants to contribute as much as possible to ctapipe, and to gain as much as possible from ctapipe.

SST1Ms raw data format is "fits.fz". There exists a dedicated reader for this file format, including Python bindings.It is maintained by @Etienne12345.

The reader can for example be installed like this:

conda create -n test_protozfits python=3.5
source activate test_protozfits
conda install numpy
pip install http://www.isdc.unige.ch/~lyard/repo/ProtoZFitsReader-0.42.Python3.5.Linux.x86_64.tar.gz
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib/python3.5/site-packages:$LD_LIBRARY_PATH

And then tested (SST1M is currently looking into the possibility to add small example files to ctapipe-extra), like this:

import protozfitsreader as pz
f = pz.ZFile(path)
# to stuff with f

Now the question is, how should one proceed to integrate this reader into the io package of ctapipe.

dneise commented 6 years ago

Ah ... I just see there is already an open PR #340 about this issue.

I'll review that one now, in order to understand what has already been done.

dneise commented 6 years ago

We just talked to @Etienne12345 on the phone. He said he is planning to publish the reader as a conda package. I think this solves this issue. I just would keep it open until the package is ready.

watsonjj commented 6 years ago

I believe this raises an important issue. Camera teams want to start reading their data into ctapipe, and members of the MC validation teams want to use ctapipe for common processing of the data between MC and real. However with the lack of a common data format, each camera team wants to read in their own custom data format.

However, as these data formats are unofficial and will not be maintained by ctapipe in the future, I have been hesitant adding these readers into the main ctapipe repository. I have not included the ctapipe reader for CHEC in the main repository.

I am starting to change my opinion on this approach. With these readers hidden away, camera teams that are new to ctapipe have no reference on how they should do it. Therefore I propose creating a ctapipe/io/unofficial directory, where I will write guidelines on creating a reader for a cameras custom data type, and where the readers will be stored. This will keep some order among the unofficial readers, and will be retired once the common data type is defined.

Inside the guidelines I will outline how to implement the reader in a way such that it does not introduce dependencies on external camera software, so normal ctapipe users will only need to install the software if they want to read these specific data files.

@kosack, does this seem like an acceptable approach?

watsonjj commented 6 years ago

Also in the guidelines I would advise camera teams to include installation instructions for their custom software in the documentation of their ctapipe reader.

Etienne12345 commented 6 years ago

Just a clarification: I plan to make a conda installer for the zfits reader -only- if it makes it to ctapipe/io (official or not). As long as it remains a custom, separate package it looks to me that the efforts to move from pip to conda are somewhat not worth it

maxnoe commented 6 years ago

@Etienne12345 I think you have a wrong impression of what it means to provide a conda package.

Any python package will still be installable using pip, for a conda package you only have to provide additional meta data like dependencies on conda packages in yaml format.

Is the source code of the zfits reader public? If yes I can have a look to also add the conda recipe.

See https://conda.io/docs/user-guide/tutorials/build-pkgs-skeleton.html

So it's not moving away from pip.

Etienne12345 commented 6 years ago

Actually I have no impression at all: I have very little experience with conda packages: thanks for the clarification !

Sure everything is public. The package is available from where Dominik listed it in the first post above. If you want the source code of the c++ library itself, then it is in the cta svn into ACTL/CamerasToACTL/trunk/IO/Fits/Python/

If you have a look at it would be great ! otherwise I'll look into your linked documentation asap.

dneise commented 6 years ago

Can somebody remind me where the CTA SVN is hosted?

Etienne12345 commented 6 years ago

For direct svn access: svn+ssh://USERNAME@svn.in2p3.fr/cta/ACTL/CamerasToACTL/trunk But you won't be able to access it unless you have provided your public ssh key to them.

Alternatively you can browse it from the cta redmine: https://forge.in2p3.fr/projects/ctaactl/repository

maxnoe commented 6 years ago

I can login but I get a Permission denied error (for the redmine)

dneise commented 6 years ago

svn+ssh://USERNAME@svn.in2p3.fr/cta/ACTL/CamerasToACTL/trunk

thanks ... worked ... (must have done that sometime in the past already)

fvisconti commented 6 years ago

Hi everyone, I post this message on behalf of the ASTRI team in Rome (not everybody in my team have access here):

A good lossless compression alghorithm for CTA huge amount of data is mandatory in order to limit the file transfer and storage. For this reason during 2014 we evaluated different compression tools and in particular we started a collaboration with William Pence and Rob Seaman (NOAO/NASA) to find whether the fpack CFITSIO utility could come into help for the ASTRI archive and pipeline data-chain: is it the compression used for fits.fz?

That software was very promising (and well advertised), but after some interactions and the real use on cherenkov data, we stated that expectations were not satisfied. So we decided in the first months of 2015 to migrate to use the standard and robust gzip tool to compress data in the ASTRI archive.

Is it possible to have a full description of the improvement in the current use of Fpack for CTA data (respect the 2014 version)? Despite the 2014 fpack lack of maturity, we hope that Etienne's work will have generate a more robust and usable algorithm for CTA.

Cheers

dneise commented 6 years ago

This issue has somehow changed topic and I would like to come back to the original topic:

"How should one proceed to integrate this reader into the io package of ctapipe."

In this respect I would like to reply to what @watsonjj said above:

With these readers hidden away, camera teams that are new to ctapipe have no reference on how they should do it. Therefore I propose creating a ctapipe/io/unofficial directory, where I will write guidelines on creating a reader for a cameras custom data type, and where the readers will be stored. This will keep some order among the unofficial readers, and will be retired once the common data type is defined.

I think what Jason proposed is a solution to our problem. He has readers, which can serve as examples and even wants to add some guidelines. And into which folder some module goes is sometimes a question of taste.

But why not simply put them into ctapipe.io?

To read from Rule 3 again: "Most scientists don't have requirements, they are their own users, --> closest fit from industry is Agile Development".

Some Agile methods (e.g. TDD) say: Choose the most simple solution that makes your test pass.

When CTA in the future crowns one file format to be the ruler of them all, and thus ctapipe-devs retire the other readers and move the single reader which turned out to be the "official one" from io.unofficial to io. And when then the issues come flowing in: "Hey, why can't I read my old XYZ files anymore from commissioning times in 2018?".

Do you really think these users are any more happy when your answer is: "C'mon ... that reader was all the time in io.unofficial you cannot complain about the fact, that we have deleted it now." ?

watsonjj commented 6 years ago

I strongly argue against keeping these unofficial readers in ctapipe for much longer after the cta data format has been decided, for multiple reasons:

ctapipe cannot be expected to maintain this code. Especially due to the time and effort required to install the external software for each reader.
It encourages laziness in the adoption of the new CTA data format.
It adds confusion on which data format is actually supported by cta.
If someone wants to read their old data format, it is easy enough to add their reader on their local ctapipe branch or an external package that imports from ctapipe. Alternatively, they can create a convertor from their old data format into the official one.

I believe ctapipe is not the correct place for these unofficial readers, but as we still do not have an official data format yet, we are forced to have a temporary solution.

watsonjj commented 6 years ago

"Most scientists don't have requirements, they are their own users, --> closest fit from industry is Agile Development".

Individual Scientists may not, but CTA definitely does. Ctapipe is part of the processing chain to produce the data for the scientists, only "experts" will have a local copy of ctapipe and perform their own data processing. It is up to these "experts" if they want to read old data formats, but this is not a process supported by CTA.

watsonjj commented 6 years ago

An alternative approach could be to create a new github repository "ctapipe-unofficial-readers" for these unofficial readers. Ctapipe could optionally import these readers for the meantime, but in the future would not. This is exactly how I currently do it for the CHEC reader, and is quite simple.

This would seperate them from official ctapipe, but they would at least continue to exist in their repository after the the transition to the official data format, making it easier for an "expert" to read their old data format in the future.

maxnoe commented 6 years ago

I strongly oppose this add of complexity. CTA is in a prototyping phase, for the Telescopes as well as for the Software. These readers belong into ctapipe, to make life for all people affected easier and more transparent.

watsonjj commented 6 years ago

Okay, perhaps a compromise between the two. The unofficial readers exist in ctapipe.io.unofficial during the prototyping phase (I think it's important we have some directory structure to maintain order with these different readers).

Once the prototyping phase is complete, they are moved to a new repository in the cta-observatory channel for archiving. There is no requirement to maintain these readers. If someone wishes to read the old data formats, then can use this repository for reference.

dneise commented 6 years ago

I think there is no reason to fight over this.

I believe it can only be positive for ctapipe to be useful for as many people as possible, hence including unofficial CHEC and SST1M file format readers into ctapipe is in my opinion preferable over "hiding" them in another place, where people might not find them.

If we agree on that, and I think we do, then we should not put too much of our time and power into a discussion about where to exactly put it.

So if Jason wants to put them into io.unofficial I prefer that a lot over not putting it in at all.

kosack commented 6 years ago

I don't mind either way - a separate package or including them in ctapipe. If we include them, however, they must be testable, meaning we need small (<10 MB) test files to validate them, and they must have working tests that run in the continuous integration system. That also means we need additional dependencies (at least temporarily). For readers that need lots of C++ code behind them, that may present some maintenence issues (unless that part is separated into a separate package).

In the future, we'll hopefully have an "official" CTA format, and we will replace all of these with that (including hessio).

One thing we could do is the following:

have the high-level event sources inside ctapipe/io
keep the low-level access libraries as separate packages (like pyhessio, zfits)
don't require users to have the low-level libraries (make the import fail gracefully, so they are optional)
in the CI system, make sure we do have all the low-level libraries so the testing can be made.

Remember also that we are going to soon fully re-factor the Container class hierarchy, which means also refactoring the event source generators or FileReader classes (I'd also like to merge those two into a single interface). So if we add to many readers now, the work will be of course larger. On the other hand, it may be easier to update all of them at once, and make a more uniform interface if we can see how all of them work.

dneise commented 6 years ago

Thank you Karl for this answer

[...] it may be easier to update all of them at once, and make a more uniform interface if we can see how all of them work.

I believe so, yes. Hence I tried to revive the protozfitsreader PR in PR #598

If I translate your answer to the SST1M case, I understand we need

[x] our low level reader as external packages protozfitsreader
[x] high level event source in ctapipe.io, like in #598 in ctapipe.io.zfits
[x] like this try-import-except-raise-block ?
[ ] Continous integration: Currently the "protozfitsreader" python package (I know of) is python3.5 only, but ctapipe CI tests 3.5 and 3.6. So currently this would fail on 3.6
[ ] small test files.
[x] Keep future refactoring in mind.

watsonjj commented 6 years ago

I will work on creating guidelines over the next week. In the guidelines I would advise on how to best make the imports "fail gracefully".

I would advise @dneise to wait for these guidelines to be complete before declaring pull request #598 complete.

I will need to figure out how I can make our GCT software installable for the CI system... this is not immediately obvious, as it is a C++ library, that depends on another C++ library, both existing in svn, and both wrapped with SWIG to allow python interface.

watsonjj commented 6 years ago

What if some teams do not want their custom reader software to be made public (likely the situation with our reader), but they wish to temporarily read their custom data format into ctapipe? As this is temporary code, is it not satisfactory to have tests that only run if the custom software is installed on the system?

maxnoe commented 6 years ago

What if some teams do not want their custom reader software to be made public

Really?

watsonjj commented 6 years ago

Yes, the C++ library we created to read our data format is located privately on SVN. We do not want to move it to another location, as this is software only intended to be used within the team, and exists as a placeholder until the common data format is defined. I was hoping I could have some useful input and discussion on a solution that would work for our case, as I am certainly not aware of all the options out there.

watsonjj commented 6 years ago

I understand the zfits reader is also a C++ library at its core. Perhaps I could be given some insight into how to go from my situation (a C++ library, that depends on a C++ library, that uses SWIG to wrap the code into a python module) into something that is deployable via conda

dneise commented 6 years ago

I think, in science, keeping something private is a power and time drain. Collaboration is hindered greatly. I do not know anything about deploying closed source applications.

As Max explained elsewhere, any pip installable project can be converted into a conda package using conda forge. I have experience with conda forge or conda packaging at all.

@Etienne12345 is the author of the protozfits reader, which is indeed written in C++, and he provided digicampipe with a precompiled package. You can see its installation instructions in the README of digicampipe (which we are currently trying to merge into ctapipe).

https://github.com/calispac/digicampipe

You can download the tarball and look at it. There are some python files in the tar-ball which can be looked at. But there are also precompiled shared objects which might not be scrambled, and can be disassembled, but they are certainly not easy to read.

From the installation instructions you see, this tar-ball depends on python 3.5. As the rawzfitsreader.cpython-35m-x86_64-linux-gnu.so was build for python3.5 a python3.6 will refuse to import the file.

I have no experience with deploying precompiled lapplications to end-users, hence I do neither know how to make the protozfitsreader package py3.5 and py3.6 compatible, neither do I know if these libraries would work on Apples as well as Linux boxes. I personally find it easier to deploy source code and compile the stuff on the end-users Box, but again, just due to my lack of expertise with precompiled apps. Etienne is to my knowledge currently working on a conda package for the protozfitsreader, so he might have more detailed information about how to keep the source code a secret but still provide a conda package.

watsonjj commented 6 years ago

I understand your arguments against private software. But unfortunately, CTA SVN is where the software lives, and that won't change. If we can come up with a method to either find a way that Travis CI can access and install the dependency libraries from the SVN, or produce a deployable package of the dependency libraries, then this problem is solved.

dneise commented 6 years ago

I have a question here, while this is certainly true:

CTA SVN is where the software lives

And I know the CTA SVN is private, but I am wondering if the authors of the software you are referring to are actually enforcing a closed source policy or if the code is "private by accident".

For example I know that large parts of hessio are published under LGPLv2.1 which is to my understand a so called "copyleft" license, which states that any derivative of the licensed work must also "inherit" that license. I am no copyright lawyer, so I might be entirely wrong here, but I assume there is a good chance that the fact that CTA software is being developed in a closed and private envirenment, does not mean that the software cannot be released to the public, for the sake of peer review or testing or so.

Ultimately I do not want to point at licenses, but I'd like to understand the original authors intention. Do they simply use the private SVN because it is there, or because the actively want to protect their intellectual property?

That all being said (sorry for being so long winded, but this matter agitates me)

If we can come up with a method to either find a way that Travis CI can access and install the dependency libraries from the SVN

There is a way to store confidential information encrypted in a public travis.yml file. So I strongly assume one can let travis checkout and build third party dependencies from a private SVN.

https://docs.travis-ci.com/user/encryption-keys/

However I have hardly any experience with this.

watsonjj commented 6 years ago

Thanks for all the help so far.

I have created #610, which contains the guidelines I advise for creating a new data format reader in the documentation.

As stated in my PR, I have still to add the installation for the target reader dependencies to the .travis.yml file, but that is my intention, using the approach you described to "build third party dependencies from a private SVN".

dneise commented 6 years ago

I think this can be closed?

cta-observatory / ctapipe

io: reader for fits.fz files containing protobufs compressed with CTA compression #590