CNES / MAJA

Level-2A processor used for atmospheric correction and cloud detection. The active repository is the one below; this one is kept to preserve access to the older issues.
https://gitlab.orfeo-toolbox.org/maja/maja
Apache License 2.0

MAJA loops and the size of PMC_LxREPT.EEF continuously increases (multiple TB) #55

Closed: jprankl closed this issue 4 years ago

jprankl commented 4 years ago

For two tiles (33UYQ, 32TQT), which were processed correctly before, MAJA now fails. It does not proceed: it loops and the size of PMC_LxREPT.EEF continuously increases. Other tiles are still processed correctly.

MAJA Version 3.3.0
Error message: vns::Business::ERROR: ComputeScatteringCorrectionImageFilter(0x20cf500): For band id '3' (nb channel='4') ' no miniLUT has been generated for this angle zone (detector) : '4'. Angles zones are [0,5,6,7,8,9,10,] !!!,Check coherency between metadata and input zone mask source. [vnsComputeScatteringCorrectionImageFilter.txx:ThreadedGenerateData:238]

Last correctly processed product: S2A_MSIL1C_20200310T094031_N0209_R036_T33UYQ_20200310T101526.SAFE
Product where MAJA fails and does not continue any further: S2A_MSIL1C_20200313T095031_N0209_R079_T33UYQ_20200313T102505.SAFE

Log-file for the product where MAJA fails:
S2A_MSIL1C_20200313T095031_N0209_R079_T33UYQ_20200313T102505.SAFE.log

olivierhagolle commented 4 years ago

Hi, this message "Angles zones are [0,5,6,7,8,9,10,] !!!,Check coherency between metadata and input zone mask source" usually corresponds to an ill-formed product (the L1C is not complete). It comes from ESA's side: MAJA does not find the information it needs. We had this kind of error some years ago; it is not good news if it comes back.

Could you try to remove this product from the list of products to be processed and process the next one?

jprankl commented 4 years ago

Dear Olivier, thanks, that helps. I hacked a test case, skipped that product, and processing continued correctly. Actually, I think I also had that error last year; I simply ignored it, since I did not necessarily need those products. But last year MAJA did not loop and fill the hard disk.

jprankl commented 4 years ago

This is one of the recent errors:
[E] vns::Business::ERROR: ComputeScatteringCorrectionImageFilter(0x18791900): For band id '0' (nb channel='4') ' no miniLUT has been generated for this angle zone (detector) : '11'. Angles zones are [0,12,] !!!,Check coherency between metadata and input zone mask source. [vnsComputeScatteringCorrectionImageFilter.txx:ThreadedGenerateData:238]

The bad news is that, in case of an error, MAJA still loops and spams the HDD (TBs of PMC_LxREPT.EEF).

Any ideas on how to resolve this issue, or at least how to detect the bad products before MAJA hangs?
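
For now, the only pre-screen I can think of is to compare the detector ids announced per band in the tile metadata (MTD_TL.xml), along these lines (just a sketch; the "all bands should list the same detectors" heuristic is my own assumption and may not catch every broken product):

# check_l1c_detectors.py - rough pre-screen for a suspicious L1C product.
# It only reads the Viewing_Incidence_Angles_Grids entries of the tile metadata;
# the heuristic below (all bands should announce the same detector ids) is my
# own assumption, not something MAJA provides.
import glob
import os
import sys
import xml.etree.ElementTree as ET

def detector_ids_per_band(safe_dir):
    """Map bandId -> set of detectorIds announced in GRANULE/*/MTD_TL.xml."""
    mtd = glob.glob(os.path.join(safe_dir, "GRANULE", "*", "MTD_TL.xml"))[0]
    per_band = {}
    for grid in ET.parse(mtd).getroot().iter("Viewing_Incidence_Angles_Grids"):
        per_band.setdefault(grid.get("bandId"), set()).add(grid.get("detectorId"))
    return per_band

if __name__ == "__main__":
    per_band = detector_ids_per_band(sys.argv[1])          # path to the .SAFE folder
    all_detectors = set().union(*per_band.values())
    for band in sorted(per_band, key=int):
        missing = all_detectors - per_band[band]
        note = "  <-- missing detectors {}".format(sorted(missing, key=int)) if missing else ""
        print("band {:>2}: detectors {}{}".format(band, sorted(per_band[band], key=int), note))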

petket-5 commented 4 years ago

Hi,

As Olivier said, we encountered this case some time ago, but since then it hadn't resurfaced. Especially the fact that the logfile explodes to multiple TB is problematic. I will check with the devs what needs to be modified inside Maja to counter at least this latter behavior. I am not sure whether we will be able to output the product, though.
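
In the meantime, an external guard around the maja call would at least keep the disk from filling up, along these lines (just a sketch, not part of Maja; the report path, the size limit and the maja command are placeholders):

# maja_guard.py - terminate the maja run if PMC_LxREPT.EEF grows beyond a limit,
# so a looping run cannot fill the disk. All paths and limits are placeholders.
import os
import subprocess
import time

REPORT = "/data/work/PMC_LxREPT.EEF"   # wherever the report is written in your setup
LIMIT = 5 * 1024 ** 3                  # 5 GB, far larger than any sane report file
POLL = 60                              # seconds between size checks

def run_guarded(cmd):
    """Run the given command and kill it if the report file explodes."""
    proc = subprocess.Popen(cmd)
    while proc.poll() is None:
        time.sleep(POLL)
        if os.path.exists(REPORT) and os.path.getsize(REPORT) > LIMIT:
            print("PMC_LxREPT.EEF exceeded {} bytes, terminating maja".format(LIMIT))
            proc.terminate()
            try:
                proc.wait(timeout=120)
            except subprocess.TimeoutExpired:
                proc.kill()
            return False
    return proc.returncode == 0

# usage, with whatever arguments start_maja normally builds:
# run_guarded(["/opt/maja/3.3.2/bin/maja"] + maja_arguments)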

Where did you download your L1C data from - PEPS/SciHub?

Kind regards, Peter

jprankl commented 4 years ago

The products are from SciHub.

Yesterday I updated to MAJA version 3.3.2 (from 3.3.0). Have there been any changes that might influence the problem?

Many thanks for the help! Johann

petket-5 commented 4 years ago

I am not sure if v3.3.2 will solve it, but it's definitely worth a shot. I will use the most recent version, v3.3.5, to recreate the error on our platform. So far I have all the info I need; should I need something else, I will contact you here.

In the meantime, I am happy to hear about your results with the more recent version!

Peter

jprankl commented 4 years ago

o.k. fine, I will drop a note in case I have news. 3.3.5 is not officially released, is it? At least there is no TM version, which is the newer format, as I read. Are there any changes from 3.3.2 to 3.3.5 that might help?

petket-5 commented 4 years ago

I just realized that 3.3.5 [TM] was still hidden in the list of downloads... It should appear now. Thank you for reminding me. Apart from that: no, I don't think there is a relevant difference, given that most changes from 3.3.0 to 3.3.5 were minor patches for problems related to Venus products.

Peter

jprankl commented 4 years ago

o.k. fine, that means there is no need to update immediately. If I can help or you need input, just drop a note.
Thx, Johann

petket-5 commented 4 years ago

Hi Johann,

I ran multiple tests with the same inputs during the night, and in none of the cases did Maja loop indefinitely. Maja does fail with the error message described, but that is, as mentioned earlier, out of our reach for now.

Can you please tell me what OS you're running on and under which circumstances you call maja? Also, could you please show me the content of PMC_LxREPT.EEF once maja fails?

Kind regards, Peter

jprankl commented 4 years ago

Hi Peter,

first of all, thanks for the tests. We are using:
CentOS Linux release 7.6.1810 (Core)
in a root shell on a Debian machine. We use a slightly adapted version of 'start_maja.py'; I do not know the exact git commit, I cloned it about half a year ago. I only added a filter that checks whether an "L2NOTV_SENTINEL2.." product already exists and skips the product in case of a reprocessing, and I integrated it into our download script.
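
Roughly, the filter does no more than this (simplified sketch; the output directory and the exact naming pattern are placeholders for what is on the real system):

# Skip an L1C product if a matching *L2NOTV* result already exists, i.e. it
# already failed once and we are only reprocessing. Paths/patterns are examples.
import glob
import os

L2_DIR = "/data/maja/L2"   # placeholder for the real L2 output directory

def already_not_valid(l1c_name, l2_dir=L2_DIR):
    # e.g. S2A_MSIL1C_20200313T095031_... -> date tag "20200313"
    date_tag = l1c_name.split("_")[2][:8]
    pattern = os.path.join(l2_dir, "*L2NOTV*{}*".format(date_tag))
    return bool(glob.glob(pattern))

# in the download/processing loop:
# if already_not_valid(product):
#     continue   # do not feed a known-bad product to MAJA again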

I also thought that the loop might come from our download script, but it seems that the maja process is not interrupted or restarted, and there is no command-line output. That's why I think the loop is within maja itself.

I do not have the PMC_LxREPT.EEF file anymore; I deleted it because of disk space. But I will try to reproduce it with a broken product. It will take a little time, because I think it makes sense to start from scratch and do the backward initialization before the broken product.

Thanks! Johann

jprankl commented 4 years ago

Dear Peter,

I could reproduce the behavior on our machine. MAJA produces a 3.5 TB PMC_LxREPT.EEF file. I tested it with the product: S2A_MSIL1C_20200313T095031_N0209_R079_T33UYQ_20200313T102505.SAFE

top output:
3623 sen-dow+ 20 0 5582668 4.6g 76116 R 498.3 3.7 579:06.03 maja-processing

the log file:
S2A_MSIL1C_20200313T095031_N0209_R079_T33UYQ_20200313T102505.SAFE.log

the first part of the PMC_LxREPT.EEF file:
PMC_LxREPT.EEF.first_1000_lines.txt

Thanks! Johann

jprankl commented 4 years ago

Hi Peter, some more news. If you look at the first 20 MB of the file, only the first 2000 lines are XML; the remaining megabytes are just blank padding:
PMC_LxREPT.EEF.first_20MB.txt.tar.gz
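
Something like this quick check shows it (a sketch, run on the extracted dump):

# How much of the 20 MB dump is real XML and how much is just padding?
with open("PMC_LxREPT.EEF.first_20MB.txt", "rb") as f:
    data = f.read()

content = data.rstrip(b"\x00 \t\r\n")   # strip trailing NUL bytes / whitespace
print("bytes read          :", len(data))
print("last non-blank byte :", len(content))
print("xml lines           :", content.count(b"\n") + 1)
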
A memory leak?
Which system do you use? Which libs?

petket-5 commented 4 years ago

Hi Johann,

I tried multiple systems this weekend and I cannot reproduce this problem on my end. Neither CentOS 6 nor CentOS 7 (the same 7.6.1810 you have) shows the same behavior. Do you run start_maja with sudo rights / in a docker?

Kind regards, Peter

jprankl commented 4 years ago

Dear Peter,

first of all, thanks for your investigations. Our system is not in a docker container but in a chroot jail on a Debian host.

I tried to debug it, to find out where (and maybe in which third-party lib) it loops. For this, I manually prepared a test setup with preprocessed products so that only the broken one is processed. But my test failed to reproduce the problem: the loop did not happen and MAJA correctly threw an exception. It seems the loop is an edge case, a combination of environment and libs, ... I will retry the debugging experiment from scratch and drop a note.

Bests! Johann

jprankl commented 4 years ago

Dear Peter,

I found another strange behavior: the next 4 products had log files with zero size (no content).

Can that happen? Do you have an idea about the reason?

Thanks! Johann

petket-5 commented 4 years ago

Hi Johann,

The log is empty, but what about the L2A products themselves? Were they empty as well?

I have never encountered any of the behaviors you are showing me above. I have a slight suspicion that, as you mentioned, your environment might be the cause.

Peter

jprankl commented 4 years ago

Hi Peter, sorry for the inconvenience. Reproducing the behavior with a debugger also failed; it seems to be some nasty edge case. At least now I know about the broken products and can handle them properly. I have to do a cleanup anyway, so I will install a fresh system and hopefully the loop will not come up again. Normally I use Ubuntu, but at the time I installed the system MAJA did not work out of the box on Ubuntu, so I switched to CentOS. Which is your preferred system? What do you normally use?

petket-5 commented 4 years ago

Maja should be working fine with both CentOS >6 and Ubuntu >16.04, which is what we use internally for testing.

If you have any problems with Ubuntu, I recommend opening a new issue for that error.

jprankl commented 4 years ago

Dear Peter,

we made a step forward: we could reproduce the loop and found the point where it loops. It seems to be a multi-threading problem. Maja tries to write to the log file PMC_LxREPT.EEF from multiple threads using pugixml v1.5, which, according to the documentation, does not seem to be thread-safe for concurrent writes. Could that be the case?

But let's go one step back. My colleague Chris attached a debugger to Maja while it was looping and got the following backtrace:

#1  0x00007f7a00981b10 in pugi::xml_document::save(pugi::xml_writer&, char const*, unsigned int, pugi::xml_encoding) const () from /opt/maja/3.3.2/lib/libpugixml.so.1
#2  0x00007f7a00982202 in pugi::xml_document::save_file(char const*, char const*, unsigned int, pugi::xml_encoding) const () from /opt/maja/3.3.2/lib/libpugixml.so.1
#3  0x00007f7a0c0aa806 in vns::ReportLogger::Flush() () from /opt/maja/3.3.2/lib/libMajaDataCommon.so.3.3
#4  0x00000000004063dd in vns::Loggers::SendMessageToLog(std::string const&, vns::LoggerBase::LevelType const&, bool const&, char const*, char const*, unsigned long) ()
#5  0x00007f7a10a60d6a in vns::ComputeScatteringCorrectionImageFilter<otb::VectorImage<double, 2u>, otb::Image<double, 2u>, otb::Image<double, 2u>, otb::Image<unsigned char, 2u>, otb::VectorImage<unsigned char, 2u>, otb::VectorImage<double, 2u>, vns::VectorLookUpTable<double, 3u> >::ThreadedGenerateData(itk::ImageRegion<2u> const&, unsigned int) () from /opt/maja/3.3.2/lib/libMajaAlgorithmsL2.so.3.3
#6  0x00007f7a1105d0a1 in itk::ImageSource<otb::VectorImage<double, 2u> >::ThreaderCallback(void*) () from /opt/maja/3.3.2/lib/libMajaChain.so.3.3

pugi::impl::(anonymous namespace)::node_output loops continuously; it comes from pugixml v1.5: https://github.com/zeux/pugixml/blob/v1.5/src/pugixml.cpp, right?

It loops when it tries to write a "Message" node, an xml_node_struct, which is defined as:

00  uintptr_t         header;
08  xml_node_struct*  parent;           ///< Pointer to parent
10  char_t*           name;             ///< Pointer to element name.
18  char_t*           value;            ///< Pointer to any associated string data.
20  xml_node_struct*  first_child;      ///< First child
28  xml_node_struct*  prev_sibling_c;   ///< Left brother (cyclic list)
30  xml_node_struct*  next_sibling;     ///< Right brother

The node where it loops is at [0x47ee9a60] with name = 'Message'; its first_child is [0x47ee9ab0] with name = 'Date_Time'.
The parent of [0x47ee9ab0], however, is [0x47ee9a20], also a 'Message' node, whose next_sibling is [0x47ee9a60], i.e. the original 'Message' node. That's why it loops.

It seems that the log message is written from libMajaDataCommon, vns::ReportLogger::BuildDedicatedFormattedEntry. That's where we ended up, because there are no more debug symbols below that. But since writing XML itself is straightforward, we think it is a multi-threading problem: multiple threads trying to write to the XML DOM at the same time.

According to the documentation, writing with pugixml is not thread-safe:

3.4. Thread-safety guarantees

Almost all functions in pugixml have the following thread-safety guarantees:

it is safe to call free (non-member) functions from multiple threads

it is safe to perform concurrent read-only accesses to the same tree (all constant member functions do not modify the tree)

it is safe to perform concurrent read/write accesses, if there is only one read or write access to the single tree at a time

Concurrent modification and traversing of a single tree requires synchronization, for example via reader-writer lock. Modification includes altering document structure and altering individual node/attribute data, i.e. changing names/values.

The only exception is set_memory_management_functions; it modifies global variables and as such is not thread-safe. Its usage policy has more restrictions, see Custom memory allocation/deallocation functions.

In our case, multiple active threads loop in pugi::impl::(anonymous namespace)::node_output, which is one more hint towards a multi-threading problem.

Does that make sense? Is Maja multi-threaded when it writes the error to the log file?

Best regards!
Johann

petket-5 commented 4 years ago

Hi Johann,

First of all, sorry for the late answer; here are my findings from the last weeks:

We will keep an eye on this behavior for Maja 4, hoping that it won't show up again. Keep an eye on the project page to check when it is available. You should be able to upgrade without many changes, as the interfaces will stay the same (params, GIPP, input folder, etc.). I hope that helps!

Kind regards, Peter

jprankl commented 4 years ago

Hi Peter,

sounds good. Meanwhile I switched from CentOS to Ubuntu 20.04 and Python 3 without any trouble. I also improved my Python script to avoid multiple retries of broken products; that seems to have minimized the chance of looping, at least it has not shown up for a while.
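
The change is basically just a persistent list of products that already failed once (simplified sketch; the file location and the run_maja() helper are made up):

# Keep a plain-text list of L1C products that already failed once and skip them.
import os

FAILED_LIST = os.path.expanduser("~/.maja_failed_products")

def load_failed():
    if not os.path.exists(FAILED_LIST):
        return set()
    with open(FAILED_LIST) as f:
        return {line.strip() for line in f if line.strip()}

def mark_failed(product):
    with open(FAILED_LIST, "a") as f:
        f.write(product + "\n")

# in the processing loop:
# if product in load_failed():
#     continue                      # never retry a broken L1C
# if not run_maja(product):         # run_maja() stands for the actual maja call
#     mark_failed(product)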

I also integrated CAMS data, based on the provided download scripts. These scripts still seem to be written in Python 2.7 style (print statements and one or two other issues). Should I prepare pull requests, or is someone else already working on it?

Thanks for all the debugging sessions and the discussion :-) Best regards, Johann

petket-5 commented 4 years ago

Good to hear that it seems to be working now, and even better with CAMS! Updating the remaining scripts to Python 3 is on our list before the release, so we should be covered.

Thank you for your help and the investigation! Peter