Hi, this message "Angles zones are [0,5,6,7,8,9,10,] !!!,Check coherency between metadata and input zone mask source" usually corresponds to an ill-formed product (the L1C is not complete). It is due to ESA, and MAJA does not find the information it needs. We had this kind of error some years ago; it is not good news if it comes back.
Could you try to remove this product from the list of products to be processed and process the next one?
Dear Olivier, thanks, that helps. I hacked a test case to jump over that product, and processing continued correctly. Actually, I think I also had that error last year and simply ignored it; I did not necessarily need those products. But last year MAJA did not loop and fill the hard disk.
This is one of the recent errors:
[E] vns::Business::ERROR: ComputeScatteringCorrectionImageFilter(0x18791900): For band id '0' (nb channel='4') ' no miniLUT has been generated for this angle zone (detector) : '11'. Angles zones are [0,12,] !!!,Check coherency between metadata and input zone mask source. [vnsComputeScatteringCorrectionImageFilter.txx:ThreadedGenerateData:238]
The bad news is that, in case of this error, MAJA still loops and spams the HDD (terabytes of PMC_LxREPT.EEF).
Any ideas on how to resolve this issue, or at least how to detect the bad products before MAJA hangs?
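For illustration, a minimal sketch (not part of MAJA; the PID/path arguments and the 1 GiB threshold are assumptions) of an external watchdog that detects the runaway case by polling the size of PMC_LxREPT.EEF and terminating the maja process once it grows past a limit:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <filesystem>
#include <thread>
#include <signal.h>
#include <sys/types.h>

// Watchdog sketch (C++17/POSIX): terminate maja if the report file explodes.
int main(int argc, char** argv) {
    if (argc != 3) {
        std::fprintf(stderr, "usage: %s <maja-pid> <path-to-PMC_LxREPT.EEF>\n", argv[0]);
        return 1;
    }
    const pid_t pid = static_cast<pid_t>(std::atol(argv[1]));
    const std::filesystem::path report(argv[2]);
    const std::uintmax_t limit = 1ull << 30;  // 1 GiB: far beyond any sane report size

    while (kill(pid, 0) == 0) {  // while the maja process is still alive
        std::error_code ec;
        const std::uintmax_t size = std::filesystem::file_size(report, ec);
        if (!ec && size > limit) {
            std::fprintf(stderr, "report grew to %ju bytes, terminating maja\n",
                         static_cast<uintmax_t>(size));
            kill(pid, SIGTERM);
            return 2;
        }
        std::this_thread::sleep_for(std::chrono::seconds(10));
    }
    return 0;
}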
Hi,
As Olivier said, we encountered this case some time ago, but it hasn't resurfaced since then. Especially the fact that the log file explodes to multiple TB is problematic. I will see with the devs what needs to be modified inside Maja in order to counter at least this latter behavior. Not sure whether we will be able to output the product, though.
Where did you download your L1C data from - PEPS/SciHub?
Kind regards, Peter
The products are from SciHub.
Yesterday I updated to MAJA version 3.3.2 (from 3.3.0). Have there been any changes that might influence the problem?
Many thanks for the help! Johann
I am not sure if v3.3.2 will solve it, but it's definitely worth a try. I will be using the most recent version, v3.3.5, to recreate the error on our platform. So far I have all the necessary info; should I need anything else, I will contact you here.
In the meantime, I am happy to hear about your results with the more recent version!
Peter
OK, fine, I will drop a note in case I have news. 3.3.5 is not officially released, is it? At least there is no TM version, which is the newer format, I read. Are there any changes from 3.3.2 to 3.3.5 that might help?
I just realized that 3.3.5 [TM] was still masked in the list of downloads... It should appear now. Thank you for reminding me. Apart from that: no, I don't think there's a relevant difference between the versions, given that most changes from 3.3.0 to 3.3.5 concerned minor patches for problems related to Venus products.
Peter
OK, fine, that means there is no need to update immediately. If I can help, or you need input, just drop a note.
Thx, Johann
Hi Johann,
I ran multiple tests with the same inputs during the night, and in none of the cases did maja loop indefinitely. Maja does fail with the error message described, but that is - as mentioned earlier - out of our reach for now.
Can you please tell me what OS you're running on and under which circumstances you call maja?
Also, could you please show me the content of PMC_LxREPT.EEF once maja fails?
Kind regards, Peter
Hi Peter,
first of all, thanks for the tests. We are using:
CentOS Linux release 7.6.1810 (Core)
in a root shell on a Debian machine. We are using a slightly adapted version of 'start_maja.py'. I do not know the git commit; I cloned it about half a year ago. I only added a filter that checks whether a "L2NOTV_SENTINEL2.." product exists and skips the product in case of a reprocessing, and integrated it into our download script.
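For illustration, a minimal sketch of that skip filter's logic (the actual change was in start_maja.py; this C++ version is only illustrative, and the "L2NOTV_SENTINEL2.." prefix is matched as truncated above):

#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Sketch: skip an L1C product on reprocessing if the output folder already
// contains an entry whose name starts with "L2NOTV_SENTINEL2" (full name
// truncated in the message above).
bool alreadyMarkedNotValid(const fs::path& outputDir) {
    const std::string prefix = "L2NOTV_SENTINEL2";
    for (const auto& entry : fs::directory_iterator(outputDir)) {
        const std::string name = entry.path().filename().string();
        if (name.rfind(prefix, 0) == 0) {  // name starts with prefix
            return true;
        }
    }
    return false;
}

int main(int argc, char** argv) {
    // exit code 0 means "skip this product"
    return (argc == 2 && alreadyMarkedNotValid(argv[1])) ? 0 : 1;
}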
I also thought that the loop might come from our download script, but it seems that the maja process is not interrupted/restarted, and there is no command-line output. That's why I think the loop is within MAJA itself.
I do not have the PMC_LxREPT.EEF file; I deleted it because of disk space. But I will try to reproduce it with a broken product. It will take a little bit of time, because I think it makes sense to start from scratch and do the backward initialization before the broken product.
Thanks! Johann
Dear Peter,
I could reproduce the behavior on our machine. MAJA produces a 3.5 TB PMC_LxREPT.EEF file. I tested it with the product: S2A_MSIL1C_20200313T095031_N0209_R079_T33UYQ_20200313T102505.SAFE
top output:
3623 sen-dow+ 20 0 5582668 4.6g 76116 R 498.3 3.7 579:06.03 maja-processing
the log file:
S2A_MSIL1C_20200313T095031_N0209_R079_T33UYQ_20200313T102505.SAFE.log
the first part of the PMC_LxREPT.EEF file:
PMC_LxREPT.EEF.first_1000_lines.txt
Thanks! Johann
Hi Peter, some more news. If you have a look at the first 20 MB of the file, only the first 2000 lines are XML; the remaining megabytes are blank:
PMC_LxREPT.EEF.first_20MB.txt.tar.gz
A memory leak?
Which system do you use? Which libs?
Hi Johann,
I tried multiple systems this weekend, and I cannot reproduce this problem on my end. Neither CentOS 6 nor 7 (the same 7.6.1810 you have) shows the same behavior. Do you run start_maja with sudo rights/in a docker?
Kind regards, Peter
Dear Peter,
first of all, thanks for your investigations. Our system is not in a Docker container but in a chroot jail on a Debian host.
I tried to debug it, to find out where (and maybe in which third-party lib) it loops. For this, I manually prepared a test setup with preprocessed products, to process only the broken one. But my test failed to reproduce it: the loop did not happen, and MAJA correctly threw an exception. It seems that the loop is an edge case, a combination of environment, libs, ... I will retry the debugging experiment from scratch and drop a note.
Best regards! Johann
Dear Peter,
I found another strange behavior: the next four products had log files of zero size (no content).
Can that happen? Do you have an idea what the reason might be?
Thanks! Johann
Hi Johann,
While the log is empty, what about the L2A products themselves? Were they empty as well?
I have never encountered any of the behaviors you are showing me above - I have a slight suspicion that, as you mentioned, your environment might be the cause.
Peter
Hi Peter, sorry for the inconvenience. Reproducing the behavior with a debugger also failed. It seems to be some nasty edge case. At least I now know about the broken products and can handle them properly. I have to do a cleanup anyway; I will install a fresh system, and hopefully the loop does not come up again. Normally I use Ubuntu, but at the time I installed the system, MAJA did not work out of the box on Ubuntu, so I switched to CentOS. Which is your preferred system? What are you using normally?
Maja should work fine with both CentOS >6 and Ubuntu >16.04, which is what we use internally for testing.
If you run into any problems with Ubuntu, I recommend opening a new issue for that error.
Dear Peter,
we made a step forward: we could reproduce the loop and found the point where it loops. It seems to be a multi-threading problem. MAJA tries to write to the log file PMC_LxREPT.EEF from multiple threads using pugixml v1.5, which, according to its documentation, is not thread-safe. Could that be the case?
But let's go one step back. My colleague Chris attached a debugger to MAJA while it was looping and got the following call stack:
#1 0x00007f7a00981b10 in pugi::xml_document::save(pugi::xml_writer&, char const*, unsigned int, pugi::xml_encoding) const () from /opt/maja/3.3.2/lib/libpugixml.so.1
#2 0x00007f7a00982202 in pugi::xml_document::save_file(char const*, char const*, unsigned int, pugi::xml_encoding) const () from /opt/maja/3.3.2/lib/libpugixml.so.1
#3 0x00007f7a0c0aa806 in vns::ReportLogger::Flush() () from /opt/maja/3.3.2/lib/libMajaDataCommon.so.3.3
#4 0x00000000004063dd in vns::Loggers::SendMessageToLog(std::string const&, vns::LoggerBase::LevelType const&, bool const&, char const*, char const*, unsigned long) ()
#5 0x00007f7a10a60d6a in vns::ComputeScatteringCorrectionImageFilter<otb::VectorImage<double, 2u>, otb::Image<double, 2u>, otb::Image<double, 2u>, otb::Image<unsigned char, 2u>, otb::VectorImage<unsigned char, 2u>, otb::VectorImage<double, 2u>, vns::VectorLookUpTable<double, 3u> >::ThreadedGenerateData(itk::ImageRegion<2u> const&, unsigned int) () from /opt/maja/3.3.2/lib/libMajaAlgorithmsL2.so.3.3
#6 0x00007f7a1105d0a1 in itk::ImageSource<otb::VectorImage<double, 2u> >::ThreaderCallback(void*) () from /opt/maja/3.3.2/lib/libMajaChain.so.3.3
pugi::impl::(anonymous namespace)::node_output loops continuously; it comes from pugixml v1.5: https://github.com/zeux/pugixml/blob/v1.5/src/pugixml.cpp. Right?
It loops when it tries to write a "Message" node (xml_node_struct), which is defined as:
00 uintptr_t header;
08 xml_node_struct* parent; ///< Pointer to parent
10 char_t* name; ///< Pointer to element name.
18 char_t* value; ///< Pointer to any associated string data.
20 xml_node_struct* first_child; ///< First child
28 xml_node_struct* prev_sibling_c; ///< Left brother (cyclic list)
30 xml_node_struct* next_sibling; ///< Right brother
The node at which it loops is [0x47ee9a60] with name = 'Message'; its first_child is [0x47ee9ab0] with name = 'Date_Time'.
However, the parent of [0x47ee9ab0] is [0x47ee9a20], also a 'Message' node, whose next_sibling is [0x47ee9a60], the original 'Message' node. The sibling links thus form a cycle, so a traversal that follows next_sibling never reaches a null pointer. That's why it loops.
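For illustration, a minimal standalone sketch (not MAJA or pugixml code) of why such a corrupted sibling chain makes a serializer that walks next_sibling pointers spin forever:

#include <cstdio>

// Reduced stand-in for pugixml's xml_node_struct: only the sibling link matters here.
struct node {
    const char* name;
    node* next_sibling;
};

int main() {
    node a{"Message", nullptr};
    node b{"Message", nullptr};
    a.next_sibling = &b;
    b.next_sibling = &a;  // corrupted link, as observed in the debugger: the chain is cyclic

    // A traversal in the style of pugi::impl::node_output walks siblings until
    // next_sibling is null. With the cycle above it never terminates and keeps
    // emitting "Message" nodes - exactly the runaway PMC_LxREPT.EEF growth.
    for (node* n = &a; n != nullptr; n = n->next_sibling) {
        std::printf("<%s/>\n", n->name);
    }
}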
It seems that the log message is written from libMajaDataCommon's vns::ReportLogger::BuildDedicatedFormattedEntry. That's where we ended up, because there were no more debug symbols. But because writing XML itself is straightforward, we think it is a multi-threading problem: multiple threads trying to write to the XML DOM at the same time.
According to the documentation, writing with pugixml is not thread-safe:
3.4. Thread-safety guarantees
Almost all functions in pugixml have the following thread-safety guarantees:
- it is safe to call free (non-member) functions from multiple threads
- it is safe to perform concurrent read-only accesses to the same tree (all constant member functions do not modify the tree)
- it is safe to perform concurrent read/write accesses, if there is only one read or write access to the single tree at a time
Concurrent modification and traversing of a single tree requires synchronization, for example via reader-writer lock. Modification includes altering document structure and altering individual node/attribute data, i.e. changing names/values.
The only exception is set_memory_management_functions; it modifies global variables and as such is not thread-safe. Its usage policy has more restrictions, see Custom memory allocation/deallocation functions.
In our case, multiple active threads loop in pugi::impl::(anonymous namespace)::node_output, which is one more hint towards a multi-threading problem.
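For comparison, a minimal sketch (hypothetical names, not MAJA's actual code) of the synchronization the pugixml documentation calls for: one mutex guarding every modification and every save of the shared report document:

#include <mutex>
#include <pugixml.hpp>

// Hypothetical logger sketch: one shared document, one mutex serializing all accesses.
class ReportLoggerSketch {
public:
    void Append(const char* dateTime, const char* text) {
        std::lock_guard<std::mutex> lock(m_mutex);  // only one writer may touch the tree
        pugi::xml_node msg = m_doc.append_child("Message");
        msg.append_child("Date_Time").text().set(dateTime);
        msg.append_child("Text").text().set(text);
    }

    void Flush(const char* path) {
        std::lock_guard<std::mutex> lock(m_mutex);  // save_file traverses the tree, so it counts as an access too
        m_doc.save_file(path);
    }

private:
    pugi::xml_document m_doc;
    std::mutex m_mutex;
};

With such a guard, two ThreadedGenerateData threads could no longer interleave their append_child calls and corrupt the sibling links.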
Does that make sense? Is MAJA multi-threaded when it writes the error to the log file?
Best regards!
Johann
Hi Johann,
First of all, sorry for the late answer; here are my findings from the last weeks:
We will watch for this behavior in Maja 4, hoping that it won't show up again. Keep an eye on the project page to check when it is available - you should be able to upgrade without many changes, as the interfaces will stay the same (params, GIPP, input folder, etc.). I hope that helps you!
Kind regards, Peter
Hi Peter,
sounds good. Meanwhile, I switched from CentOS to Ubuntu 20.04 and Python 3, without any trouble. I also improved my Python script to avoid retrying broken products multiple times. That seems to have minimized the chance of looping; at least it has not shown up for a while.
I also integrated CAMS data, based on the provided download scripts. These scripts still seem to be in Python 2.7 style (print and one or two other issues). Should I prepare pull requests, or is someone else already working on it?
Thanks for all the debugging sessions and the discussion :-) Best regards, Johann
Good to hear that it seems to be working now, and even better with CAMS! Updating the remaining scripts to Python 3 is on our list for the release, so we should be covered.
Thank you for your help and the investigation! Peter
MAJA now fails for two tiles (33UYQ, 32TQT) that were correctly processed before. It does not proceed; it loops, and the size of PMC_LxREPT.EEF continuously increases. Other tiles are still processed correctly.
MAJA Version 3.3.0
Error message: vns::Business::ERROR: ComputeScatteringCorrectionImageFilter(0x20cf500): For band id '3' (nb channel='4') ' no miniLUT has been generated for this angle zone (detector) : '4'. Angles zones are [0,5,6,7,8,9,10,] !!!,Check coherency between metadata and input zone mask source. [vnsComputeScatteringCorrectionImageFilter.txx:ThreadedGenerateData:238]
Last correctly processed product: S2A_MSIL1C_20200310T094031_N0209_R036_T33UYQ_20200310T101526.SAFE
Product where MAJA fails (and does not continue): S2A_MSIL1C_20200313T095031_N0209_R079_T33UYQ_20200313T102505.SAFE
Log-file for the product where MAJA fails:
S2A_MSIL1C_20200313T095031_N0209_R079_T33UYQ_20200313T102505.SAFE.log