Open stickelg opened 3 years ago
Hmm, this is pretty mysterious. If it's still an issue for you, would it be possible for you to share a copy of your Coverage.xml and CoverageLog-*.xml files?
Sorry for the delay in responding. Unfortunately, I am unable to share the coverage files due to my company's policies around IP.

At first this looked like a php/apache configuration issue in the container setup, which was hard to debug because the relevant logs are disabled by default in the docker configuration. After setting a reasonable php/apache configuration the coverage logs did upload, but only for a very short period of time. Uploads can now hit the apache keepalive setting, which is set to 300 (5 minutes), and after that some coverage logs continue to fail to upload.

This appears to be happening in the upload handler: once it has the file, it struggles to parse it and write it to the database before the script or keepalive timeout is hit. The database tables are also rather large in row count, which may be hampering the upload process. For instance, the row counts are as follows:
coverage: 6,273,672
coveragefile: 4,475,465
coveragefilelog: 2,065,516
We currently run coverage reports nightly, uploading about 40 files. With the current behavior this would take 2 to 3 hours to upload, except that the python script doing the uploads times itself out after 2 minutes with the following error:
Error message was: Operation too slow. Less than 1 bytes/sec transferred the last 120 seconds
If the coverage files could be uploaded, return a 200 response, and then be parsed and submitted to the database later as a sub-process or background task, the uploads would succeed.
While debugging this issue we tried changing the autoremoveMaxBuilds option from 500 to 100, and this appeared to do nothing in terms of database cleanup. Running the cleanup and compress scripts on the upgrade.php page also provided little reduction in database size.
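(For anyone trying to confirm whether cleanup is actually reclaiming anything, per-table row counts and on-disk size can be compared before and after with something like the sketch below. It assumes MySQL/MariaDB and a schema named cdash; on PostgreSQL you would use pg_total_relation_size instead, and table_rows is only an estimate for InnoDB.)

```sql
-- Sketch: row counts and approximate on-disk size of the largest coverage
-- tables, for comparing before and after cleanup. Assumes MySQL/MariaDB;
-- 'cdash' is a placeholder schema name.
SELECT table_name,
       table_rows,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM information_schema.tables
WHERE table_schema = 'cdash'
  AND table_name IN ('coverage', 'coveragefile', 'coveragefilelog')
ORDER BY size_mb DESC;
```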
Have there been any similar use cases to what I am describing, and/or are there scripts to clean up older builds from the database? When testing in a secondary development instance without any build history, the uploads work successfully and complete before any timeouts. However, we would prefer to keep the build history.
> This appears to be happening in the upload handler: once it has the file, it struggles to parse it and write it to the database before the script or keepalive timeout is hit.
That's a good clue. If you haven't done so already, I recommend configuring CDash to parse submissions asynchronously by setting
$CDASH_ASYNCHRONOUS_SUBMISSION = true;
in app/cdash/config/config.local.php.
> While debugging this issue we tried changing the autoremoveMaxBuilds option from 500 to 100, and this appeared to do nothing in terms of database cleanup. Running the cleanup and compress scripts on the upgrade.php page also provided little reduction in database size.
CDash's default configuration does not delete old builds. Perhaps we should change that. In the meantime, here's the variable to set to turn this on (again in config.local.php):
$CDASH_AUTOREMOVE_BUILDS = true;
One way to check whether this is working as intended is to find your oldest build and see if it matches what you would expect based on your autoremove timeframe.
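For example, a query along these lines (a sketch assuming the default schema, where the build table has name and starttime columns) should show the oldest build still present:

```sql
-- Sketch: find the oldest build still in the database and compare its
-- start time against the configured autoremove timeframe.
SELECT id, name, starttime
FROM build
ORDER BY starttime ASC
LIMIT 1;
```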
This took some triage to get working. Our cdash server was upgraded from an older pre-docker version, so the database had a lot of older records (over 12,000) in the submissions table that prevented new uploads from ever being reached, even after adding additional workers, because the older files no longer existed in the container's backups directory. For now these records were simply updated to have a status of 5 with the attempts set to the maximum so they would be skipped, though they could also have been deleted.
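In case it helps anyone else, the idea was a one-off update along these lines. This is only a sketch: the table name, the status values, the maximum attempts value, and the cutoff date are all assumptions that should be verified against your own schema first.

```sql
-- Sketch: mark stale submission records as done (status 5) with attempts
-- maxed out so the workers stop retrying files that no longer exist in the
-- backups directory. Table name, status/attempts values, and the cutoff
-- date are assumptions; verify against your schema before running.
UPDATE submission
SET status = 5,
    attempts = 5
WHERE status = 0
  AND created < '2022-01-01';
```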
However, this was still not enough to get the uploads to work: processsubmissions.php was failing in the container. First, it would fail because the config used localhost, which does not work inside the container; the base URL needed to be used instead, so we changed the config accordingly. This was also preventing build cleanup from happening.
However, this then failed with the following in the docker log:
[2022-02-08 22:05:03] production.ERROR: Fatal error:Cannot redeclare AcquireProcessingLock() (previously declared in /home/kitware/cdash/app/cdash/include/submission_functions.php:23) {"function":"/home/kitware/cdash/app/cdash/include/submission_functions.php (23)","project_id":"2"}
This was fixed by updating processsubmissions.php and do_submit.php to both use include_once for the submission_functions.php include. I see this was already discussed in issue #987, with other suggestions for a remedy.
After that, some records in the submissions table with a status of 1 would still fail to upload because they had errored previously, so they had to be set back to 0. Otherwise they would hang on the monitor.php page forever.
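Something like the following captures the idea (a sketch only; the table name and status values are assumed, and in practice the WHERE clause should be scoped to the affected records):

```sql
-- Sketch: re-queue submissions that were left marked in-progress (status 1)
-- after an earlier error so that processsubmissions.php will pick them up
-- again. Table name and status values are assumed from the description above.
UPDATE submission
SET status = 0
WHERE status = 1;
```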
To note: to avoid losing uploads with asynchronous submissions, we mapped a local directory into the container so the files persist. Since the docker-compose setup needs to be re-run with every config change, there is otherwise a high probability these files could be lost.
Also, after
$CDASH_AUTOREMOVE_BUILDS = '1';
was set, some files would fail to parse, possibly due to resource starvation from the current number of builds in the database.
This required another round of database modification and a re-run of processsubmissions.php for everything to finally be uploaded and for the monitor.php page to reflect that.
One thing we have noticed with the asynchronous uploads is that uploaded files appear as numerous copies of the same file with different timestamps appended, and some of the files are then further renamed to have single quotes surrounding the name. Is this normal behavior?
When clicking to view coverage results and drilling down into individual file results, not all files are present in the database, resulting in a blank page with only the number 1 as the first line marker at the file level. Looking in the database with the following query, here is a typical count of the missing files:
select buildid,
       count(*) as total,
       sum(case when f.file is null then 1 else 0 end) as no_file_total,
       sum(case when f.file is null then 0 else 1 end) as file_total
from coverage c
left join coveragefile f on c.fileid = f.id
group by buildid
order by buildid desc
limit 10
buildid,total,no_file_total,file_total
2843,799,0,799
2840,4490,3691,799
2837,4492,3493,999
2830,7969,6966,1003
2829,199,0,199
2826,4478,4279,199
2824,899,0,899
2822,4478,3579,899
2817,4478,3479,999
2815,999,0,999
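To see which files are affected for a given build, a variation of the same query can list the missing entries. This is only a sketch: the fullpath column and the example buildid are assumptions.

```sql
-- Sketch: list coverage rows for one build whose file contents never made it
-- into coveragefile. The buildid is an example value taken from the output
-- above, and the fullpath column name is assumed; fullpath will be NULL if
-- the coveragefile row itself is missing.
SELECT c.fileid, f.fullpath
FROM coverage c
LEFT JOIN coveragefile f ON c.fileid = f.id
WHERE c.buildid = 2840
  AND f.file IS NULL
LIMIT 20;
```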
Is there a resource limit on the files to be uploaded, or some configuration needed to support this consistently? We are currently running cdash version v3.0.3-4-g3507be45. The installation method is docker on an AWS EC2 instance (t3.xlarge, 150GB EBS, 13% utilized) with RDS (db.m5.xlarge). Neither instance looks to be taxed resource-wise. I also do not see any logs being created by cdash that would help triage the errors.