4dn-dcic / tibanna

Tibanna helps you run your genomic pipelines on Amazon cloud (AWS). It is used by the 4DN DCIC (4D Nucleome Data Coordination and Integration Center) to process data. Tibanna supports CWL/WDL (w/ docker), Snakemake (w/ conda) and custom Docker/shell command.
MIT License

plot_metrics isn't producing some plots #367

Closed nhartwic closed 2 years ago

nhartwic commented 2 years ago

Basically the title. My plot_metrics calls fail to produce the main figures. For reference, here is a recent "plot_metrics" result...

https://salk-tm-logs.s3.amazonaws.com/1JYcXdBkgBlp.metrics/metrics.html

I'm currently on snakemake 7.9.0 and tibanna 1.9.2. I have tried to upgrade to tibanna 2.0.0 but something seems to be wrong with pip. See error below...

$ pip install tibanna==2.0.0
ERROR: Ignored the following versions that require a different python version: 1.9.0.0b23 Requires-Python >=3.7,<3.9; 1.9.0.0b24 Requires-Python >=3.7,<3.9; 1.9.0.0b25 Requires-Python >=3.7,<3.9; 1.9.0.0b26 Requires-Python >=3.7,<3.9; 1.9.0.0b27 Requires-Python >=3.7,<3.9; 1.9.1.0b1 Requires-Python >=3.7,<3.9; 2.0.0 Requires-Python >=3.7,<3.9
ERROR: Could not find a version that satisfies the requirement tibanna==2.0.0 (from versions: 0.8.0b1, 0.8.0b2, 0.8.0b3, 0.8.0b4, 0.8.0b5, 0.8.0, 0.8.1, 0.8.2, 0.8.3, 0.8.4, 0.8.5, 0.8.6, 0.8.7, 0.8.8, 0.9.0, 0.9.1, 0.9.2, 0.9.3, 0.9.4, 0.9.5b1, 0.9.5, 0.9.6b1, 0.9.6, 0.9.7b1, 0.9.7, 0.10.0, 0.10.1, 0.10.2, 0.11.0, 0.11.1, 0.11.2, 0.11.3, 0.12.0, 0.12.1, 0.13.1, 0.14.1, 0.15.0b0, 0.15.0, 0.15.1, 0.15.2, 0.15.3, 0.15.4, 0.15.5, 0.15.6, 0.15.7b0, 0.16.0, 0.17.0b0, 0.17.0b1, 0.17.0b2, 0.17.0b3, 0.17.0b4, 0.17.0b5, 0.17.0b6, 0.17.0, 0.17.1, 0.17.2b1, 0.17.2b2, 0.17.2, 0.17.3, 0.18.0, 0.18.1, 0.18.2, 0.18.3, 1.0.0b1, 1.0.0b2, 1.0.0b3, 1.0.0b4, 1.0.0b5, 1.0.0b6, 1.0.0b7, 1.0.0b8, 1.0.0b9, 1.0.0b10, 1.0.0b11, 1.0.0b12, 1.0.0b13, 1.0.0b14, 1.0.0b15, 1.0.0b16, 1.0.0b17, 1.0.0b18, 1.0.0b19, 1.0.0b20, 1.0.0b21, 1.0.0b22, 1.0.0b23, 1.0.0b24, 1.0.0b25, 1.0.0b26, 1.0.0b27, 1.0.0b28, 1.0.0b29, 1.0.0b30, 1.0.0b31, 1.0.0b32, 1.0.0b33, 1.0.0b34, 1.0.0b35, 1.0.0b36, 1.0.0b37, 1.0.0b38, 1.0.0b39, 1.0.0b40, 1.0.0b41, 1.0.0b42, 1.0.0b44, 1.0.0b45, 1.0.0b46, 1.0.0b47, 1.0.0b48, 1.0.0b49, 1.0.0b50, 1.0.0b51, 1.0.0, 1.0.1b0, 1.0.1b2, 1.0.1, 1.0.2b0, 1.0.2b2, 1.0.2, 1.0.3.dev0, 1.0.3.dev1, 1.0.3.dev2, 1.0.3.dev3, 1.0.3.dev4, 1.0.4b0, 1.0.4b1, 1.0.4b2, 1.0.4b3, 1.0.4b4, 1.0.4b5, 1.0.4b6, 1.0.4b7, 1.0.4b8, 1.0.4, 1.0.5, 1.0.6, 1.0.7b0, 1.0.7b1, 1.0.7b2, 1.0.7b4, 1.0.7b5, 1.0.7b6, 1.1.0, 1.1.1.dev1, 1.1.1b0, 1.1.1b2, 1.1.1b3, 1.1.1, 1.1.2.dev1, 1.1.2.dev2, 1.1.2.dev3, 1.1.2.dev4, 1.1.2.dev5, 1.1.2, 1.1.3, 1.2.0b0, 1.2.0b1, 1.2.0b2, 1.2.0, 1.2.1, 1.2.2b0, 1.2.2, 1.2.3b0, 1.2.3b1, 1.2.3b2, 1.2.3, 1.2.4b0, 1.2.4b2, 1.2.4b3, 1.2.4, 1.2.5b0, 1.2.5b2, 1.2.5b3, 1.2.5, 1.2.6b0, 1.2.6, 1.2.7, 1.2.8, 1.3.0b0, 1.3.1b0, 1.3.1, 1.4.0b0, 1.4.0b1, 1.4.0b2, 1.4.0b3, 1.4.1, 1.5.0, 1.5.1b0, 1.5.1b1, 1.5.1b2, 1.6.0, 1.6.1b0, 1.7.0, 1.7.1b0, 1.7.1b1, 1.7.1, 1.7.2b0, 1.7.2b1, 1.8.0b0, 1.8.0b1, 1.8.0, 1.8.1, 1.9.0.0b0, 1.9.0.0b1, 1.9.0.0b2, 1.9.0.0b3, 1.9.0.0b4, 1.9.0.0b5, 1.9.0.0b6, 1.9.0.0b7, 1.9.0.0b8, 
1.9.0.0b9, 1.9.0.0b10, 1.9.0.0b11, 1.9.0.0b12, 1.9.0.0b13, 1.9.0.0b14, 1.9.0.0b15, 1.9.0.0b16, 1.9.0.0b17, 1.9.0.0b18, 1.9.0.0b19, 1.9.0.0b20, 1.9.0.0b21, 1.9.0.0b22, 1.9.0, 1.9.1, 1.9.2)
ERROR: No matching distribution found for tibanna==2.0.0
willronchetti commented 2 years ago

What Python version are you using? Note in the output you provided:

 2.0.0 Requires-Python >=3.7,<3.9

I assume you are still on 3.6, which is past EOL. Try using a newer Python version? Also, the job log may be helpful here in deciphering the issue.

nhartwic commented 2 years ago

In my case, the issue was that my Python version was too new.
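For readers hitting the same pip error: the `Requires-Python >=3.7,<3.9` specifier quoted above has an upper bound as well as a lower one, so Python 3.9 and later are rejected just like 3.6. A minimal sketch of that range check follows. It is a simplification that compares only major and minor version numbers and is not how pip itself evaluates specifiers; the function name is hypothetical.

```python
import sys

# Bounds copied from the pip output above: "Requires-Python >=3.7,<3.9".
LOWER_INCLUSIVE = (3, 7)
UPPER_EXCLUSIVE = (3, 9)


def satisfies_requires_python(version, lower=LOWER_INCLUSIVE, upper=UPPER_EXCLUSIVE):
    """Return True if (major, minor) lies in the half-open range [lower, upper).

    Hypothetical helper for illustration; pip uses full PEP 440 specifier
    matching rather than this simplified tuple comparison.
    """
    return lower <= tuple(version[:2]) < upper


if __name__ == "__main__":
    current = sys.version_info[:2]
    verdict = "accepted" if satisfies_requires_python(current) else "rejected"
    print(f"Python {current[0]}.{current[1]} would be {verdict} by >=3.7,<3.9")
```

Under this check, 3.7 and 3.8 pass, while 3.6, 3.9, and 3.10 are all rejected, which matches both the "too old" and the "too new" failure cases discussed in this thread.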

Upgrading didn't resolve the plot_metrics issue. Here is a new plot_metrics output...

https://salk-tm-logs.s3.amazonaws.com/aA6bccyAeLQt.metrics/metrics.html

job log top log

willronchetti commented 2 years ago

It seems to me that your job is too short for the metrics behind those plots to be extracted from CloudWatch; if it ran longer, you would see the plots. Since the job length is very close to the CloudWatch (CW) reporting delta, you may not see any data points over that time frame (or the job finishes and makes the API call before CW has processed the data point and made it available). See these functions that pull that data. My guess is that those API calls are returning no data points because the job is very short. To verify, you could write a small script that makes the same calls and run it while your job executes.

With that said, if you go into CloudWatch and do in fact see metrics over the job time frame, we will look into it further. But it could also be the case that the data point (probably singular) just comes in too late for Tibanna to catch it. If you add a 15-minute sleep at the end of your job, I suspect the metrics will report as expected.
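The verification script suggested in this comment might look like the sketch below. It polls CloudWatch through boto3's `get_metric_statistics` call and prints how many data points exist for each metric while the job runs. Everything specific here is an assumption rather than something taken from Tibanna's code: the instance ID is a placeholder, the `System/Linux` namespace and `MemoryUtilization` metric name are a guess at what the instance-side script reports, and the function names are hypothetical. Only `AWS/EC2` / `CPUUtilization` is the standard EC2 metric.

```python
import datetime
import time


def count_datapoints(client, instance_id, namespace, metric_name,
                     minutes=15, period=300):
    """Return how many data points CloudWatch holds for one metric
    over the last `minutes` minutes."""
    end = datetime.datetime.now(datetime.timezone.utc)
    start = end - datetime.timedelta(minutes=minutes)
    response = client.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=period,  # seconds; must be a multiple of 60
        Statistics=["Average"],
    )
    return len(response.get("Datapoints", []))


def poll(client, instance_id, metrics, rounds=10, interval=60):
    """Print data point counts for each (namespace, metric) pair, repeatedly."""
    for _ in range(rounds):
        for namespace, metric_name in metrics:
            n = count_datapoints(client, instance_id, namespace, metric_name)
            print(f"{namespace}/{metric_name}: {n} data point(s)")
        time.sleep(interval)


def main():
    import boto3  # third-party; needs AWS credentials configured

    instance_id = "i-0123456789abcdef0"  # placeholder, replace with the job's instance
    metrics = [
        ("AWS/EC2", "CPUUtilization"),
        ("System/Linux", "MemoryUtilization"),  # assumed namespace and name
    ]
    poll(boto3.client("cloudwatch"), instance_id, metrics)
```

Calling `main()` while the job is running shows whether the counts stay at zero for the whole run or only lag behind. The CloudWatch client is passed in as a parameter so the counting logic can be exercised with a stub and without AWS access.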

nhartwic commented 2 years ago

Ok, this job ran for an hour:

https://salk-tm-logs.s3.amazonaws.com/eTcBZv8sWlB8.metrics/metrics.html

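...Here are all the "log_files" that were uploaded to S3 and are associated with the job...

job_eTcBZv8sWlB8.tar.gz https://github.com/4dn-dcic/tibanna/files/9193659/job_eTcBZv8sWlB8.tar.gz

...CloudWatch logs associated with the job are below...

log-events-viewer-result.eTcBZv8sWlB8.csv https://github.com/4dn-dcic/tibanna/files/9193664/log-events-viewer-result.eTcBZv8sWlB8.csv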

SooLee commented 2 years ago

Will, the CPU usage comes directly from CloudWatch, whereas the rest is sent from the instance. It looks like the Perl script that sends these metrics no longer works. It could be a problem on the AWS side, since they were planning to deprecate it.

Do you have this issue on your CGAP/4DN runs?
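To illustrate the split described in this comment: CPU utilization is published by EC2 itself, while memory and similar metrics only reach CloudWatch if something on the instance pushes them as custom metrics. The sketch below shows one way such a push could work, using boto3's `put_metric_data`. It is not Tibanna's implementation and not the Perl script mentioned above; the function names, the `System/Linux` namespace, and the `MemoryUtilization` metric name are assumptions made for illustration.

```python
def memory_utilization_percent(meminfo_text):
    """Compute used-memory percentage from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # first token is the number (kB for memory fields)
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


def push_memory_metric(client, instance_id, meminfo_text,
                       namespace="System/Linux"):  # assumed namespace
    """Publish one memory-utilization data point as a custom metric."""
    value = memory_utilization_percent(meminfo_text)
    client.put_metric_data(
        Namespace=namespace,
        MetricData=[
            {
                "MetricName": "MemoryUtilization",  # assumed metric name
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                "Value": value,
                "Unit": "Percent",
            }
        ],
    )
    return value
```

On a real instance this would be called periodically with the contents of `/proc/meminfo` and a `boto3.client("cloudwatch")`. If that instance-side step fails, the CPU plot can still appear while the other plots stay empty.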


nhartwic commented 2 years ago

Not 100% sure what you are referring to, but I don't think I use CGAP/4DN.

I assume Tibanna uses some specific image when spinning up EC2 instances that includes Tibanna and its dependencies. If the issue is with Perl, it seems like the solution is updating that image?

alexander-veit commented 2 years ago

We are developing Tibanna for the CGAP/4DN projects.

This is a strange issue. I checked a few recent 4DN runs and the metrics reports look ok. However, I found one where it is incomplete as well:

https://tibanna-output.s3.amazonaws.com/BuYiLLidEFPL.metrics/metrics.html (complete)
https://tibanna-output.s3.amazonaws.com/lDZmmG1vMVnT.metrics/metrics.html (incomplete)

Same workflow, same machine, both ran yesterday with the latest version of Tibanna for more than 5 hours, but the second one has an empty metrics.tsv and metrics_report.tsv.

nhartwic commented 2 years ago

It is possible there are multiple issues happening here. I haven't seen complete metrics on any of my jobs that I've checked. I'm not sure how many I've checked, but at least a dozen. I'm about to kick off a bunch of assemblies again so I'll check more jobs soonish.

willronchetti commented 2 years ago

Thanks for bearing with us on this. I can confirm we are also seeing this issue on our jobs (of all lengths). We're looking into it and will hopefully have a resolution soon.